1 Introduction

Dynamic stochastic optimization models have a long history in operations research, with applications in many different areas. Such models represent a sequential decision-making process whereby information is revealed in stages, and decisions are made based on the information available up to that point. There are multiple ways to solve dynamic stochastic optimization models. The choice of a particular approach depends, of course, on the structure of the problem under study. Whatever the solution method being used, an assumption often made in the literature is that the distribution of the underlying stochastic process representing the uncertainty is known, perhaps by fitting a distribution (or moments) to available data, or by constructing a scenario tree from subjective probabilities, or by postulating a model according to the type of problem, as in the case of financial models based on stochastic differential equations, to name a few examples. In some cases, even more sophisticated simulators can be used as long as they can generate random samples that accurately characterize the uncertainty.

Our goal in this paper is to depart from the common assumption of known distributions and to model the uncertainty directly from the data. This is accomplished by applying a machine learning approach to the problem. More specifically, we use a Hidden Markov Model (HMM) to learn the structure of the data. The HMM can indeed be viewed as a machine learning method in that it classifies the data according to unobservable states, and then estimates the transition probabilities between each pair of states. Since each (unobservable) state corresponds to situations where the underlying stochastic process behaves similarly, it is reasonable to assume that, conditionally on the state of the HMM, the process has a certain distribution—typically, a mixture of Gaussian distributions whereby the parameters are estimated from the data. We refer the reader to [33] for a comprehensive tutorial on HMM. In a sense, our paper complements the work of [5], who also present a data-driven approach for a class of dynamic stochastic optimization problems but based on different machine learning techniques. While that work allows explicitly for the presence of features in the data, it requires the data to be independent and identically distributed, whereas in our case we are more interested in the situation where there is correlation which is captured by the HMM.

Although HMMs have been studied for decades, it appears that their use in the context of optimization models is limited. In the case of dynamic convex stochastic optimization models, the resulting structure of the HMM allows us to employ a variation of the now well-established stochastic dual dynamic programming (SDDP) algorithm to solve such problems. The SDDP method was proposed in the seminal paper of [27] for problems where the uncertainty is stagewise independent, but it was later extended to the case where there is Markovian dependence, although such models require the user to input the transition probability matrices; see, for instance, [22, 24, 31]. In a nutshell, the algorithm consists of alternating forward and backward steps: the forward step generates a sample path of the process to obtain a corresponding sequence of solutions, and the backward step successively approximates the value function in each period by linear cuts. SDDP has two particularly attractive features: first, it can solve large-scale stochastic dynamic problems, and second, it provides a policy rather than just a numerical solution. Such features are accomplished through the development of a sequence of piecewise-linear value function approximations in each stage. Once the approximations are built, one can evaluate the optimal policy for an arbitrary realization of the stochastic process by solving a sequence of linear programs. Models that use SDDP as a solution technique have indeed become very popular in the literature, with applications in many areas such as energy, finance and transportation.

One drawback of HMMs, however, is that since the probabilities of transition between pairs of states are estimated from the data, the solutions of the optimization model that uses such transition probabilities will be very much dependent on the observed data, leading to a problem of overfitting. Such dependence may then lead to poor out-of-sample performance of the resulting policies due to estimation errors and model misspecification. We mitigate this phenomenon by employing a distributionally robust optimization (DRO) approach that allows for variations in the estimated transition probability matrix of the HMM. Our DRO model leads to tractable formulations that do not increase the computational complexity of the model; moreover, they can be solved by a variation of SDDP.

The concern about out-of-sample performance is, unfortunately, often overlooked in the stochastic optimization literature, particularly in the case of dynamic models. Oftentimes the user simply implements the first-stage decision given by the model and then re-solves the model in every period in order to obtain new decisions. As we discuss in the paper, such a procedure can be expensive and wasteful, as it discards the value function approximations obtained in previous steps of a rolling horizon scheme. To counter that effect, we propose an extra step in the out-of-sample evaluation that allows for an improvement of the current value function approximations for updated values of the previous decisions (initial conditions of the current problem). This feature makes the policies generated by the algorithm easier to use and quicker to evaluate.

We illustrate our ideas with a dynamic asset allocation problem. Such a problem involves decision-making under uncertainty with complex characteristics, including the investor’s risk tolerance, transaction costs, and price dynamics. Building upon previous work, we propose a data-driven DRO approach that estimates an HMM for the return process and allows for ambiguity in the transition probability matrix, with the aim of enhancing out-of-sample performance. In the numerical tests, we provide a comprehensive sensitivity analysis of the robustness and risk-aversion parameters, and execute the robustness tuning procedure to select the appropriate robustness level. The results obtained for a hold-out testing dataset show that the resulting portfolio can yield excellent results, with enhanced out-of-sample performance over selected benchmarks, including the equal-weight-allocation strategy, which has been shown to be optimal under certain assumptions [16, 25] with a competitive out-of-sample (empirical) performance [9].

In summary, the contributions of this paper are the following:

  1.

    We consider a class of dynamic stochastic optimization problems with risk-aversion and, rather than specifying a distribution for the data as is typical in the literature, we apply a machine learning approach, namely a Hidden Markov Model, to learn the structure of the data, and incorporate that information into an optimization model;

  2.

    To account for the estimation and misspecification errors resulting from the HMM estimation procedure, and to avoid excessive dependence of the model on the data, we propose a distributionally robust optimization approach to the problem whereby ambiguity is allowed in the Markov transition probability matrix estimated by the HMM;

  3.

    Building on ingredients from the literature, we present a variation of the SDDP algorithm that can solve the DRO problem mentioned in (2). Moreover, we provide deterministic lower and upper bounds and prove that the gap between lower and upper bounds becomes zero after a finite number of iterations. While we present the methodology in the context of the DRO problem, the approach can be easily adapted to other settings such as the standard SDDP, or the SDDP with nested risk measures that can be linearized, such as \(\mathrm{CV@R}\).

  4.

    We propose a rolling-horizon scheme for out-of-sample evaluation of the policies generated by the algorithm that makes such policies easier to use and quicker to evaluate for a fixed robustness level. We also propose a robustness tuning procedure as a series of out-of-sample evaluation steps, whereby the robustness level with the best out-of-sample performance is selected.

  5.

    A case study for an asset allocation problem is presented, demonstrating the benefits of the proposed approach over a benchmark from the literature. For this application, we propose a novel alternative deterministic lower bound that exploits the structure of the problem.

2 A data-driven prescriptive analytics framework

We start by presenting a data-driven prescriptive analytics framework that integrates all the machine learning and optimization machinery in a consistent and efficient way to build a bridge from data to decisions. The basic building blocks of our proposed framework are (1) a predictive (machine learning) method to represent uncertainty, and (2) a prescriptive (optimization) model that takes into account estimation errors associated with the predictive model and allows for control of the risk associated with decisions. In our context, the predictive model is a Hidden Markov Model, and the prescriptive model is a distributionally robust dynamic optimization model with risk-based constraints that induces a (parameterized) level of robustness over the HMM transition probabilities given that they might be polluted with estimation errors.

In what follows, we describe these building blocks in more detail. Before that, however, we establish the notation for the dynamic stochastic optimization problem of interest. We consider a filtered probability space \((\varOmega ,\mathcal {F},\mathbb {P})\), where \(\mathcal {F}=\mathcal {F}_T\) and \(\mathcal {F}_0\subseteq \ldots \subseteq \mathcal {F}_T\). The input is represented by a stochastic process \(\{\varvec{\xi }_t\}\) with values in \(\mathbb {R}^m\) such that the sigma-algebra generated by \(\{\varvec{\xi }_s, K_s\}_{s=0}^t\) is contained in \(\mathcal {F}_t\), and a Markov chain \(\{K_t\}\) with unobservable states such that the following assumption holds:

Assumption 1

The process \(\{K_t\}\) is a hidden time-homogeneous Markov chain with finite state-space \(\mathcal {K}.\)

In other words, we assume that the state given by \(K_t\) cannot be observed or directly measured by the decision maker, but that it satisfies the Markov property

$$\begin{aligned} P\left( K_{t}=k\,|\,K_{t-1}=j,K_{t-2}=i, \ldots \right) = P\left( K_{t}=k\,|\,K_{t-1}=j\right) = P_{jk}, \end{aligned}$$

where the transition probability matrix \(P\) does not change over time.

Assumption 2

The process \(\{\varvec{\xi }_t\}\) is an observable time-dependent vector-valued stochastic process that directly affects the performance of the underlying prescriptive model. The distribution of each \(\varvec{\xi }_t\) depends only on t and on the current (unobservable) state of the Markov chain \(\{K_t\}\).

Assumptions 1 and 2 are basic for HMMs and allow for modeling different “states” of the system. For instance, in the financial model discussed in Sect. 5, the process \(\{\varvec{\xi }_t\}\) corresponds to the financial returns of each asset. A typical HMM estimation would reveal that the Markov states correspond to the market being in a “bull”, “regular”, or “bear” state. It is important to stress that such states are unobservable to the user; rather, they are learned directly from the data. In the financial example, we never directly observe the state of the market, but we can use the sequence of historical asset returns to estimate the transition probability matrix and the probability distribution of future returns conditional on each Markov state.

Assumption 2 also implies that, conditionally on each given (unobservable) state of the Markov chain \(\{K_t\}\), the underlying stochastic process \(\{\varvec{\xi }_t\}\) is stagewise independent. That is,

$$\begin{aligned}&P\left( \varvec{\xi }_{t}\in A,\, \varvec{\xi }_{t+1}\in B\,|\,K_{t}=j,\, K_{t+1}=k\right) =\qquad \qquad \\&P\left( \varvec{\xi }_{t}\in A\,|\,K_{t}=j\right) \ P\left( \varvec{\xi }_{t+1}\in B\,|\, K_{t+1}=k\right) . \end{aligned}$$
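To make Assumptions 1 and 2 concrete, the following minimal sketch simulates a sample path of such a process. It is an illustration only: the function name `sample_hmm`, the three-state transition matrix, and the Gaussian emission parameters are hypothetical, and the emissions are taken to be scalar and time-homogeneous for brevity, whereas the model above allows vector-valued, time-dependent distributions.

```python
import random

def sample_hmm(P, means, stds, T, k0=0, seed=0):
    """Simulate T steps of a scalar HMM with Gaussian emissions.

    P[j][k] is the probability of moving from hidden state j to state k.
    means[k] and stds[k] are illustrative emission parameters (assumptions).
    """
    rng = random.Random(seed)
    states, obs = [], []
    k = k0
    for _ in range(T):
        # Markov property: the next state depends only on the current state.
        k = rng.choices(range(len(P)), weights=P[k])[0]
        # Assumption 2: the observation depends only on the current hidden state.
        xi = rng.gauss(means[k], stds[k])
        states.append(k)
        obs.append(xi)
    return states, obs

# Hypothetical "bull", "regular", "bear" regimes for a single asset return.
P = [[0.90, 0.08, 0.02],
     [0.10, 0.80, 0.10],
     [0.05, 0.15, 0.80]]
means = [0.010, 0.002, -0.015]
stds = [0.01, 0.02, 0.04]
states, returns = sample_hmm(P, means, stds, T=250)
```

In an estimation setting only `returns` would be available; `states` is kept here solely to show what the decision maker does not observe.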

The feasibility set in each time period consists of \(\mathcal {F}_t\)-adapted solutions \(\mathbf {x}_t\in \mathbb {R}^n\) such that \(\mathbf {x}_{t} \in \mathcal {X}_t( \mathbf {x}_{t-1}, \varvec{\xi }_t)\), where the set \(\mathcal {X}_t( \mathbf {x}_{t-1}, \varvec{\xi }_t)\) is given by linear inequalities which may involve \(\mathbf {x}_{t-1}\) and \(\varvec{\xi }_t\). We will detail the formulation of the model shortly.

2.1 Predictive method: the hidden Markov model framework

Following standard practice in the literature, given historical data of the underlying stochastic process \(\{\varvec{\xi }_t\}\), the database is split into three parts: training, validation and test datasets. The training data are used to estimate the parameters of the HMM; for instance, the standard EM (expectation-maximization) algorithm [26] can be used to estimate means, variances, and covariances from data, as well as the nominal transition probability matrix. The validation data are used to tune some parameters of the optimization model, as described later in the paper. The final algorithm, with the tuned parameters, is then applied to the testing data.

In general, the HMM parameters estimated from data are the transition probabilities \(\hat{p}_j(k) = P\left( K_{t+1} = k \big | K_{t} = j \right) , \forall t = 0,\ldots ,T-1\), and the coefficients \(\varTheta\) associated with the conditional density function \(p(\varvec{\xi }_t \big | K_t = k \,; \varTheta )\). As the name suggests, the expectation-maximization (EM) estimation algorithm is an iterative procedure composed of two steps. Given initial values for \(\hat{p}\) and \(\varTheta\), the expectation step computes the probability of occurrence of Markov state j at time t given the uncertainty-realization trajectory \(\varvec{\xi }_{[1,T]}=\{\varvec{\xi }_1,\ldots ,\varvec{\xi }_T\}\) and the coefficients \(\varTheta\), i.e., \(P(K_{t} = j \big | \varvec{\xi }_1,\ldots ,\varvec{\xi }_T; \varTheta ), \forall t = 1,\ldots ,T, j \in \mathcal {K}\). Then, the maximization step updates the values of the transition probabilities \(\hat{p}\) and the coefficients \(\varTheta\) in order to maximize the likelihood of the observed data under the assumed hidden Markov structure. These two steps are repeated until a desired level of convergence is achieved.

Once the parameters are estimated, the HMM can then be used to classify the current state j of the Markov chain and therefore the (conditional) distribution of stochastic process \(\{\varvec{\xi }_t\}\), which due to Assumption 2 depends only on j. The current state classification is obtained by conducting statistical inference of the current state given all information so far. More specifically, we obtain from the HMM parameters the quantity \(P\left( K_t = k \big | \varvec{\xi }_t, \varvec{\xi }_{t-1}, \ldots , \varvec{\xi }_1\right) , \forall k \in \mathcal {K},\) hereinafter referred to as the posterior probability of state k at time t, which is defined as

$$\begin{aligned} P\left( K_t = k \big | \varvec{\xi }_t, \varvec{\xi }_{t-1}, \ldots , \varvec{\xi }_1\right) \ :=\ \frac{p(K_t = k, \varvec{\xi }_t, \varvec{\xi }_{t-1}, \ldots , \varvec{\xi }_1)}{\sum _{j \in \mathcal {K}} p(K_t = j, \varvec{\xi }_t, \varvec{\xi }_{t-1}, \ldots , \varvec{\xi }_1)}, \end{aligned} \tag{1}$$

where \(p(\varvec{\xi }_t, \varvec{\xi }_{t-1}, \ldots , \varvec{\xi }_1, K_t = k)\) is the joint probability density function evaluated at the observed sample path and the current Markov state being k. This joint probability density function is obtained by an iterative procedure called the HMM forward pass, see [6]. Now, we can classify the current state

$$\begin{aligned} k_t^* \in {\mathop {\hbox {argmax}}\limits _{k \in \mathcal {K}}} \;P\left( K_t = k \big | \varvec{\xi }_t, \varvec{\xi }_{t-1}, \ldots , \varvec{\xi }_1\right) \end{aligned} \tag{2}$$

as the most probable Markov state given all available information at time t.
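The posterior probability and the classification rule above can be sketched with a normalized forward recursion. This is a minimal illustration, not the estimation code used in the paper: it assumes scalar observations, Gaussian emissions with already-estimated parameters, and a known initial state distribution `init`; all names and numbers are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def filter_states(obs, P, init, means, stds):
    """Return P(K_t = k | xi_1, ..., xi_t) for every t via a normalized forward pass."""
    n = len(P)
    posteriors = []
    prior = list(init)  # distribution of the first hidden state before any data
    for x in obs:
        # Joint weight of state k and the new observation, given the past.
        joint = [prior[k] * gaussian_pdf(x, means[k], stds[k]) for k in range(n)]
        total = sum(joint)
        post = [w / total for w in joint]  # normalize over the states
        posteriors.append(post)
        # One-step prediction: push the posterior through the transition matrix.
        prior = [sum(post[j] * P[j][k] for j in range(n)) for k in range(n)]
    return posteriors

def classify(post):
    """Index of the most probable state under the posterior `post`."""
    return max(range(len(post)), key=lambda k: post[k])

# Hypothetical two-state example: state 0 is calm, state 1 is turbulent.
P2 = [[0.9, 0.1], [0.2, 0.8]]
post_path = filter_states([0.012, 0.008, -0.05], P2, [0.5, 0.5],
                          means=[0.01, -0.02], stds=[0.01, 0.03])
labels = [classify(p) for p in post_path]
```

Normalizing at every step keeps the recursion numerically stable on long samples, and the normalized weights coincide with the posterior probabilities because the normalizing constant is the same for all states.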

2.2 Basic prescriptive method: risk-constrained dynamic stochastic programming

In addition to the aforementioned constraints \(\mathbf {x}_{t} \in \mathcal {X}_t( \mathbf {x}_{t-1}, \varvec{\xi }_t)\), we also consider risk-based constraints in our optimization model. The introduction of such constraints allows us to control the risk associated with decisions. The incorporation of risk control in dynamic stochastic optimization models has been the subject of considerable work in the literature since issues such as time consistency must be taken into account; see, e.g., [37, 42] for discussions. Following [37], “a policy is time consistent if and only if the future planned decisions are actually going to be implemented.” As we shall see shortly, our risk-constrained model satisfies time consistency as it can be formulated in a recursive manner. This dynamic stochastic programming model will then be extended to a distributionally robust dynamic model in Sect. 2.3.

Before describing the risk-constrained dynamic model, we briefly recall the notion of the Conditional Value-at-Risk (\(\mathrm{CV@R}\)) risk measure defined in [35]. Given a random variable Z representing some quantity such that larger values are less favorable (for instance, losses), we write \(\mathrm{CV@R}_{\beta }\left[ Z\right] = \min _{\eta \in \mathbb {R}} \big \{\eta + \frac{1}{1-\beta }\mathbb {E}\left[ \left( Z-\eta \right) _+\right] \big \}.\)

This risk measure is concentrated on the right tail of the distribution of Z. When the random variable of interest is such that larger values are more favorable, then it is more appropriate to refer to acceptability functionals rather than to risk measures (see, e.g., [36]). We call this variable of interest “wealth” and denote it by W. Also, we denote by \(\phi _{\alpha }[W]\) the acceptability functional corresponding to \(\mathrm{CV@R}\), i.e., \(\phi _{\alpha }[W]:= -\mathrm{CV@R}_{1-\alpha }\left[ -W\right]\), which can be written as (we omit the subscript \(\alpha\) as it is fixed throughout the paper)

$$\begin{aligned} \phi [W]\ = \ \max _{z\in \mathbb {R}} \left\{ z -\frac{1}{\alpha }\mathbb {E}\left[ \left( z-W\right) _+\right] \right\} . \end{aligned} \tag{3}$$
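For a finite sample, both variational formulas can be evaluated directly. The sketch below assumes equally likely samples and uses the fact that the objective in each formula is piecewise linear with breakpoints at the sample points, so an optimizer can be found among them. The function names and the data are illustrative assumptions.

```python
def cvar(losses, beta):
    """CV@R_beta[Z] = min_eta { eta + E[(Z - eta)_+] / (1 - beta) } on equally likely samples."""
    n = len(losses)

    def objective(eta):
        return eta + sum(max(z - eta, 0.0) for z in losses) / ((1.0 - beta) * n)

    # The objective is convex and piecewise linear: a minimizer lies at a sample point.
    return min(objective(eta) for eta in losses)

def acceptability(wealth, alpha):
    """phi_alpha[W] = max_z { z - E[(z - W)_+] / alpha } on equally likely samples."""
    n = len(wealth)

    def h(z):
        return z - sum(max(z - w, 0.0) for w in wealth) / (alpha * n)

    # h is concave and piecewise linear: a maximizer lies at a sample point.
    return max(h(z) for z in wealth)

# Hypothetical wealth outcomes; with alpha = 0.2 the two worst outcomes form the tail.
wealth = [-4.0, -2.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
phi = acceptability(wealth, alpha=0.2)
minus_cvar = -cvar([-w for w in wealth], beta=0.8)
```

Here `phi` equals the average of the two worst outcomes, and `minus_cvar` agrees with it up to rounding, which illustrates the identity \(\phi _{\alpha }[W] = -\mathrm{CV@R}_{1-\alpha }\left[ -W\right]\).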

We return now to our model. In each period t, given a feasible solution \(\mathbf {x}_t\in \mathbb {R}^n\) we define a function \(g_t(\mathbf {x}_t,\varvec{\xi }_{t+1})\) to represent the “wealth” resulting from the decision \(\mathbf {x}_t\). Note that \(g_t\) depends on the random variable \(\varvec{\xi }_{t+1}\) which has not been realized yet at time t, but it does not depend on future values \(\varvec{\xi }_{t+2},\ldots , \varvec{\xi }_{T}\). To simplify the notation, let \(W_{t+1}:=g_t(\mathbf {x}_t,\varvec{\xi }_{t+1})\). Then, we apply the acceptability functional \(\phi\) defined in (3) to \(W_{t+1}\) conditionally on the current state j of the HMM, resulting in the quantity

$$\begin{aligned} \phi _{{\hat{\mathbf{p}}}_j}\left[ W_{t+1}\right] \ :=\ \max _{z\in \mathbb {R}} \left\{ z -\frac{1}{\alpha }\,\sum _{k\in \mathcal {K}}\mathbb {E}\left[ \left( z-W_{t+1}\right) _+ \big | K_{t+1} = k\right] \, \hat{p}_j(k)\right\} . \end{aligned} \tag{4}$$

Note that the expectation in (4) and hereinafter is associated with the subsequent random variable \(\varvec{\xi }_{t+1}\). With this notation, the risk-based constraint can then be expressed as

$$\begin{aligned} \phi _{{\hat{\mathbf{p}}}_j}\left[ g_t(\mathbf {x}_t,\varvec{\xi }_{t+1})\right] \ \ge \ 0. \end{aligned}$$
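As an illustration of the state-conditional functional, suppose that the conditional expectation for each next state \(k\) is replaced by an average over a finite sample of wealth outcomes drawn for that state (an assumption of this sketch). The mixture over next states then becomes a weighted sample, and the risk constraint can be checked directly. The function name and the numbers are hypothetical.

```python
def acceptability_mixture(samples_by_state, p_row, alpha):
    """Evaluate max_z { z - (1/alpha) * sum_k p_row[k] * mean((z - W)_+ | state k) }.

    samples_by_state[k] lists wealth outcomes drawn for next state k;
    p_row[k] is the transition probability from the current state to k.
    """
    # Flatten to weighted atoms: each sample of state k carries weight p_row[k] / n_k.
    atoms = [(w, p_row[k] / len(ws))
             for k, ws in enumerate(samples_by_state) for w in ws]

    def h(z):
        return z - sum(q * max(z - w, 0.0) for w, q in atoms) / alpha

    # h is concave and piecewise linear, so a maximizer lies at one of the atoms.
    return max(h(w) for w, _ in atoms)

# Hypothetical data: state 0 yields gains, state 1 yields losses.
samples = [[1.0, 2.0], [-1.0, 0.0]]
phi_risky = acceptability_mixture(samples, p_row=[0.80, 0.20], alpha=0.1)
phi_safe = acceptability_mixture(samples, p_row=[0.95, 0.05], alpha=0.1)
constraint_ok = [phi_risky >= 0.0, phi_safe >= 0.0]
```

With a 20% chance of moving to the loss state the worst 10% of outcomes are all losses and the constraint fails; with a 5% chance the tail average is positive and the constraint holds. This shows how the transition probabilities enter the feasible set.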

Finally, in each period t, given a feasible solution \(\mathbf {x}_t\in \mathbb {R}^n\) and a realization of \(\varvec{\xi }_t\), a reward of \(f_t(\mathbf {x}_t,\varvec{\xi }_t)\) is accrued. For each state j of the HMM, the optimization model is then written as

$$\begin{aligned} Q_{0}^j\ := \ &\underset{ \mathbf {x}_{0} \in \mathcal {X}_0}{\max }\,&\quad f_0(\mathbf {x}_0)+\sum \limits _{k\in \mathcal {K}}\mathbb {E}\left[ Q_{1}^k\left( \mathbf {x}_0, \varvec{\xi }_{1} \right) \big | K_{1} = k\right] \hat{p}_j(k) \end{aligned} \tag{5}$$
$$\begin{aligned}&\text{ s.t. }&\quad \phi _{{\hat{\mathbf{p}}}_j}\left[ g_0(\mathbf {x}_0,\varvec{\xi }_{1})\right] \ \ge \ 0, \end{aligned} \tag{6}$$

where \(\mathcal {X}_0\) represents linear constraints on \(\mathbf {x}_0\) and, for each \(t=1,\ldots ,T-1\) and for each state j of the HMM,

$$\begin{aligned} Q_{t}^j(\mathbf {x}_{t-1}, \varvec{\xi }_t)\ := \\ \underset{ \mathbf {x}_{t} \in \mathcal {X}_t( \mathbf {x}_{t-1}, \varvec{\xi }_t)}{\max }\,&f_t(\mathbf {x}_t,\varvec{\xi }_t)+\sum \limits _{k\in \mathcal {K}}\mathbb {E}\left[ Q_{t+1}^k\left( \mathbf {x}_t, \varvec{\xi }_{t+1} \right) \big | K_{t+1} = k\right] \hat{p}_j(k) \end{aligned} \tag{7}$$
$$\begin{aligned} \text{ s.t. } \quad \quad&\phi _{{\hat{\mathbf{p}}}_j}\left[ g_t(\mathbf {x}_t,\varvec{\xi }_{t+1})\right] \ \ge \ 0. \end{aligned} \tag{8}$$

The final-stage function \(Q_T\) is defined as

$$\begin{aligned} Q_{T}^j(\mathbf {x}_{T-1}, \varvec{\xi }_T)\ := \ &\underset{ \mathbf {x}_{T} \in \mathcal {X}_T( \mathbf {x}_{T-1}, \varvec{\xi }_T)}{\max }\,&f_T(\mathbf {x}_T,\varvec{\xi }_T) . \end{aligned} \tag{9}$$

Equations (5)–(9) define the risk-averse dynamic stochastic optimization problem we would like to solve. It is crucial to notice that, since the risk function is applied only locally in each period through the constraints \(\phi _{{\hat{\mathbf{p}}}_j}\left[ g_t(\mathbf {x}_t,\varvec{\xi }_{t+1})\right] \ge \ 0\), time consistency is ensured (see [46]); after all, the problem still has a nested form, which is a necessary condition for time consistency (see, e.g., [19, 42]). Such constraints could be used, for example, to provide a convex approximation of probabilistic constraints of the form \(\mathbb {P}\left( g_t(\mathbf {x}_t,\varvec{\xi }_{t+1}) \le 0\right) \le \alpha\).
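The link between the risk constraint and the probabilistic constraint can be sketched as follows. Write \(W = g_t(\mathbf {x}_t,\varvec{\xi }_{t+1})\) and take the convention \(\mathrm{V@R}_{\beta }[Z] = \inf \{\eta : \mathbb {P}(Z \le \eta ) \ge \beta \}\), which is an assumption of this sketch rather than a definition given in the text. Using the standard inequality \(\mathrm{CV@R}_{\beta }[Z] \ge \mathrm{V@R}_{\beta }[Z]\),

$$\begin{aligned} \phi _{\alpha }[W] \ge 0 \ &\iff \ \mathrm{CV@R}_{1-\alpha }\left[ -W\right] \le 0 \\ &\implies \ \mathrm{V@R}_{1-\alpha }\left[ -W\right] \le 0 \\ &\implies \ \mathbb {P}\left( -W \le 0\right) \ge 1-\alpha \\ &\iff \ \mathbb {P}\left( W < 0\right) \le \alpha . \end{aligned}$$

Hence every solution that satisfies the risk constraint also satisfies the chance constraint with a strict inequality inside the probability; when \(W\) has a continuous distribution, \(\mathbb {P}(W<0)=\mathbb {P}(W\le 0)\) and the two forms coincide. The converse does not hold in general, so the risk constraint is a conservative (inner) approximation.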

Observe that the policy resulting from solving the above recursive formulation requires the decision-maker to know the current state in each period, which contradicts the fact that such states are hidden. One way to circumvent this issue would be to consider the Markov posterior probabilities as state variables of the dynamic stochastic program; however, this would lead to a non-convex problem, which is generally intractable. In particular, the recent work by [10] considers a similar problem class with no risk constraints. The authors exploit the saddle function structure and provide an efficient solution algorithm for that problem class. However, the solution methodology proposed by the authors is not suitable for our risk-constrained dynamic stochastic optimization model, nor do they consider a DRO formulation. As discussed in Sect. 4, we circumvent that problem by applying a rolling horizon scheme and propose two alternative ways of using the HMM posterior probability given in Eq. (1) to obtain the first-stage decisions: (i) by using the HMM classification techniques, i.e., by taking the most likely Markov state given by Eq. (2) to determine the current Markov state; and (ii) by modifying the first-stage problem by replacing the transition probability (conditional on knowing the current state) with the posterior probability (1), which is a function only of historical return realizations.

We will make the following assumptions for the remainder of the paper (for convenience we set \(g_{T}\equiv 0\)):

Assumption 3

For any \(t=0,\ldots ,T\) and any realization of \(\varvec{\xi }_1,\ldots ,\varvec{\xi }_{t+1}\), the functions \(f_t(\cdot ,\varvec{\xi }_t)\) and \(g_t(\cdot ,\varvec{\xi }_{t+1})\) are affine.

Assumption 3 can be relaxed, for example, to the case where \(f_t\) and \(g_t\) are defined as the minimum of affine functions. Nevertheless, we keep the affine assumption for simplicity.

Assumption 4

For any \(t=1,\ldots ,T\), the set \(\mathcal {X}_t( \mathbf {x}_{t-1}, \varvec{\xi }_t)\) is non-empty for any \(\mathbf {x}_{t-1}\) feasible in stage \(t-1\), i.e., the problem has relatively complete recourse. The set \(\mathcal {X}_t( \mathbf {x}_{t-1}, \varvec{\xi }_t)\) consists of vectors \(\mathbf {x}_t \in \mathbb {R}^n\) satisfying linear constraints of the form \({\mathbf{A}}_t \mathbf {x}_t\ = \ {\mathbf{b}}_t - {\mathbf{B}}_t\mathbf {x}_{t-1}\) with \(\mathbf {x}_t \ \ge \ 0,\) where \(\varvec{\xi }_t = \{ {\mathbf{A}}_t, {\mathbf{b}}_t, {\mathbf{B}}_t \}\). Also, the set \(\mathcal {X}_0\) has the form \(\{\mathbf {x}_0 \in \mathbb {R}^n_+ | {\mathbf{A}}_0 \mathbf {x}_0\ = \ {\mathbf{b}}_0\}\).

Assumption 4 imposes a polyhedral structure on the set \(\mathcal {X}_t( \mathbf {x}_{t-1}, \varvec{\xi }_t)\), which will be useful in the developments that follow.

2.3 Extended prescriptive method: a data-driven distributionally robust dynamic model

The HMM approach described in Sect. 2.1 has the advantage of learning directly from the data. However, HMM estimated probabilities are very sensitive to changes in the training data. Such sensitivity may cause considerable instability in the optimization model, with similar observed data leading to different performances of the corresponding optimal solutions. Moreover, poorly estimated probabilities will likely lead to poor out-of-sample performance of the solutions proposed by the model. As stated in Sect. 1, our primary goal is to provide a data-driven approach for dynamic stochastic optimization problems which performs well out of sample. Therefore, it is critical to address this estimation shortcoming.

Our approach to circumvent the estimation issue is to use a distributionally robust optimization (DRO) model for the problem. The idea of DRO is to construct an ambiguity set for the distributions of the random variables of the problem and then to optimize the worst case within the ambiguity set. DRO problems have long been studied in the literature (albeit with a different terminology), starting from the seminal work of [39], then followed by [48] and later by [41, 43] and [17]. Much of the recent literature on this topic focuses on ways of constructing the ambiguity set (call it \(\mathcal {P}\)) that ensure tractability of the resulting problem. For example, in [8] the authors define \(\mathcal {P}\) as the set of distributions that have a given mean and covariance matrix. Another popular approach is to define \(\mathcal {P}\) as the set of distributions that are not “too far” from some reference distribution. Of course, such a notion requires defining an appropriate way to measure the “distance” between distributions. Several such distances exist, for instance, the Kantorovich and Wasserstein distances, the Kullback–Leibler divergence, and the chi-squared distance (and more generally phi-divergences), among others. This is a growing field with substantial current activity; we refer to [3, 4, 25, 28, 34] for some of the work in this area. The benefit of using DRO to improve out-of-sample performance can, under certain settings, be formally demonstrated; see [47].

Some recent works in the literature are directly related to the present paper as they also study DRO models for multistage stochastic programs: in [32] and [12], the authors use ambiguity sets that differ in the probability metric used (\(\chi ^2\) and Wasserstein distances, respectively), and in both cases an adaptation of SDDP is provided to solve the resulting problem. These works robustify the optimal policy against the ambiguity over the nominal stagewise independent probability distribution but neglect to consider the dynamics of the data-generating process. Our work, on the other hand, allows for time dependence through the structure rendered by the HMM, and also robustifies the optimal policy against ambiguity over the estimated transition matrix. Dealing with ambiguity only in the transition matrix is helpful since the number of points in the support (which is the number of states in the HMM) is small and thus the dimension of the DRO model is not very large. Moreover, we provide both deterministic lower and upper bounds for the objective function, as discussed in the subsequent sections, whereas those aforementioned works only provide lower bounds (for minimization problems).

In our DRO model, Eqs. (7)–(8) are replaced with the following:

$$\begin{aligned} Q_{t}^j(\mathbf {x}_{t-1},&\varvec{\xi }_t)\ = \\ \quad \underset{ \mathbf {x}_{t} \in \mathcal {X}_t( \mathbf {x}_{t-1}, \varvec{\xi }_t)}{\max }\,&f_t(\mathbf {x}_t,\varvec{\xi }_t)+\left\{ \underset{\mathbf {p}_j\in \; \mathcal {P}_{j}}{\min } \, \sum \limits _{k\in \mathcal {K}}\mathbb {E}\left[ Q_{t+1}^k\left( \mathbf {x}_t, \varvec{\xi }_{t+1} \right) \big | K_{t+1} = k\right] p_j(k)\right\} \end{aligned} \tag{10}$$
$$\begin{aligned} \text{ s.t. } \quad \quad&\left\{ \underset{\mathbf {p}_j\in \; \mathcal {P}_{j}}{\min } \, \phi _{\mathbf {p}_j}\left[ g_t(\mathbf {x}_t,\varvec{\xi }_{t+1})\right] \ \right\} \ge 0 \end{aligned} \tag{11}$$

and Eqs. (5) and (6) are also replaced accordingly. In the above model, \(\mathcal {P}_{j}\) is the ambiguity set for the distribution of the next state of the Markov chain, conditionally on the current state j, and is defined as

$$\begin{aligned} \mathcal {P}_j = \left\{ \mathbf {p}_j\in \mathbb {R}^{|\mathcal {K}|} \left| \mathbf{1}^\top \mathbf {p}_j= 1, \ d(\mathbf {p}_j,{\hat{\mathbf{p}}}_j) \le \varDelta ,\ \mathbf {p}_j\ge 0 \right. \right\} , \end{aligned} \tag{12}$$

where \(d(\mathbf {p}_j,{\hat{\mathbf{p}}}_j)\) measures the total variation distance between \(\mathbf {p}_j\) and \({\hat{\mathbf{p}}}_j\) (recall that \({\hat{\mathbf{p}}}_j\) is the vector of state-j probabilities estimated for the Markov chain), i.e.,

$$\begin{aligned} d(\mathbf {p}_j,{\hat{\mathbf{p}}}_j) := (1/2)\, \mathbf{1}^\top |\mathbf {p}_j - {\hat{\mathbf{p}}}_j|. \end{aligned} \tag{13}$$

In the above formulation, the parameter \(\varDelta\) controls the level of ambiguity allowed in the model: a value of \(\varDelta =0\) indicates that the estimated probabilities can be fully trusted, whereas a value of \(\varDelta =1\) ignores the estimated probabilities and simply optimizes with respect to the worst-case state of the HMM. Note that the use of ambiguity sets in the DRO formulation addresses model misspecification and estimation errors in the HMM transition probabilities, whereas the use of \(\mathrm{CV@R}\) aims at measuring the risk of losses with respect to the scenarios. Therefore, there is no redundancy in using both techniques.
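The effect of \(\varDelta\) on the inner minimization can be illustrated with a small sketch. For a fixed decision, let `q[k]` stand for the conditional expected future value in next state \(k\) (a hypothetical input here). Under the total variation distance, a worst-case distribution is obtained by shifting up to \(\varDelta\) of probability mass from the states with the highest values to the state with the lowest value. This greedy argument is used here only for illustration; it is not the dual linear program on which the solution method of Sect. 3.1 is based.

```python
def total_variation(p, p_hat):
    """Half the L1 distance between two probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, p_hat))

def worst_case_expectation(q, p_hat, delta):
    """Minimize sum_k q[k] * p[k] over the simplex with total_variation(p, p_hat) <= delta."""
    n = len(q)
    p = list(p_hat)
    worst = min(range(n), key=lambda k: q[k])  # state with the lowest value
    budget = min(delta, 1.0 - p[worst])        # mass that can still be shifted
    # Take mass from the most favorable states first.
    for k in sorted(range(n), key=lambda k: q[k], reverse=True):
        if k == worst or budget <= 0.0:
            continue
        moved = min(p[k], budget)
        p[k] -= moved
        p[worst] += moved
        budget -= moved
    return sum(qk * pk for qk, pk in zip(q, p)), p

# Hypothetical values per next state and nominal transition probabilities.
q = [10.0, 4.0, -2.0]
p_hat = [0.5, 0.3, 0.2]
nominal, _ = worst_case_expectation(q, p_hat, delta=0.0)
robust, p_star = worst_case_expectation(q, p_hat, delta=0.1)
fully_robust, _ = worst_case_expectation(q, p_hat, delta=1.0)
```

With these numbers the value decreases from the nominal expectation at \(\varDelta =0\) to the value of the worst state at \(\varDelta =1\), in line with the interpretation of the two extremes given above.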

The use of a distributionally robust model for the transition probabilities of the Markov chain affects both the objective function and the \(\mathrm{CV@R}\) constraint; note that the expression \(\underset{\mathbf {p}_j\in \; \mathcal {P}_{j}}{\min } \{\cdot \}\) appears in both places. Using the same worst-case probability distribution for both the objective function and the \(\mathrm{CV@R}\) constraint makes the separation between the inner and outer problems impossible [28]. Nevertheless, if we allow for two separate worst-case probability distributions (one for the objective function and the other for the constraint), the problem becomes more tractable. This “separation” can be conceptually motivated by the idea that the objective function is a “worst case” expectation while the risk constraint must be feasible for any transition probability in the ambiguity set, i.e., \(\phi _{\mathbf {p}_j}\left[ g_t(\mathbf {x}_t,\varvec{\xi }_{t+1})\right] \ \ge 0, \ \forall \mathbf {p}_j\in \; \mathcal {P}_{j}.\) For convenience, we will use the same ambiguity set for both the objective function and the constraint; however, one could use different sets (for instance, defined with different values of \(\varDelta\)) to allow for different levels of robustness in each expression. Observe also that, while other probability distance functions could be used instead of the total variation distance in (13), the choice of the total variation distance is natural in this setting, where there are only finitely many Markov states. Moreover, as we shall see later, with the total variation distance the model can be efficiently solved because the robust counterpart is a linear optimization problem.

3 Solution methodology

In this section we propose an efficient way to solve the data-driven distributionally robust dynamic model: we develop a computationally tractable dual reformulation, and then we adapt the stochastic dual dynamic programming (SDDP) algorithm to suit the proposed model. Significant modifications are needed in SDDP. In particular, we highlight the development of a deterministic lower bound (for a maximization problem), which, while related to results recently proposed in the literature, is a novel result with a practical appeal. In the following subsections we describe these steps in detail.

3.1 A tractable dual reformulation

In this section, we present a tractable formulation of (10)–(11) based on the dual of the inner minimization problems in those equations. Consider initially the inner problem in (10). For a fixed \(\mathbf {x}_{t} \in \mathcal {X}_t( \mathbf {x}_{t-1}, \varvec{\xi }_t)\), the dual formulation of the inner problem (10) is

$$\begin{aligned} \underset{\mathbf{\varvec{\theta }^-, \varvec{\theta }^+},\lambda ,\eta }{\max } \quad&\sum _{k\in \mathcal {K}}\hat{p}_j(k)(\theta _k^+-\theta _k^-) - \eta - 2\varDelta \lambda \nonumber \\ \text {s.t. } \quad&- \theta _k^- + \theta _k^+ -\eta \le \mathbb {E}\left[ Q_{t+1}^k\left( \mathbf {x}_t, \varvec{\xi }_{t+1}\right) \big | K_{t+1} = k\right] , \quad \forall k \in \mathcal {K}\nonumber \\&\theta _k^- + \theta _k^+ - \lambda = 0, \quad \forall k \in \mathcal {K}\nonumber \\&\mathbf{\varvec{\theta }^-, \varvec{\theta }^+},\lambda \ge 0. \end{aligned}$$
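As a hedged numerical illustration (not part of the original derivation), the inner minimization in (10) over the total variation ball can be solved greedily when the set of Markov states is finite: mass of at most \(\varDelta\) is moved from the states with the largest conditional expected value to the state with the smallest one. By linear programming duality, the resulting value coincides with the optimal value of (14). The function name `worst_case_expectation` and the numbers in the usage note are assumptions made for this example only.

```python
def worst_case_expectation(p_hat, values, delta):
    """Minimize sum_k p[k] * values[k] over probability vectors p
    with sum_k |p[k] - p_hat[k]| <= 2 * delta (total variation ball)."""
    p = list(p_hat)
    k_min = min(range(len(values)), key=lambda k: values[k])
    # At most delta units of mass can be moved, and no more than the
    # mass currently sitting outside the lowest-value state.
    budget = min(delta, 1.0 - p[k_min])
    p[k_min] += budget
    # Remove the same amount from the highest-value states first.
    for k in sorted(range(len(values)), key=lambda k: values[k], reverse=True):
        if k == k_min or budget <= 0.0:
            continue
        moved = min(p[k], budget)
        p[k] -= moved
        budget -= moved
    value = sum(pk * vk for pk, vk in zip(p, values))
    return value, p
```

For instance, with nominal probabilities (0.5, 0.3, 0.2) and conditional expected values (10, 4, 1), setting \(\varDelta = 0\) returns the nominal expectation, while \(\varDelta = 1\) places all mass on the state with value 1, in line with the interpretation of \(\varDelta\) given earlier.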

By using a similar approach it is possible to construct a dual formulation to write (11) in a more tractable manner. Note that from (4) we can write

$$\begin{aligned} \phi _{\mathbf {p}_j}\left[ g_t(\mathbf {x}_t,\varvec{\xi }_{t+1})\right] = \max _{z\in \mathbb {R}}\, \left\{ h(z,\mathbf {p}_j) :=z - \frac{1}{\alpha }\sum _{k \in \mathcal {K}}\mathbb {E}\left[ \left( z-g_t(\mathbf {x}_t,\varvec{\xi }_{t+1})\right) _+ \big | K_{t+1}=k\right] p_j(k)\right\} . \end{aligned}$$
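The maximization over z above can be evaluated directly on a finite sample. The sketch below assumes a list of sampled values of \(g_t\) for each Markov state, with scenario weights inside each state, and \(0<\alpha <1\). Since the sample version of h is concave and piecewise linear in z with kinks at the sampled values, it is enough to compare h at those kinks. The names `h` and `cvar_lower_tail` and the data are illustrative assumptions.

```python
def h(z, p, g_by_state, q_by_state, alpha):
    """Sample version of h(z, p): z - (1/alpha) * sum_k p[k] * E[(z - g)_+ | k]."""
    shortfall = 0.0
    for pk, g_k, q_k in zip(p, g_by_state, q_by_state):
        shortfall += pk * sum(q * max(z - g, 0.0) for g, q in zip(g_k, q_k))
    return z - shortfall / alpha


def cvar_lower_tail(p, g_by_state, q_by_state, alpha):
    """Maximize h over z by checking every kink (sampled value of g)."""
    candidates = [g for g_k in g_by_state for g in g_k]
    return max(h(z, p, g_by_state, q_by_state, alpha) for z in candidates)


# Two Markov states, two equally likely scenarios in each (made-up numbers).
g_by_state = [[-2.0, 0.0], [1.0, 3.0]]
q_by_state = [[0.5, 0.5], [0.5, 0.5]]
value = cvar_lower_tail([0.5, 0.5], g_by_state, q_by_state, alpha=0.5)
```

With these numbers the four scenarios are equally likely, so the result is the average of the two smallest outcomes. Shifting the state probabilities towards the first state lowers the value, which is the effect that the minimization over \(\mathbf {p}_j\) exploits.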

The function h is concave in z and linear in \(\mathbf {p}_j\). Thus, given that \(\mathcal {P}_j\) is compact, we can deduce from Sion’s minimax theorem [45] that \(\min _{\mathbf {p}_j\in \mathcal {P}_{j}} \max _{z\in \mathbb {R}} h(z,\mathbf {p}_j) = \max _{z\in \mathbb {R}} \min _{\mathbf {p}_j\in \mathcal {P}_{j}}h(z,\mathbf {p}_j)\) and hence it follows that for any given z, \(\min _{\mathbf {p}_j\in \mathcal {P}_{j}}h(z,\mathbf {p}_j)\) can be written as

$$\begin{aligned} \underset{\mathbf {p}_j\ge 0,\mathbf {e}}{\min } \quad&z - \frac{1}{\alpha }\sum _{k \in \mathcal {K}}\mathbb {E}\left[ \left( z-g_t(\mathbf {x}_t,\varvec{\xi }_{t+1}) \right) _+ \big | K_{t+1}=k\right] p_j(k)\nonumber \\ \text{ s.t. } \quad&p_j(k)- e_k \le \hat{p}_j(k), \quad \forall k \in \mathcal {K}&:\widetilde{\theta }_k^- \nonumber \\&p_j(k)+ e_k \ge \hat{p}_j(k), \quad \forall k \in \mathcal {K}&:\widetilde{\theta }_k^+ \nonumber \\&\sum _{k \in \mathcal {K}}\ e_k \le 2\varDelta&: \widetilde{\lambda } \nonumber \\&\sum _{k \in \mathcal {K}}\ p_j(k)= 1&:\widetilde{\eta } \end{aligned}$$

By writing the dual of (15) analogously to (14), it follows that the left-hand side of (11) can be written as the optimization problem

$$\begin{aligned} \underset{z, \widetilde{\varvec{\theta }}^-, \widetilde{\varvec{\theta }}^+,\widetilde{\lambda },\widetilde{\eta }}{\max } \quad&z+\sum _{k\in \mathcal {K}}\hat{p}_j(k)(\widetilde{\theta }_k^+-\widetilde{\theta }_k^-) - \widetilde{\eta } - 2\varDelta \widetilde{\lambda } \nonumber \\ \text{ s.t. } \quad \quad&- \widetilde{\theta }_k^- + \widetilde{\theta }_k^+ -\widetilde{\eta } \le - \frac{1}{\alpha }\mathbb {E}\left[ \left( z-g_t(\mathbf {x}_t,\varvec{\xi }_{t+1}) \right) _+ \big | K_{t+1}=k\right] , \quad \forall k \in \mathcal {K}\nonumber \\&\widetilde{\theta }_k^- + \widetilde{\theta }_k^+ - \widetilde{\lambda } = 0, \quad \forall k \in \mathcal {K}\nonumber \\&\widetilde{\varvec{\theta }}^-, \widetilde{\varvec{\theta }}^+,\widetilde{\lambda } \ge 0. \end{aligned}$$

Finally, by merging (14) with the outer maximization problem, adding the inequalities and variables of (16), and imposing that the objective function of (16) is non-negative, we obtain the single-level reformulation of (10)–(11):

$$\begin{aligned}&Q_{t}^j(\mathbf {x}_{t-1}, \varvec{\xi }_t)\ = \nonumber \\&\underset{ \begin{array}{c} \mathbf {x}_t, z, \varvec{\theta }^-, \varvec{\theta }^+, \lambda , \\ \eta , \widetilde{\varvec{\theta }}^-, \widetilde{\varvec{\theta }}^+, \widetilde{\lambda }, \widetilde{\eta } \end{array}}{\max } \quad f_t(\mathbf {x}_t,\varvec{\xi }_{t})+ \sum _{k\in \mathcal {K}}\hat{p}_j(k)(\theta _k^+-\theta _k^-) - \eta - 2\varDelta \lambda \end{aligned}$$
$$\begin{aligned} \text{ s.t. }\ \ &z + \sum _{k\in \mathcal {K}}\hat{p}_j(k)(\widetilde{\theta }^+_k-\widetilde{\theta }^-_k) - \widetilde{\eta } - 2\varDelta \widetilde{\lambda } \ge 0 \end{aligned}$$
$$\begin{aligned}&-\widetilde{\theta }^-_k +\widetilde{\theta }^+_k -\widetilde{\eta } + \frac{1}{\alpha }\mathbb {E}\left[ \left( z-g_t(\mathbf {x}_t,\varvec{\xi }_{t+1}) \right) _+ \big | K_{t+1}=k\right] \le 0, \ \forall k\in \mathcal {K} \end{aligned}$$
$$\begin{aligned}&-\theta _k^- +\theta _k^+ -\eta - \mathbb {E}\left[ Q_{t+1}^k\left( \mathbf {x}_t, \varvec{\xi }_{t+1}\right) \big | K_{t+1} = k\right] \le 0,\ \forall k \in \mathcal {K} \end{aligned}$$
$$\begin{aligned}&\quad \theta _k^- +\theta _k^+ -\lambda = 0, \ \forall k \in \mathcal {K} \end{aligned}$$
$$\begin{aligned}&\quad \widetilde{\theta }^-_k +\widetilde{\theta }^+_k -\widetilde{\lambda } = 0, \ \forall k \in \mathcal {K} \end{aligned}$$
$$\begin{aligned}&\quad \varvec{\theta }^-, \varvec{\theta }^+, \widetilde{\varvec{\theta }}^-, \widetilde{\varvec{\theta }}^+,\lambda , \widetilde{\lambda } \ge 0 \end{aligned}$$
$$\begin{aligned}&\quad \mathbf {x}_t \in \; \mathcal {X}_t(\mathbf {x}_{t-1}, \varvec{\xi }_t). \end{aligned}$$

Problem (17)–(24) is a multistage stochastic program, which is convex under Assumptions 3 and 4. It provides a conceptually tractable reformulation of (10)–(11). The word “conceptually” refers to the fact that such a model cannot be directly implemented, for two reasons: first, inequalities (19) and (20) involve expectations and, second, inequality (20) involves the unknown value function \(Q_{t+1}^k\left( \mathbf {x}_t, \varvec{\xi }_{t+1}\right)\). The first difficulty can be dealt with by employing sample average approximations. The second difficulty appears more daunting due to the curse of dimensionality. For instance, when no assumptions are made about the input process \(\{\varvec{\xi }_t\}\), there is a vast number of possible outcomes at each stage and the number of scenarios grows exponentially with the number of stages. As we shall see in Sect. 3.2, however, under Assumption 2 we can adapt the SDDP method to our setting, which allows us to approximate the value function \(Q_{t+1}^k\left( \mathbf {x}_t, \varvec{\xi }_{t+1}\right)\) by piecewise-linear functions, so that standard optimization methods can be used to solve the problem.

We can construct a sample average approximation of problem (17)–(24), which allows us to replace the expectations in (19) and (20) with averages of random realizations sampled from the “true” distributions. First, for each state \(k \in \mathcal {K}\) of the Markov chain, we draw i.i.d. samples from the conditional distribution of \(\varvec{\xi }_{t+1}\) given \(K_{t+1}=k\). We denote those samples by \(\{\varvec{\xi }^k_{t+1}(s)\}_{s\in \mathcal {S}_k}\). Next, define the probability \(q_k(s)\) of scenario s conditional on state k of the Markov chain as

$$\begin{aligned} q_k(s) \ :=\ \mathbb {P}\left( \varvec{\xi }_{t+1} = \varvec{\xi }^k_{t+1}(s)\big | K_{t+1} = k\right) . \end{aligned}$$

For instance, if the sample is generated via Monte Carlo simulation or Latin Hypercube Sampling, then we would define equally probable scenarios \(q_k(s) = 1 / |\mathcal {S}_k|\), conditionally on the Markov state \(K_{t+1} = k\). However, \(q_k(s)\) might be defined differently if another technique, such as importance sampling, is used. Moreover, we introduce variables \(y_{ks}\) for each \(k \in \mathcal {K}\) and each \(s\in \mathcal {S}_k\) in order to linearize the “plus” function in (19). Finally, the expected value function in (20) is expressed as

$$\begin{aligned} {\mathcal {Q}}^k_{\,t+1}(\mathbf {x}_{t}) := \sum _{ s \in \mathcal {S}_k} {Q}_{\,t+1}^k(\mathbf {x}_{t}, \varvec{\xi }^k_{t+1}(s)) \, q_k(s). \end{aligned}$$

3.2 Modified stochastic dual dynamic programming algorithm

With a sample approximation of model (17)–(24) at hand, the only remaining issue is dealing with the value function in constraint (20). As discussed earlier, we adapt the SDDP method for this purpose. The SDDP algorithm is mainly characterized by two steps: a forward-in-time simulation and a backward-in-time recursion. The forward step generates trial solutions that are later used in the backward step to construct cutting-plane approximations of the future value function. When there is Markovian dependency, the forward step must generate (i) a path of states of the Markov chain and (ii) sample paths of the process \(\{\varvec{\xi }_t\}\) conditionally on each sampled state of the Markov chain. Then, trial solutions are created by solving the problem with the current value function approximations at each stage using the sampled processes. It is important to mention here that, in our context, the forward steps are generated using the nominal transition probability matrix given by the HMM; as we shall see in Sect. 3.4, such a property is crucial to prove convergence of the method. The backward step uses trial solutions and goes in the opposite time direction, from \(t=T\) to \(t=1\), adding cuts to improve the outer approximation of the value function. In the context of a maximization problem, we can obtain a deterministic upper bound using the outer approximation generated by the SDDP backward procedure.
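The sampling part of the forward step can be sketched as follows. This is a minimal illustration under the assumption of a scalar process with a single Gaussian emission per Markov state (mixtures would need one more sampling layer), and all names (`sample_forward_path`, `trans`, `means`, `stds`) are hypothetical. The Markov path is drawn from the nominal transition matrix, as required in the text above.

```python
import random


def sample_forward_path(trans, means, stds, initial_state, horizon, rng):
    """Draw (states, xi): a Markov path from the nominal matrix `trans`
    and one observation per stage from the sampled state's Gaussian."""
    states, xi = [], []
    state = initial_state
    for _ in range(horizon):
        # The next state is drawn from row `state` of the nominal matrix.
        state = rng.choices(range(len(trans)), weights=trans[state])[0]
        states.append(state)
        # The observation is drawn conditionally on the sampled state.
        xi.append(rng.gauss(means[state], stds[state]))
    return states, xi


rng = random.Random(0)  # fixed seed so the sketch is reproducible
trans = [[0.9, 0.1], [0.2, 0.8]]  # made-up nominal transition matrix
states, xi = sample_forward_path(trans, [-1.0, 1.0], [0.5, 0.5], 0, 12, rng)
```

Trial solutions would then be obtained by solving the stage problems along the sampled pair `(states, xi)` with the current value function approximations.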

We remark that model (17)–(24) is not in the standard form of problems solved by SDDP, since the value function appears in constraint (20) rather than in the objective function, as is customary in the literature. A similar situation arises in the model studied by [31], albeit in a somewhat different context since that paper deals with nested risk measures. Thus, for the sake of completeness, we detail the steps and show how to construct an upper (i.e., outer) approximation for the value function. In Sect. 3.3 we will discuss how to construct a lower (inner) approximation.

Suppose we are in iteration \(\nu\) of the algorithm. Consider the sample approximation of problem (17)–(24) with constraint (20) replaced with an upper approximation \(\overline{\mathcal {Q}}_{t+1}^{\,j,\nu }(\mathbf {x}_t)\) given by linear inequalities, and denote the optimal value of the approximated problem by \(\widetilde{Q}^{j,\nu }_t( \mathbf {x}_{t-1}, \varvec{\xi }_{t})\). We will detail shortly how to construct \(\overline{\mathcal {Q}}_{t+1}^{\,j,\nu }(\mathbf {x}_t)\). Then, we have, for \(t=T-1\) to \(t=0\),

$$\begin{aligned}&\widetilde{Q}^{j,\nu }_t( \mathbf {x}_{t-1}, \varvec{\xi }_{t}):= \quad \nonumber \\&\underset{ \begin{array}{c} \mathbf {x}_t, z, \mathbf {y}, \varvec{\theta }^-, \varvec{\theta }^+, \lambda , \\ \eta , \widetilde{\varvec{\theta }}^-, \widetilde{\varvec{\theta }}^+, \widetilde{\lambda }, \widetilde{\eta }, \mathbf {u}\end{array}}{\max } \quad&f_t(\mathbf {x}_t,\varvec{\xi }_{t})+ \sum _{k\in \mathcal {K}}\hat{p}_j(k)(\theta _k^+-\theta _k^-) - \eta - 2\varDelta \lambda \end{aligned}$$
$$\begin{aligned}&\text{ s.t. } \quad&z + \sum _{k\in \mathcal {K}}\hat{p}_j(k)(\widetilde{\theta }^+_k-\widetilde{\theta }^-_k) - \widetilde{\eta } - 2\varDelta \widetilde{\lambda }\ \ge \ 0 \end{aligned}$$
$$\begin{aligned}&-\widetilde{\theta }^-_k +\widetilde{\theta }^+_k -\widetilde{\eta } + \sum _{ s \in \mathcal {S}_k}y_{ks} \; \frac{q_k(s)}{\alpha } \ \le \ 0,&\ \forall k\in \mathcal {K} \end{aligned}$$
$$\begin{aligned}&-\theta _k^- +\theta _k^+ -\eta - \overline{\mathcal {Q}}_{t+1}^{\,k,\nu }(\mathbf {x}_t) \ \le \ 0,&\ \forall k \in \mathcal {K} \end{aligned}$$
$$\begin{aligned}&\theta _k^- +\theta _k^+ -\lambda = 0,&\forall k \in \mathcal {K} \end{aligned}$$
$$\begin{aligned}&\widetilde{\theta }^-_k +\widetilde{\theta }^+_k -\widetilde{\lambda } = 0,&\forall k \in \mathcal {K} \end{aligned}$$
$$\begin{aligned}&z -g_t(\mathbf {x}_t,\varvec{\xi }_{t+1}^k (s)) - y_{ks} \ \le \ 0,&\forall k \in \mathcal {K}, \forall s \in \mathcal {S}_k \end{aligned}$$
$$\begin{aligned}&\mathbf {u}= \mathbf {x}_{t-1}&: \, {{\varvec{\pi }}}_t^j (\varvec{\xi }_t) \end{aligned}$$
$$\begin{aligned}&\mathbf {x}_t \in \mathcal {X}_t( \mathbf {u},\varvec{\xi }_{t}) \end{aligned}$$
$$\begin{aligned}&\mathbf{\varvec{\theta }^-, \varvec{\theta }^+, \widetilde{\varvec{\theta }}^-, \widetilde{\varvec{\theta }}^+, y},\lambda , \widetilde{\lambda } \ge 0. \end{aligned}$$

For \(t=T\) we have, at all iterations \(\nu\), the simpler problem

$$\begin{aligned}&\widetilde{Q}^{j,\nu }_T( \mathbf {x}_{T-1}, \varvec{\xi }_{T}):= &\underset{ \begin{array}{c} \mathbf {x}_T, \mathbf {u}\end{array}}{\max } \quad&f_T(\mathbf {x}_T,\varvec{\xi }_{T}) \end{aligned}$$
$$\begin{aligned}&\qquad \qquad \qquad\qquad \qquad \qquad\text{ s.t. }&\mathbf {u}= \mathbf {x}_{T-1} \qquad : \, {{\varvec{\pi }}}_T^j (\varvec{\xi }_T) \end{aligned}$$
$$\begin{aligned}&&\qquad\qquad\qquad\qquad\qquad\mathbf {x}_T \in \mathcal {X}_T( \mathbf {u},\varvec{\xi }_{T}). \end{aligned}$$

Note that \(\widetilde{Q}^{j,\nu }_T( \mathbf {x}_{T-1}, \varvec{\xi }_{T})=Q^{j}_T( \mathbf {x}_{T-1}, \varvec{\xi }_{T})\), i.e., there is no approximation in the last stage.

It is important to clarify the role of the auxiliary variable \(\mathbf {u}\) introduced in the problem, which appears in constraints (33), (34), (37), and (38). This variable is just a generic artifact to obtain a subgradient (see footnote 4) of the function \(\widetilde{Q}^{j,\nu }_t( \mathbf {x}_{t-1}, \varvec{\xi }_{t})\) with respect to \(\mathbf {x}_{t-1}\). From (33) and (37) we see that this subgradient is given by the dual variable \({{\varvec{\pi }}}_t^j (\varvec{\xi }_t)\). Let \({\hat{\mathbf{x}}}_t^{\nu }\) be the optimal solution of (26)–(35) (and (36)–(38) in the case \(t=T\)) generated by the forward step in iteration \(\nu\) for stage t.

In the backward step, we solve (26)–(35) (and (36)–(38) in the case \(t=T\)) for each \(j \in \mathcal {K}\) and each scenario \(\varvec{\xi }_{t}^j = \varvec{\xi }_{t}^j(s)\), \(s\in \mathcal {S}_j\), with \(\mathbf {x}_{t-1}={\hat{\mathbf{x}}}_{t-1}^{\nu }\). Let \({\pi }_{t,s}^{\,j,\nu }:= {{\varvec{\pi }}}_t^j (\varvec{\xi }_t(s))\) denote the corresponding dual variable obtained from (33) (and from (37) in the case \(t=T\)). Then, we construct the Benders cut

$$\begin{aligned} \ell _{t}^{j,\nu }(\mathbf {x}_{t-1}) := \widetilde{\mathcal {Q}}_{t}^{\,j,\nu }({\hat{\mathbf{x}}}_{t-1}^{\nu })+\left( { \overline{\pi }}_{t}^{\,j,\nu }\right) ^\top (\mathbf {x}_{t-1} - {\hat{\mathbf{x}}}_{t-1}^{\nu }) \end{aligned}$$

for the function \(\widetilde{\mathcal {Q}}_{t}^{\,j,\nu }(\cdot ):=\sum _{s \in \mathcal {S}_j} \widetilde{Q}_{\,t}^{j,\nu }(\cdot , \varvec{\xi }_{t}(s)) \ q_j(s)\), using the average dual decision vector \({ \overline{\pi }}_{t}^{\,j,\nu } = \sum _{s \in \mathcal {S}_j}{\pi }_{t,s}^{\,j,\nu } \ q_j(s)\). As discussed earlier, \({\pi }_{t,s}^{\,j,\nu }\) is a subgradient of \(\widetilde{Q}_{\,t}^{j,\nu }(\cdot , \varvec{\xi }_{t}(s))\) at \({\hat{\mathbf{x}}}_{t-1}^{\nu }\) and thus it follows that \(\ell _{t}^{j,\nu }(\mathbf {x}_{t-1}) \ge \widetilde{\mathcal {Q}}_{t}^{\,j,\nu }(\mathbf {x}_{t-1})\) for all \(\mathbf {x}_{t-1}\). Moreover, since the function \(\mathcal {Q}_{t+1}^{\,k}(\mathbf {x}_t)\) in (20) is replaced with the upper approximation \(\overline{\mathcal {Q}}_{t+1}^{\,k,\nu }(\mathbf {x}_t)\) in (29), it follows that \(\widetilde{Q}_{\,t}^{j,\nu }(\mathbf {x}_{t-1}, \varvec{\xi }_{t}(s)) \ge Q_{\,t}^{j}(\mathbf {x}_{t-1}, \varvec{\xi }_{t}(s))\) for all \(s\in \mathcal {S}_j\) and thus \(\widetilde{\mathcal {Q}}_{t}^{\,j,\nu }(\mathbf {x}_{t-1})\ge \mathcal {Q}_{t}^{\,j}(\mathbf {x}_{t-1})\).

We then update the upper approximation of the value function (so it can be used in period \(t-1\)) as

$$\begin{aligned} \overline{\mathcal {Q}}_{t}^{\,k,\nu }(\mathbf {x}_{t-1})\ :=\ \min _{i=1,\ldots ,\nu } \ \ell _{t}^{k,i}(\mathbf {x}_{t-1}), \quad \forall k \in \mathcal {K}, \end{aligned}$$

so we see that \(\overline{\mathcal {Q}}_{t}^{\,k,\nu }(\mathbf {x}_{t-1})\ge \widetilde{\mathcal {Q}}_{t}^{\,k,\nu }(\mathbf {x}_{t-1})\ge \mathcal {Q}_{t}^{\,k}(\mathbf {x}_{t-1})\) for all \(\mathbf {x}_{t-1}\) and all \(k \in \mathcal {K}\). It follows that when we solve (26)–(35) in period \(t-1\), using (40) in constraint (29) is equivalent to imposing the set of linear constraints

$$\begin{aligned} -\theta _k^- +\theta _k^+ -\eta \ - \ell _{t+1}^{k,i}(\mathbf {x}_{t})\ \le \ 0, \quad \forall k \in \mathcal {K}, \, \forall i=1,\ldots ,\nu . \end{aligned}$$

Therefore, the outer approximation of the SAA problem can be represented as model (26)–(35), with constraint (29) replaced with inequalities (41). Note also that, because of Assumption 4, the constraints given by (34) are linear in \((\mathbf {x}_t,\mathbf {u})\). Thus, since \(f_t(\cdot ,\varvec{\xi }_t)\) and \(g_t(\cdot ,\varvec{\xi }_{t+1})\) are linear by Assumption 3, model (26)–(35) is just a linear program.
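The cut construction in (39)–(40) can be sketched in a few lines, assuming the stage problems have already been solved so that their optimal values and dual variables are available. The scalar state and the function names `build_cut` and `upper_approximation` are assumptions made for the example, and the exact gradient of a known concave function stands in for the LP dual variables.

```python
def build_cut(x_hat, values, duals, probs):
    """Benders cut (39): probability-weighted value and dual at x_hat."""
    intercept = sum(q * v for q, v in zip(probs, values))
    slope = sum(q * pi for q, pi in zip(probs, duals))
    return x_hat, intercept, slope


def upper_approximation(cuts, x):
    """Outer approximation (40): pointwise minimum of all stored cuts."""
    return min(v + slope * (x - x_hat) for x_hat, v, slope in cuts)


# Stand-in for the scenario value functions: Q_s(x) = -(x - c_s)^2.
centers, probs = [0.0, 2.0], [0.5, 0.5]


def expected_value(x):
    return sum(q * -((x - c) ** 2) for q, c in zip(probs, centers))


cuts = []
for x_hat in [0.0, 1.5, 3.0]:  # trial points from hypothetical forward passes
    values = [-((x_hat - c) ** 2) for c in centers]
    duals = [-2.0 * (x_hat - c) for c in centers]  # exact gradients at x_hat
    cuts.append(build_cut(x_hat, values, duals, probs))
```

Because each cut is tangent to a concave function, `upper_approximation(cuts, x)` never falls below `expected_value(x)` and is tight at the trial points. In the actual algorithm each cut would enter the stage problem as one linear inequality of the form (41).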

3.3 Deterministic lower bound

For standard SDDP applications, one can obtain a statistical lower bound by evaluating the current policy via Monte Carlo simulation and computing an estimator of the objective function; see details in [40]. However, it is not practical to obtain a statistical objective function assessment within the distributionally robust framework (10)–(11). The issue here is that, in order to evaluate the objective function in (10), we would need to know the optimal worst-case transition probability matrix in the corresponding inner problem, but this is not possible since we only have an approximation of the value function \(Q_{t+1}^k\). Thus, if we simulate scenarios using any (suboptimal) transition probability matrix, the statistical evaluation of the objective function will not be a valid lower bound.

Our approach is to use an extended inner approximation of \(Q_{t+1}^k(\cdot )\) to construct a valid lower bound to problem (17)–(24). The standard inner approximation method uses a convex combination of evaluated trial points instead of the Benders cuts (outer approximation). This approach was first proposed by [29], who ensured the feasibility of the convex combination by pre-evaluating all vertices of the uncertainty support, e.g., a multidimensional hypercube. The contribution by [29] notwithstanding, the approach proposed in that work is not efficient in practice since the number of vertices grows exponentially with the uncertainty dimension.

Consider the expected value function \(\mathcal {Q}_{\,t+1}^j(\mathbf {x}_{t})\) defined in (25), and suppose that in iteration \(\nu\) of the algorithm we have a concave lower (inner) approximation \(\underline{\mathcal {Q}}_{\,t+1}^{j,\nu }( \cdot )\) for \(\mathcal {Q}_{\,t+1}^j( \cdot )\) given by linear inequalities. Let \(\{ {\hat{\mathbf{x}}}^i_{t}\}_{i=1,\ldots ,\nu }\) denote the solutions obtained for each time \(t \in \{1,\ldots ,T\}\) from the previous \(\nu\) forward steps of the algorithm. As in the case of the outer approximation discussed in Sect. 3.2, the algorithm goes backwards in time, from \(t=T\) until \(t=0\), and \(\underline{\mathcal {Q}}_{\,t}^{j,\nu }\) is constructed from \(\underline{\mathcal {Q}}_{\,t+1}^{j,\nu }\). For \(t=T\) we set \({\underline{\mathcal {Q}}}_{\,T}^{j,\nu }(\mathbf {x}_{T-1}):=\sum _{ s \in \mathcal {S}_j} {Q}_{\,T}^j(\mathbf {x}_{T-1}, \varvec{\xi }^j_{T}(s)) \, q_j(s)\) in all iterations \(\nu\), where \(Q^j_T(\cdot )\) is defined in (9). Let \(\mathcal {R}\) denote the set of points satisfying constraint (11), and define, for \(t<T\),

$$\begin{aligned} \widehat{Q}_{\,t}^{j,\nu }(\mathbf {x}_{t-1}, \varvec{\xi }_t):= \underset{\mathbf {x}_{t} \in \mathcal {X}_t(\mathbf {x}_{t-1}, \varvec{\xi }_t) \cap \mathcal {R}}{\max } f_t(\mathbf {x}_t,\varvec{\xi }_t) + \left\{ \underset{\mathbf {p}_j\in \; \mathcal {P}_{j}}{\min } \sum \limits _{k\in \mathcal {K}} \underline{\mathcal {Q}}_{\,t+1}^{k,\nu }(\mathbf {x}_{t})\, p_j(k)\right\} \end{aligned}$$

and \(\widehat{\mathcal {Q}}_{\,t}^{j,\nu }(\mathbf {x}_{t-1}) := \sum _{s \in \mathcal {S}_j} \widehat{Q}_{\,t}^{j,\nu }(\mathbf {x}_{t-1}, \varvec{\xi }_{t}(s)) \, q_j(s)\). Note that, since \(\underline{\mathcal {Q}}_{\,t+1}^{j,\nu }( \cdot )\) is piecewise-linear concave, it follows from Assumptions 3 and 4 that \(\widehat{Q}_{\,t}^{j,\nu }(\cdot , \varvec{\xi }_t)\) and \({\widehat{\mathcal {{Q}}}}_{\,t}^{j,\nu }(\cdot )\) are also piecewise-linear concave. Moreover, since \(\underline{\mathcal {Q}}_{\,t+1}^{j,\nu }( \cdot )\) is a lower bound for \(\mathcal {Q}_{\,t+1}^{j}( \cdot )\), by comparing (10)–(11) and (42) we see that \(\widehat{Q}_{\,t}^{j,\nu }(\mathbf {x}_{t-1}, \varvec{\xi }_t)\le Q_{\,t}^j(\mathbf {x}_{t-1}, \varvec{\xi }_t)\) and thus it follows that

$$\begin{aligned} \widehat{\mathcal {Q}}_{\,t}^{j,\nu }(\mathbf {x}_{t-1}) \ \le \ \mathcal {Q}_{\,t}^{j}(\mathbf {x}_{t-1}). \end{aligned}$$

Consider now the function \(\underline{\mathcal {Q}}^{j,\nu }_ t(\mathbf {x}_{t-1})\) defined as

$$\begin{aligned} \ \underline{\mathcal {Q}}^{j,\nu }_ t(\mathbf {x}_{t-1})= \underset{ \mathbf {x}',\varvec{\mu }}{\max } \quad&\sum _{i=1}^\nu \mu _{i} \, {\widehat{\mathcal{{Q}}}}^{j,\nu }_{\,t}({\hat{\mathbf{x}}}^i_{t-1}) - L\,\Vert \mathbf {x}'\Vert _1 \nonumber \\ \text{ s.t. }&\sum _{i=1}^\nu \mu _i {\hat{\mathbf{x}}}^i_{t-1} + \mathbf {x}' = \mathbf {x}_{t-1} \nonumber \\ &\sum _{i=1}^\nu \mu _i = 1 \nonumber \\ &\varvec{\mu }\ge 0 \end{aligned}$$

where L is a Lipschitz constant for \(\mathcal {Q}^{j}_{\,t}(\cdot )\) under the 1-norm. Proposition 5 shows that \(\underline{\mathcal {Q}}_{\,t}^{j,\nu }(\cdot )\) is indeed a valid lower bound for \(\mathcal {Q}_{\,t}^j( \cdot )\).

Proposition 5

The function \(\underline{\mathcal {Q}}_{\,t}^{j,\nu }(\cdot )\) defined in (44) is a piecewise-linear concave lower bound for \(\mathcal {Q}_{\,t}^j(\cdot )\), whenever L is a Lipschitz constant for \(\mathcal {Q}^{j}_{\,t}(\cdot )\) under the 1-norm.


Proof

For \(t=T\) we have \({\underline{\mathcal {Q}}}_{\,T}^{j,\nu }(\mathbf {x}_{T-1})={\mathcal {Q}}_{\,T}^{j}(\mathbf {x}_{T-1})\) by definition and so the statement is true. Suppose \(t<T\). Define now a function \(\overline{\underline{\mathcal {Q}}}^{j}_ t(\mathbf {x}_{t-1})\) similarly to (44), but with the function \({\widehat{\mathcal{{Q}}}}^{j,\nu }_{\,t}\) replaced by the true value function \(\mathcal {Q}^{j}_{\,t}\). From (43), it is clear that \(\underline{\mathcal {Q}}^{j,\nu }_ t(\mathbf {x}_{t-1})\le \overline{\underline{\mathcal {Q}}}^{j}_ t(\mathbf {x}_{t-1})\). Thus, it suffices to show that \(\overline{\underline{\mathcal {Q}}}^{j}_ t(\cdot )\) is a lower bound for \(\mathcal {Q}_{\,t}^j( \cdot )\). Let \((\varvec{\mu }^*,\mathbf {x}^{*'})\) be an optimal solution to the problem defining \(\overline{\underline{\mathcal {Q}}}^{j}_ t(\mathbf {x}_{t-1})\). Define the quantity \(\mathbf {x}^{*''}:=\sum _{i=1}^\nu \mu ^*_i {\hat{\mathbf{x}}}^i_{t-1}\), so we see that \(\mathbf {x}_{t-1}=\mathbf {x}^{*'}+\mathbf {x}^{*''}\). Since \(\mathcal {Q}^{j}_{\,t}(\cdot )\) is concave, we have that

$$\begin{aligned} \sum _{i=1}^\nu \mu ^*_{i} \, \mathcal {Q}^{j}_{\,t}({\hat{\mathbf{x}}}^i_{t-1}) \ \le \ \mathcal {Q}^{j}_{\,t}(\mathbf {x}^{*''})\ &\le \ \mathcal {Q}^{j}_{\,t}(\mathbf {x}_{t-1})+ \zeta _{\mathbf {x}_{t-1}}^\top (\mathbf {x}^{*''}-\mathbf {x}_{t-1}), \end{aligned}$$

where \(\zeta _{\mathbf {x}_{t-1}}\) is any subgradient of \(\mathcal {Q}^{j}_{\,t}(\cdot )\) at \(\mathbf {x}_{t-1}\). It follows from the right-most inequality in (45) that

$$\begin{aligned} \mathcal {Q}^{j}_{\,t}(\mathbf {x}_{t-1}) &\ge \ \mathcal {Q}^{j}_{\,t}(\mathbf {x}^{*''})- \zeta _{\mathbf {x}_{t-1}}^\top (\mathbf {x}^{*''}-\mathbf {x}_{t-1})\nonumber \\&\ge \ \mathcal {Q}^{j}_{\,t}(\mathbf {x}^{*''})- |\zeta _{\mathbf {x}_{t-1}}^\top (\mathbf {x}^{*''}-\mathbf {x}_{t-1})| \nonumber \\&\ge \ \mathcal {Q}^{j}_{\,t}(\mathbf {x}^{*''})- \Vert \zeta _{\mathbf {x}_{t-1}}\Vert _2 \, \Vert (\mathbf {x}^{*''}-\mathbf {x}_{t-1})\Vert _2 \end{aligned}$$
$$\begin{aligned}&\ge \ \mathcal {Q}^{j}_{\,t}(\mathbf {x}^{*''})- \Vert \zeta _{\mathbf {x}_{t-1}}\Vert _1 \, \Vert (\mathbf {x}^{*''}-\mathbf {x}_{t-1})\Vert _1 \end{aligned}$$
$$\begin{aligned}&\ge \ \mathcal {Q}^{j}_{\,t}(\mathbf {x}^{*''})- L \, \Vert \mathbf {x}^{*'}\Vert _1 \end{aligned}$$
$$\begin{aligned}&\ge \ \sum _{i=1}^\nu \mu ^*_{i} \, \mathcal {Q}^{j}_{\,t}({\hat{\mathbf{x}}}^i_{t-1}) - L \, \Vert \mathbf {x}^{*'}\Vert _1 \end{aligned}$$
$$\begin{aligned}&= \ \overline{\underline{\mathcal {Q}}}^{j}_ t(\mathbf {x}_{t-1}). \end{aligned}$$

The inequality in (46) is an application of the Cauchy–Schwarz inequality, whereas the inequality in (47) follows from the well-known fact that \(\Vert \mathbf {x}\Vert _2\le \Vert \mathbf {x}\Vert _1\) for any vector \(\mathbf {x}\). Inequality (48) follows from the assumption that L is a Lipschitz constant for \(\mathcal {Q}^{j}_{\,t}(\cdot )\) under the 1-norm and therefore the 1-norm of any subgradient of \(\mathcal {Q}^{j}_{\,t}(\cdot )\) is bounded above by L. Finally, the inequality in (49) follows from the left-most inequality in (45), and (50) is the definition of \(\overline{\underline{\mathcal {Q}}}^{j}_ t(\mathbf {x}_{t-1})\).

Consider again the function \(\underline{\mathcal {Q}}^{j,\nu }_ t(\mathbf {x}_{t-1})\) defined in (44). As discussed earlier the function \({\widehat{\mathcal{{Q}}}}^{j,\nu }_{\,t}(\cdot )\) is piecewise-linear concave, and the function \(-L\Vert \mathbf {x}\Vert _1\) is piecewise-linear concave as well. It follows that the function \(\underline{\mathcal {Q}}^{j,\nu }_ t(\mathbf {x}_{t-1})\) defined in (44) is also piecewise-linear concave. \(\square\)

Problem (44) enhances the formulation proposed by [29] in that it allows for the evaluation of the lower bound function at points that are not in the convex hull of the points previously generated by the algorithm, thereby avoiding the enumeration of the vertices of the uncertainty support as proposed in that work. Moreover, it is important to observe that the approach can be easily adapted to other settings such as the standard SDDP, or the SDDP with nested risk measures that can be linearized, such as \(\mathrm{CV@R}\). In the case of nested risk measures, the difficulty of obtaining valid lower and upper bounds has long been recognized in the literature (see, e.g., [31, 44]).

We must also mention that the dual formulation of problem (44) can be written as

$$\begin{aligned} \underline{\mathcal {Q}}^{j,\nu }_ t(\mathbf {x}_{t-1}) &= \underset{ \psi ,\varvec{\zeta }}{\min\ } \psi + \varvec{\zeta }^\top \mathbf {x}_{t-1} \nonumber \\ &\text{ s.t. }\quad\psi + \varvec{\zeta }^\top {\hat{\mathbf{x}}}^i_{t-1} \ge {\widehat{\mathcal{{Q}}}}^{j,\nu }_{\,t}({\hat{\mathbf{x}}}^i_{t-1}), \quad \forall i = 1,\ldots , \nu \nonumber \\ &\qquad \Vert \varvec{\zeta }\Vert _\infty \le L, \end{aligned}$$

which corresponds to the lower bound function proposed by [1] (translated into the context of maximization problems). However, we argue that the primal formulation (44) facilitates the interpretation of the Lipschitz constant L since the decision variable \(\mathbf {x}'\) can be interpreted as a slack vector which has nonzero components whenever \(\mathbf {x}_{t-1}\) does not belong to the convex hull of \(\{ {\hat{\mathbf{x}}}^i_{t-1}\}_{i=1,\ldots ,\nu }\). The slack vector \(\mathbf {x}'\) then appears in the objective function with a sufficiently large penalty L. Such an approach opens the possibility of using other types of penalization of the slack variable \(\mathbf {x}'\), which could be problem-dependent but provide tighter bounds. For instance, we exploit the specific structure of the dynamic asset allocation problem presented in Sect. 5 to propose a modified lower bound with a proper penalization of the slack variable that does not require computing a Lipschitz constant. Finally, in order for the present paper to be self-contained, we have chosen to provide a proof of Proposition 5 from first principles, applying different proof techniques than those used by [1, 2].
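To make the slack interpretation concrete, the following sketch evaluates (44) for a scalar state, which is an assumption made purely for illustration. In that case the linear program has two equality constraints, so an optimal basic solution uses either one trial point plus slack or two trial points bracketing \(\mathbf {x}_{t-1}\) with zero slack, and simple enumeration suffices. The name `lower_approximation`, the stand-in value function and the data are hypothetical.

```python
def lower_approximation(x, trial_points, trial_values, lipschitz):
    """Evaluate (44) for a scalar state by enumerating basic solutions."""
    best = float("-inf")
    n = len(trial_points)
    # One trial point plus slack x' = x - x_i, penalized by L * |x'|.
    for x_i, v_i in zip(trial_points, trial_values):
        best = max(best, v_i - lipschitz * abs(x - x_i))
    # Two trial points bracketing x, with zero slack.
    for i in range(n):
        for j in range(n):
            x_i, x_j = trial_points[i], trial_points[j]
            if x_i < x < x_j:
                mu = (x_j - x) / (x_j - x_i)  # weight on trial point i
                best = max(best, mu * trial_values[i] + (1 - mu) * trial_values[j])
    return best


def true_value(x):
    # Stand-in concave value function with Lipschitz constant 1.
    return -abs(x - 2.0)


trial_points = [0.0, 2.0, 3.0]
trial_values = [true_value(x) for x in trial_points]
bound_outside = lower_approximation(5.0, trial_points, trial_values, 2.0)
```

Here the point 5 lies outside the convex hull of the trial points, so the slack is nonzero and, with the conservative choice L = 2, the bound there is \(-5\) against a true value of \(-3\). Inside the hull the slack vanishes and the bound reduces to interpolation between trial points.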

We close this section by noting that \(\widehat{Q}_{\,t}^{j,\nu }(\mathbf {x}_{t-1}, \varvec{\xi }_t)\) in (42) can be computed as the solution of a linear program, similarly to (26)–(35) but with \(\overline{\mathcal {Q}}_{t+1}^{\,k,\nu }(\mathbf {x}_t)\) in (29) replaced with \(\underline{\mathcal {Q}}_{\,t+1}^{k,\nu }(\mathbf {x}_t)\). For more details about the deterministic lower and upper bound algorithms, see Appendix 1.

3.4 Convergence

We now establish the convergence of our proposed approach. The following theorem shows that the gap between the deterministic upper and lower bounds becomes zero after finitely many iterations.

Theorem 6

Consider the modified SDDP algorithm described in Sect. 3.2, with the upper bound \(\overline{\mathcal {Q}}_{\,t+1}^{k,\nu }(\mathbf {x}_{t})\) defined in (40). Consider the lower bound \(\underline{\mathcal {Q}}_{\,t+1}^{k,\nu }(\mathbf {x}_{t})\) defined in (44). Suppose that the transition probability matrix obtained from the HMM is irreducible. Then, at some iteration \(\nu\), we have \(\overline{\mathcal {Q}}_{\,t+1}^{k,\nu }({\hat{\mathbf{x}}}^{\nu }_{t})= \underline{\mathcal {Q}}_{\,t+1}^{k,\nu }({\hat{\mathbf{x}}}^{\nu }_{t})\) for some feasible solution \(\{{\hat{\mathbf{x}}}^{\nu }_{t}\}_{t=0,\ldots ,T}\).


Proof

The convergence of the outer approximation follows from the proof of convergence of the standard SDDP presented by [30]. In that paper, the authors show that the optimal solutions of the outer approximations converge to an optimal solution of the original problem in finitely many iterations, assuming that every scenario in the problem is eventually sampled in the forward pass. In our context, it follows from Assumptions 3 and 4 that the objective function of the “true” discretized problem is concave piecewise linear, which is the setting in [30]. Thus, if the transition probability matrix obtained from the HMM is irreducible, then it is possible to generate any scenario with nonzero probability and hence the proof of [30] can be applied.

For the inner approximation we can use an inductive step backwards from \(t=T\) to \(t=1\). Suppose that, at some iteration \(\nu\), an optimal solution \(\{{\hat{\mathbf{x}}}^{\nu }_{t}\}_{t=0,\ldots ,T}\) of the outer problem is also an optimal solution of the original problem—as discussed above, one such solution is guaranteed to be found based on the arguments of [30]. That is, we have that \({\overline{\mathcal {Q}}}_{\,t}^{j,\nu }({\hat{\mathbf{x}}}^{\nu }_{t-1})={\mathcal {Q}}_{\,t}^{j}({\hat{\mathbf{x}}}^{\nu }_{t-1})\), \(t=T,\ldots ,1\). We will show by induction that \({\underline{\mathcal {Q}}}_{\,t}^{j,\nu }({\hat{\mathbf{x}}}^{\nu }_{t-1})={\mathcal {Q}}_{\,t}^{j}({\hat{\mathbf{x}}}^{\nu }_{t-1})\), \(t=T,\ldots ,1\), which immediately implies that the gap between upper and lower bounds is equal to zero.

As discussed in the proof of Proposition 5, for \(t=T\) we have \({\underline{\mathcal {Q}}}_{\,T}^{j,\nu }(\mathbf {x}_{T-1})={\mathcal {Q}}_{\,T}^{j}(\mathbf {x}_{T-1})\) for all \(\mathbf {x}_{T-1}\), so in particular the equality holds at \(\mathbf {x}_{T-1}={\hat{\mathbf{x}}}^{\nu }_{T-1}\). Suppose now that it holds for \(t+1\le T\). That is, we have \(\underline{\mathcal {Q}}^{j,\nu }_{\,t+1}(\mathbf {x}_{t})=\mathcal {Q}^{j}_{\,t+1}(\mathbf {x}_{t})\) for \(\mathbf {x}_{t}={\hat{\mathbf{x}}}^{\nu }_{t}\) and, from Proposition 5, \(\underline{\mathcal {Q}}^{j,\nu }_{\,t+1}(\mathbf {x}_{t})\le \mathcal {Q}^{j}_{\,t+1}(\mathbf {x}_{t})\) for \(\mathbf {x}_{t}\ne {\hat{\mathbf{x}}}^{\nu }_{t}\). It follows that \({\hat{\mathbf{x}}}^{\nu }_{t}\) is a maximizer of the problem in (42) when \(\widehat{Q}^{j,\nu }_{\,t}(\cdot ,\varvec{\xi }_t)\) is calculated at \(\mathbf {x}_{t-1}= {\hat{\mathbf{x}}}^{\nu }_{t-1}\) and thus we have that \({\widehat{\mathcal{{Q}}}}^{j,\nu }_{\,t}({\hat{\mathbf{x}}}^{\nu }_{t-1})=\mathcal {Q}^{j}_{\,t}({\hat{\mathbf{x}}}^{\nu }_{t-1})\). Hence, when calculating \(\underline{\mathcal {Q}}_{\,t}^{j,\nu }(\cdot )\) at \({\hat{\mathbf{x}}}^{\nu }_{t-1}\), by concavity of \(\mathcal {Q}^{j}_{\,t}(\cdot )\) the maximization problem in (44) puts weight \(\mu _{\nu }=1\) and thus we have \({\underline{\mathcal {Q}}}_{\,t}^{j,\nu }({\hat{\mathbf{x}}}^{\nu }_{t-1})={\mathcal {Q}}_{\,t}^{j}({\hat{\mathbf{x}}}^{\nu }_{t-1})\). \(\square\)

4 Assessing out-of-sample performance in a rolling horizon scheme

Most SDDP applications use a rolling horizon scheme to mitigate the end-effect of the terminal time stage. One way to interpret this usage is that the actual problem has an infinite horizon and is approximated by a finite horizon model with many time stages such that the “end of the world” has a small influence on the first stage decision. This is the case for long term energy planning, portfolio selection and asset-liability management problems, to name a few. In this section, we establish a generic out-of-sample evaluation framework and develop an acceleration scheme for the particular case of time-homogeneous models where the parameters of the problem (i.e. the functions \(f_t(\mathbf {x}_t,\varvec{\xi }_t) = f(\mathbf {x}_t,\varvec{\xi }_t)\) and \(g_t(\mathbf {x}_t,\varvec{\xi }_t) = g(\mathbf {x}_t,\varvec{\xi }_t)\), and the coefficients in the set \(\mathcal {X}_t\)) do not depend on the time period.

The framework for the rolling horizon scheme in a general setting can be described as follows. Consider an implementation horizon of length H and let \(t_1,\ldots ,t_H\) denote the times at which the model is solved and the corresponding first-stage optimal solution is implemented (see Footnote 5). A suitable way to emulate the actual decision making process is to concatenate five steps for a given time \(t \in \{t_1,\ldots ,t_H\}\): (i) the HMM parameters are estimated via the EM (expectation-maximization) algorithm using as input the sequence of observed uncertainty realizations \((\varvec{\xi }_1,\ldots ,\varvec{\xi }_t)\); (ii) an SAA version of the problem is generated as in Sect. 3.2; (iii) a Markov state classification is performed. A simple classification method is to take the state with the highest posterior probability of occurrence given the historical path of the process \(\{\varvec{\xi }_t\}\), cf. Eq. (2), and to use that state as the initial one. In step (iv), the SDDP algorithm for the problem with T stages is run until convergence (according to Algorithm 1) for problem (17)–(24) assuming a given previous implemented decision \(\mathbf {x}_{t-1}\) and the current uncertainty realization \(\varvec{\xi }_t\). Note that, in step (iv), the SDDP policy is obtained assuming observed Markov states with the current state defined by the HMM classification. In step (v), the first-stage decision \(\mathbf {x}_t\) is implemented, the time t is updated and we go to step (i). Note that this procedure is computationally intensive since SDDP is run until convergence for each time step of the simulation. Finally, implementing the decision \(\mathbf {x}_t\) in step (v) requires a method, namely step (iii), to infer the initial state of the Markov chain (recall that such states are not observable).
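The classification in step (iii) can be illustrated with a small sketch. It assumes that the posterior in Eq. (2) is the filtered probability of the current state given the observed path, and it uses univariate Gaussian emissions for brevity; the function names and all numbers are hypothetical and are not taken from the implementation used in the experiments.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian (stand-in for the state-conditional law)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def filtered_state_probabilities(observations, initial, transition, means, stds):
    """Return P(K_t = k | xi_1, ..., xi_t) for the last observed period t."""
    n_states = len(initial)
    belief = list(initial)
    for step, obs in enumerate(observations):
        if step > 0:
            # Prediction: propagate the belief one period with the transition matrix.
            belief = [sum(belief[j] * transition[j][k] for j in range(n_states))
                      for k in range(n_states)]
        # Update: weight by the likelihood of the new observation, then normalize.
        belief = [belief[k] * gaussian_pdf(obs, means[k], stds[k])
                  for k in range(n_states)]
        total = sum(belief)
        belief = [b / total for b in belief]
    return belief


def classify_state(observations, initial, transition, means, stds):
    """Pick the state with the highest filtered posterior probability."""
    belief = filtered_state_probabilities(observations, initial, transition, means, stds)
    return max(range(len(belief)), key=lambda k: belief[k])


# Toy two-state example (illustrative numbers, not estimates from the data).
initial = [0.5, 0.5]
transition = [[0.9, 0.1], [0.2, 0.8]]
means, stds = [0.01, -0.03], [0.02, 0.05]
history = [-0.06, -0.08, -0.05]
posterior = filtered_state_probabilities(history, initial, transition, means, stds)
current_state = classify_state(history, initial, transition, means, stds)
```

In this toy run the three large negative observations make the second state (index 1, lower mean and higher volatility) the most probable one, so it would be used as the initial state in step (iv).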

For the time-homogeneous case, it is appropriate to use only the first stage problem to implement every decision in the rolling horizon scheme. This is motivated by the fact that the problem structure does not depend on the period. In this context, we propose a relatively fast evaluation framework that is divided into two parts: estimation and sampling, and out-of-sample evaluation. In the estimation and sampling part, the training dataset is used as input for the EM algorithm to estimate the HMM parameters, i.e., nominal transition probabilities and conditional probability distributions of the uncertain vector. Those conditional distributions are sampled using Latin Hypercube Sampling (LHS), which typically performs better than the Monte Carlo sampling method, as shown in [18], to construct the SAA scenario tree. For the out-of-sample evaluation, a rolling horizon scheme is used over the testing dataset to simulate historical (out-of-sample) performance. In essence, we follow three steps for a given time t: (i) the Markov state classification is performed using (2); (ii) SDDP is run until convergence (again, according to Algorithm 1) for problem (26)–(35) assuming an observable Markov chain with the current state defined by the HMM classifier constructed in the training phase, a given previous stage decision \(\mathbf {x}_{t-1}\) and the current uncertainty realization \(\varvec{\xi }_t\); (iii) the first stage decision \(\mathbf {x}_t\) is implemented, the time t is updated and we go to step (i). Note that the steps are very similar to the general procedure laid out earlier; however, in the out-of-sample evaluation, we do not re-estimate the parameters of the HMM, nor do we generate a new SAA version of the model. Thus, the convergence of SDDP in step (ii) should be much faster, as it can use the value function approximations constructed in the previous steps, as described below.

Given that HMM parameters are fixed, the value function for each state and period remains the same and can be reused over the rolling horizon scheme. However, the value function might not be well approximated given the updated value of the initial condition \(\mathbf {x}_{t-1}\). Therefore, we use the current approximation of the value function of the first stage to perform a convergence test using deterministic upper and lower bounds (see Appendix 1) to evaluate the gap given the updated initial condition. If the gap is not sufficiently small, we restart the SDDP algorithm to improve the value function until it achieves a satisfactory gap. Once the algorithm converges, a current first-stage solution is obtained and implemented. The whole procedure is now repeated one-step ahead, given the previous optimal decision and the currently observed uncertainty realization. The whole evaluation process iterates until it reaches the last period to be simulated. This process is described in Fig. 1 assuming a fixed value for \(\varDelta\).
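The accelerated evaluation loop described above can be sketched as follows. This is only a schematic under stated assumptions: `classify`, `bounds`, `improve` and `first_stage` are hypothetical callables standing in for the HMM classifier, the deterministic bound computation, additional SDDP iterations that reuse the stored value function approximations, and the first-stage solve, respectively. The toy stubs at the end exist only to make the sketch executable.

```python
def rolling_horizon_backtest(realizations, x_init, classify, bounds, improve,
                             first_stage, rel_gap=0.01, max_rounds=1000):
    """Roll over the testing data, re-checking the gap at each updated initial condition."""
    x_prev, decisions = x_init, []
    for t, xi in enumerate(realizations):
        # Infer the current Markov state from the path observed so far.
        state = classify(realizations[: t + 1])
        # Convergence test with the value functions inherited from earlier periods.
        lower, upper = bounds(state, x_prev, xi)
        rounds = 0
        while upper - lower > rel_gap * abs(upper) and rounds < max_rounds:
            # Gap too large for the new initial condition: refine the approximations.
            improve(state, x_prev, xi)
            lower, upper = bounds(state, x_prev, xi)
            rounds += 1
        # Implement the first-stage decision and move one period ahead.
        x_prev = first_stage(state, x_prev, xi)
        decisions.append(x_prev)
    return decisions


# Toy stubs (hypothetical): the gap halves with every refinement round.
cut_pool = {"gap": 0.08, "improve_calls": 0}


def toy_bounds(state, x_prev, xi):
    return 1.0 - cut_pool["gap"], 1.0


def toy_improve(state, x_prev, xi):
    cut_pool["gap"] /= 2.0
    cut_pool["improve_calls"] += 1


decisions = rolling_horizon_backtest(
    [1.1, 0.9], 1.0,
    classify=lambda history: 0,
    bounds=toy_bounds,
    improve=toy_improve,
    first_stage=lambda state, x_prev, xi: x_prev * xi,
    rel_gap=0.015,
)
```

In the toy run all refinement work happens in the first period; the second period passes the convergence test immediately, which mirrors the speed-up that the reuse of value function approximations is meant to deliver.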

Fig. 1 Flowchart for backtesting using distributionally robust SDDP framework

Determining an appropriate value of \(\varDelta\) a priori is difficult in general. Some papers in the DRO literature compute the level of ambiguity based on the number of data points (see, e.g., [7, 25]). However, this type of procedure assumes that the data points are independent and identically distributed (i.i.d.), an assumption that is likely not to hold in the settings we are considering in the present paper. In our approach, the HMM approximates the dynamics of the stochastic process and the ambiguity set accounts not only for estimation errors (which go to zero with the number of data points), but also for model misspecification. We suggest choosing \(\varDelta\) via a robustness tuning procedure that selects the value of \(\varDelta\) (among a relatively small number of candidates) with the best out-of-sample performance. For that, we split our data into training, validation and testing datasets. In this context, the robustness tuning is a series of out-of-sample evaluation steps on the validation dataset followed by a final out-of-sample evaluation step on the testing (hold-out) dataset. This is indeed the approach we used in the case study presented in the next section.
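The tuning procedure amounts to a small grid search, sketched below. The scoring function is an assumption: the text only asks for "the best out-of-sample performance", so `evaluate_on_validation` is a hypothetical callable that runs the out-of-sample evaluation on the validation dataset for a given \(\varDelta\) and returns a scalar score (higher is better). The toy score is made up purely so that the sketch runs.

```python
def tune_delta(candidates, evaluate_on_validation):
    """Return the candidate Delta with the best validation score, plus all scores."""
    scores = {delta: evaluate_on_validation(delta) for delta in candidates}
    best = max(scores, key=scores.get)
    return best, scores


def toy_validation_score(delta):
    # Hypothetical placeholder for a full validation backtest; peaks at 0.25.
    return -(delta - 0.25) ** 2


candidates = [0.0, 0.1, 0.2, 0.25, 0.3, 0.35]
best_delta, validation_scores = tune_delta(candidates, toy_validation_score)
```

The selected value would then be used once, without further tuning, in the final evaluation on the testing (hold-out) dataset.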

5 Case study: a risk-constrained dynamic asset allocation model

In this section, we illustrate an application of the framework laid out in the previous sections to an asset allocation problem. The model learns the asset returns from the data and solves a dynamic optimization problem where the goal is to maximize the expected final wealth, taking into account the transaction costs in each period. Other papers use learning approaches for this problem; for example, a regret-optimization approach is applied in [21] to find the best (single-period) portfolio choice using historical data as input. We build upon the work of [46], which allows us to use their results as a benchmark since that paper does not deal with out-of-sample performance. In Sect. 5.1, we recap the stochastic model for asset returns while in Sect. 5.2, we present an equivalent formulation for the risk constrained dynamic asset allocation model proposed by [46].

5.1 The HMM learning methodology for asset returns

The uncertain returns \(\mathbf {r}_t\) are represented by a Hidden Markov Model (HMM). In the context of the financial market, the HMM methodology is frequently used to model asset returns [13, 14, 23]. This paradigm postulates that the probability distribution of asset returns depends only on the current state of the market, which evolves according to a discrete-time finite-state Markov chain. Such states, however, cannot be observed, hence the need for a Hidden Markov Model. Conditionally on each state, the log-returns are independent and identically distributed, with distribution given by a multivariate Gaussian whose parameters are estimated from data. This modeling choice is suitable for financial time series since it empirically reproduces most of the stylized facts for asset return series [38]. As before, we denote by \(K_t\) the (random) Markov state at time t, by \(\mathcal {K}\) the set of states of the Markov chain and by \(\hat{P}\) the corresponding estimated transition matrix with dimension \(|\mathcal {K}|\times |\mathcal {K}|\), with \(\hat{p}_j(k)\) denoting the probability of transitioning from state j to state k.

5.2 A CV@R-constrained dynamic asset allocation model

The model proposed in [46] is a multistage stochastic program that maximizes, in each stage, the future value function that represents the conditional expectation of the terminal wealth, subject to a \(\mathrm{CV@R}\) constraint. Using the notation defined in (5)–(9), that model can be written as follows. Given an initial wealth \(W_0\) and the stochastic return process \(\mathbf {r}_t\), we denote \(\varvec{\xi }_t = (\mathbf{1+r}_t)\) and solve, for each possible initial state j, the problem

$$\begin{aligned} Q_0^j:= \underset{ \mathbf {x}_{0}\in \mathbb {R}^{N+1}_+}{\max }\,&\sum \limits _{k\in \mathcal {K}}\mathbb {E}\left[ Q_{1}^k\left( \mathbf {x}_0, \varvec{\xi }_{1} \right) \big | K_{1} = k\right] \hat{p}_j(k) \end{aligned} \tag{52}$$
$$\begin{aligned}\quad \text{ s.t. }\quad&\phi _{{\hat{\mathbf{p}}}_j}\left[ \varvec{\xi }_{1}^\top \mathbf {x}_{0}\right] \ \ge \ (1-\gamma ) W_0 \end{aligned} \tag{53}$$
$$\begin{aligned}&(\mathbf{1}+\tilde{\mathbf {c}})^\top \mathbf {x}_{0} = W_0 \end{aligned} \tag{54}$$

where \(\tilde{\mathbf {c}}\) is the vector containing the transaction cost rate for each asset, and \(Q_{t}^j\) (for \(t=1,\ldots ,T-1\)) is defined recursively as

$$\begin{aligned} Q_{t}^j(\mathbf {x}_{t-1}, \varvec{\xi }_t)\ = \ &\underset{ \mathbf {x}_{t} \in \mathcal {X}_t( \mathbf {x}_{t-1}, \varvec{\xi }_t)}{\max }\,&\sum \limits _{k\in \mathcal {K}}\mathbb {E}\left[ Q_{t+1}^k\left( \mathbf {x}_t, \varvec{\xi }_{t+1} \right) \big | K_{t+1} = k\right] \hat{p}_j(k) \end{aligned} \tag{55}$$
$$\begin{aligned}&\qquad\text{ s.t. } \quad \quad&\phi _{{\hat{\mathbf{p}}}_j}\left[ \varvec{\xi }_{t+1}^\top \mathbf {x}_{t}\right] \ \ge \ (1-\gamma ) \left( \varvec{\xi }_{t}^\top \mathbf {x}_{t-1} \right) , \end{aligned} \tag{56}$$

while the end-of-horizon function \(Q_{T}^j(\mathbf {x}_{T-1}, \varvec{\xi }_T) = \varvec{\xi }_{T}^\top \, \mathbf {x}_{T-1},\) defines the terminal wealth.

In particular, we assume a risk-free asset indexed by \(i=0\) with null return, i.e., \(P(r_{0,t} = 0)=1\) and, consequently, \(P(\xi _{0,t}=1)=1\), for all \(t \in \{1,\ldots ,T\}\). We only assume positive transaction cost rates for the risky assets by defining \(\tilde{\mathbf {c}} =(0,\mathbf {c})^\top\). Moreover, to simplify the discussion below we assume that all risky assets have the same transaction cost rate c, so we have \(\mathbf {c}=(c,c,\ldots ,c)^\top\). In this context, the set \(\mathcal {X}_t(\mathbf {x}_{t-1}, \varvec{\xi }_t)\) is defined as

$$\begin{aligned} \left\{ \mathbf {x}_t \in \mathbb {R}^{N+1}_+ \big | \exists \mathbf{b}_t, \mathbf {d}_t \in \mathbb {R}^N_+ : \begin{array}{l} x_{0,t} + \mathbf{(1+c)}^\top \mathbf{b}_{t} - \mathbf{(1-c)}^\top \mathbf {d}_{t} = x_{0,t-1} \\ x_{i,t} - b_{i,t} + d_{i,t} = \xi _{i,t}\,x_{i,t-1}, \quad \forall i \in \mathcal {A} \end{array} \right\} , \end{aligned} \tag{57}$$

where \(x_{0,t}\) refers to a risk-free asset (cash) allocation while \(x_{i,t}\) for \(i > 0\) refers to risky asset allocations.

From (57), we see that the allocations \(\mathbf {x}_t \in \mathcal {X}_t(\mathbf {x}_{t-1}, \varvec{\xi }_t)\) satisfy the equation

$$\begin{aligned} \mathbf{1}^\top \mathbf {x}_{t} \ = \ { \varvec{\xi }}_{t}^\top \mathbf {x}_{t-1} - \mathbf {c}^\top (\mathbf{b}_{t} + \mathbf {d}_t), \end{aligned} \tag{58}$$

that is, the amount of money available at time t is the return of the investment made at time \(t-1\), minus the transaction costs of assets that were bought (\(\mathbf{b}_{t}\)) and sold (\(\mathbf{d}_{t}\)).
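The wealth identity can be checked numerically. The sketch below applies the two equations that define \(\mathcal {X}_t(\mathbf {x}_{t-1}, \varvec{\xi }_t)\), with a common transaction cost rate c and cash stored at index 0; the function name and the numbers are illustrative assumptions.

```python
def rebalance(x_prev, xi, buys, sells, c):
    """Apply the two equations defining X_t.

    x_prev, xi: N+1 entries (index 0 is the risk-free asset, with xi[0] = 1).
    buys, sells: N nonnegative entries, one per risky asset.
    """
    # Cash balance: purchases cost (1 + c) per unit, sales yield (1 - c) per unit.
    cash = x_prev[0] - (1.0 + c) * sum(buys) + (1.0 - c) * sum(sells)
    # Risky positions: value after returns, plus purchases, minus sales.
    risky = [xi[i] * x_prev[i] + buys[i - 1] - sells[i - 1]
             for i in range(1, len(x_prev))]
    return [cash] + risky


# Illustrative numbers (hypothetical): two risky assets, 1% cost rate.
x_prev = [0.5, 0.3, 0.2]
xi = [1.0, 1.10, 0.95]
buys, sells, c = [0.1, 0.0], [0.0, 0.05], 0.01

x_new = rebalance(x_prev, xi, buys, sells, c)
wealth_before = sum(r * x for r, x in zip(xi, x_prev))
transaction_cost = c * (sum(buys) + sum(sells))
```

Here `sum(x_new)` coincides with `wealth_before - transaction_cost`, which is the identity stated above.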

A few words about the above model are in order. First, notice that the objective functions in (52) and (55) maximize the expected future value of the allocation in each period, where the expectation is taken with respect to both the returns and the Markov states. Constraint (54) reflects the fact that the transaction costs are incurred before the returns are realized; thus, assuming that the initial wealth \(W_0\) is in cash, the initial allocation \(\mathbf{1}^\top \mathbf {x}_0\) plus the corresponding purchase costs must be equal to that amount. This constraint is generalized to an arbitrary time period t by means of the set \(\mathcal {X}_t(\mathbf {x}_{t-1}, \varvec{\xi }_t)\) defined in (57), which accounts for the transaction costs resulting from both purchases and sales of assets (note that \(b_{i,t}\) and \(d_{i,t}\) are never simultaneously positive as nothing is gained from buying and selling the same asset in a given time period).

Note that in problem (55)–(56) the values of \(\varvec{\xi }_t\) and \(\mathbf {x}_{t-1}\) are given. Thus, the wealth \(W_{t} = \varvec{\xi }_{t}^\top \mathbf {x}_{t-1}\)—prior to discounting transaction costs, cf. (58)—in period t is just a constant and hence by the translation-invariant property of coherent risk measures we have that \(\mathrm{CV@R}_{1-\alpha }\left[ W_t-W_{t+1}\right] = W_t + \mathrm{CV@R}_{1-\alpha }\left[ -W_{t+1}\right] = W_t - \phi _{{\hat{\mathbf{p}}}_j}[W_{t+1}]\). It follows that constraint (56) can be written as \(\mathrm{CV@R}_{1-\alpha }\left[ W_t-W_{t+1}\right] \le \gamma W_t\). That is, the constraint limits the loss between periods t and \(t+1\) to a percentage of the wealth at time t (note that constraint (53) applies the same idea at \(t=0\)). The parameter \(\gamma\) can then be interpreted as the level of risk-aversion of the decision-maker: at one extreme (\(\gamma =0\)) we have \(\mathrm{CV@R}_{1-\alpha }\left[ W_{t}- W_{t+1}\right] \le 0\) which in particular implies that \(P(W_{t+1}<w_t\,|\,W_t=w_t)\le \alpha\), i.e., the probability of a loss between periods t and \(t+1\) must be very low; at the other extreme (\(\gamma =1\)) we do not impose any risk constraints and so when there are no transaction costs the optimal portfolio will invest only in the asset(s) with highest expected return at each time t (“all eggs in the same basket”).

5.3 A novel lower bound for the dynamic asset allocation problem

Motivated by the primal inner-approximation presented in Sect. 3.3, we use the particular structure of the dynamic asset allocation problem to propose a novel lower bound that exploits a convex combination of pre-evaluated points and a proper penalty function for values outside the associated convex hull. With this result at hand, we use the standard SDDP upper bound (outer approximation) to efficiently compute a deterministic optimality gap. Throughout this section the function \(Q_{t+1}^j\left( \mathbf {x}_{t}, \varvec{\xi }_{t+1} \right)\) corresponds to the DRO version of problem (52)–(56), defined as in (10)–(11) and its equivalent formulation (17)–(24). Recall also the expected value function \({\mathcal {Q}}_{\,t+1}^{k}(\mathbf {x}_{t})\) defined in (25).

As shown in [46], the asset allocation problem (52)–(56) has relatively complete recourse whenever \(\gamma \ge c\). Indeed, if the maximum allowed loss \(\gamma\) is at least the transaction cost rate, it is always feasible to sell all risky assets and adopt a risk-free strategy with null return, in which all wealth remaining after transaction costs is held in \(x_{0,t}\) and \(x_{i,t} = 0, \, \forall i \in \mathcal {A}\). Moreover, this feasible and simple strategy has a straightforward value function since the objective function, i.e., the terminal wealth \(W_T\), is equal to the current wealth (\(W_t = \varvec{\xi }_t^\top \mathbf {x}_{t-1}\)) minus the total transaction cost of selling the risky assets (\(\mathbf {c}^\top (\mathbf{b}_{t} + \mathbf {d}_t)\)), where \(d_{i,t} = \xi _{i,t}\,x_{i,t-1}\) and \(b_{i,t} = 0\) for every \(i \in \mathcal {A}\). This is shown formally in Proposition 7 below.

Proposition 7

Suppose that the parameter \(\gamma\) that appears on the right-hand side of (53) satisfies \(\gamma \ge c\), where c is the transaction cost rate. Then, \(Q^j_t(\mathbf {x}_{t-1},\varvec{\xi }_t) \ge (1-c)\, \varvec{\xi }_t^\top \mathbf {x}_{t-1}\) for all Markov states j.


Proof

Consider a fixed time period t. As the previous allocation vector \(\mathbf {x}_{t-1}\) and the realization \(\varvec{\xi }_t\) are given as parameters of \(Q^j_t\), we let \(W_t = \varvec{\xi }_t^\top \mathbf {x}_{t-1}\) denote the wealth right before buying and selling decisions at time t. Now, define the risk-free (sub-optimal) policy where all risky assets are sold at time t and the risk-free investment (with zero return) is held until the end of the horizon T. We use the superscript notation \(\mathbf {x}^{rf}_t,\ \mathbf{b}^{rf}_t\) and \(\mathbf {d}^{rf}_t\) to denote the values of these decision variables under the risk-free policy. Formally, the risk-free policy amounts to imposing that \(d^{rf}_{i,t} = \xi _{i,t}\,x_{i,t-1}\), \(b^{rf}_{i,t} = 0\) for every \(i \in \mathcal {A}\), and also \(\mathbf{b}^{rf}_{\tau }=\mathbf {d}^{rf}_{\tau }=\mathbf{0}\) for all \(\tau =t+1,\ldots ,T\). Using (58), the amount of money invested in the risk-free asset after buying and selling decisions at t is given by \(x^{rf}_{0,t} = \mathbf{1}^\top \mathbf {x}^{rf}_t = \ { \varvec{\xi }}_{t}^\top \mathbf {x}_{t-1} - \mathbf {c}^\top (\mathbf{b}^{rf}_{t} + \mathbf {d}^{rf}_t) = { \varvec{\xi }}_{t}^\top \mathbf {x}_{t-1} - c\,\sum _{i\in \mathcal {A}} \xi _{i,t}\,x_{i,t-1}.\)

Since the risk-free asset has null return (i.e., \(r_{0,t+1}=0\) and, consequently, \(\xi _{0,t+1}=1\)), the subsequent wealths can be calculated as \(W_{\tau } =\varvec{\xi }_\tau ^\top \mathbf {x}_{\tau -1}^{rf} = x_{0,\tau -1}^{rf}=x_{0,t}^{rf}, \forall \tau \in \{t+1,\ldots ,T\}\). Therefore, we have the terminal wealth \(W_T = x_{0,t}^{rf} = { \varvec{\xi }}_{t}^\top \mathbf {x}_{t-1} - c\,\sum _{i\in \mathcal {A}} \xi _{i,t}\,x_{i,t-1},\) which corresponds to the objective value of the risk-free (suboptimal) policy. Hence,

$$\begin{aligned} Q^j_t(\mathbf {x}_{t-1},\varvec{\xi }_t)\ \ge \ &{ \varvec{\xi }}_{t}^\top \mathbf {x}_{t-1} - c\left( \sum _{i\in \mathcal {A}} \xi _{i,t}\,x_{i,t-1}\right) \ge (1-c)\, \varvec{\xi }_t^\top \mathbf {x}_{t-1}. \end{aligned}$$
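The bound can be illustrated with a few lines of code. The sketch below computes the terminal wealth of the sell-everything policy used in the argument above and compares it with \((1-c)\, \varvec{\xi }_t^\top \mathbf {x}_{t-1}\); the function name and the numbers are hypothetical.

```python
def risk_free_terminal_wealth(x_prev, xi, c):
    """Terminal wealth when all risky assets are sold at time t and cash is held until T."""
    # Wealth right before trading (index 0 is the risk-free asset, xi[0] = 1).
    wealth = sum(r * x for r, x in zip(xi, x_prev))
    # Only the risky holdings are sold, so only they incur the cost rate c.
    risky_value = sum(xi[i] * x_prev[i] for i in range(1, len(x_prev)))
    return wealth - c * risky_value


x_prev = [0.5, 0.3, 0.2]   # illustrative allocation, cash first
xi = [1.0, 1.10, 0.95]     # illustrative gross returns
c = 0.01

wealth = sum(r * x for r, x in zip(xi, x_prev))
terminal = risk_free_terminal_wealth(x_prev, xi, c)
simple_bound = (1.0 - c) * wealth
```

The gap between `terminal` and `simple_bound` equals c times the cash component of the wealth, so the two coincide only when nothing was held in the risk-free asset.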


Similarly to the developments in Sect. 3.3, at iteration \(\nu\) of the algorithm we construct a concave lower (inner) approximation \(\underline{\mathcal {Q}}_{\,t+1}^{k,\nu }( \cdot )\) for \(\mathcal {Q}_{\,t+1}^k( \cdot )\) given by linear inequalities. We then define the function \(\widehat{Q}_{t}^j\left( {\hat{\mathbf{x}}}^i_{t-1}, \varvec{\xi }_{t} \right)\) as in (42) and compute its expectation \({\widehat{\mathcal{{Q}}}}_{\,t}^{j}({\hat{\mathbf{x}}}^i_{t-1}) = \mathbb {E}\left[ \widehat{Q}_{t}^j\left( {\hat{\mathbf{x}}}^i_{t-1}, \varvec{\xi }_{t} \right) \big | K_{t} = j\right]\).

Recall that \(\{ {\hat{\mathbf{x}}}^i_{t-1}\}_{i=1,\ldots ,\nu }\) denote the solutions obtained from the previous iterations of the algorithm. We add an initial point \({\hat{\mathbf{x}}}^0_{t} = \mathbf{0}\) for all t. Since \({\hat{\mathbf{x}}}^0_{t} = \mathbf{0}\) corresponds to having no wealth at all, it is clear that \({\widehat{\mathcal{{Q}}}}^{j,\nu }_{\,t}({\hat{\mathbf{x}}}^0_{t-1}):={\mathcal {Q}}_{\,t}^{j}({\hat{\mathbf{x}}}^0_{t-1}) = 0\) for every iteration \(\nu\), every time stage t and every Markov state j. We now devise a novel lower bound for the asset allocation problem.

Proposition 8

Let \({\overline{\varvec{\xi }}}_{t,j} = \mathbb {E}\left[ \varvec{\xi }_t\big | K_{t} = j\right]\) denote the conditional expectations of the returns. Suppose \(\gamma \ge c\). Then, the function

$$\begin{aligned}\underline{\mathcal {Q}}^{j,\nu }_{\,t}(\mathbf {x}_{t-1})& := \underset{{\mathbf {x}'},\varvec{\mu }}{\max\ } \sum _{i=0}^\nu \mu _{i} \, {\widehat{\mathcal{{Q}}}}^{j,\nu }_{\,t}({\hat{\mathbf{x}}}^i_{t-1}) + (1-c)\, \overline{\varvec{\xi }}_{t,j}^\top \, {\mathbf {x}'} \\ &\text{ s.t. }\sum _{i=0}^\nu \mu _i {\hat{\mathbf{x}}}^i_{t-1} + {\mathbf {x}'} = \mathbf {x}_{t-1} \\ &\qquad \sum _{i=0}^\nu \mu _i = 1 \\ & \qquad\varvec{\mu }, \mathbf {x}' \ge 0 \end{aligned} \tag{59}$$

is a lower bound for \({\mathcal {Q}}_{\,t}^{j}(\mathbf {x}_{t-1})\).


Proof

First, note that problem (59) is always feasible. Indeed, given that \({\widehat{\mathcal{{Q}}}}^{j,\nu }_{\,t}({\hat{\mathbf{x}}}^0_{t-1})=0\), the solution \(\mu _0 = 1\) and \(\mathbf {x}' = \mathbf {x}_{t-1}\) is feasible and recovers the lower bound in Proposition 7. Then, for any feasible \(\mu _0,\ldots ,\mu _\nu\) and \(\mathbf {x}'\), we have that

$$\begin{aligned} \sum _{i=0}^\nu \mu _{i} \, {\widehat{\mathcal{{Q}}}}^{j,\nu }_{\,t}({\hat{\mathbf{x}}}^i_{t-1}) + (1-c)\, \overline{\varvec{\xi }}_{t,j}^\top \, \mathbf {x}'&\ \le \ \sum _{i=0}^\nu \mu _{i} \, {\mathcal {Q}}_{\,t}^{j}({\hat{\mathbf{x}}}^i_{t-1}) + {\mathcal {Q}}_{\,t}^{j}(\mathbf {x}' ) \end{aligned} \tag{60}$$
$$\begin{aligned}&\ \le \ \mathcal {Q}^{j}_{\,t}\left( \sum _{i=0}^\nu \mu _{i} \,{\hat{\mathbf{x}}}^i_{t-1}\right) + {\mathcal {Q}}_{\,t}^{j}(\mathbf {x}' ) \end{aligned} \tag{61}$$
$$\begin{aligned}&\ \le \ {\mathcal {Q}}_{\,t}^{j}({\mathbf {x}}_{t-1} ). \end{aligned} \tag{62}$$

Inequality (60) holds since \({\widehat{\mathcal{{Q}}}}^{j,\nu }_{\,t}({\hat{\mathbf{x}}}^i_{t-1}) \le {\mathcal {Q}}_{\,t}^{j}({\hat{\mathbf{x}}}^i_{t-1})\) and, by taking conditional expectations in Proposition 7, \((1-c)\, \overline{\varvec{\xi }}_{t,j}^\top \, \mathbf {x}' \le {\mathcal {Q}}_{\,t}^{j}(\mathbf {x}' )\). Additionally, inequality (61) follows from the concavity of \({\mathcal {Q}}_{\,t}^{j}\), while (62) is guaranteed since \({\mathcal {Q}}_{\,t}^{j}\) is concave and positively homogeneous (proof in Appendix 3), and therefore superadditive, together with the first constraint in (59). \(\square\)
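Since (59) is a maximization, any feasible pair \((\varvec{\mu }, \mathbf {x}')\) already yields a valid, possibly weaker, lower bound. The sketch below evaluates the objective of (59) at a given weight vector without solving the linear program; the function name, the stored points and their values are hypothetical and chosen only for illustration.

```python
def inner_bound_at(mu, points, values, x, xi_bar, c, tol=1e-12):
    """Objective of the inner problem at weights mu, or None if mu is infeasible.

    points[i] is a stored trial point and values[i] its evaluated lower bound;
    points[0] is the zero vector with value 0.
    """
    if any(m < -tol for m in mu) or abs(sum(mu) - 1.0) > tol:
        return None
    n = len(x)
    # Convex combination of the stored points.
    combo = [sum(m * p[a] for m, p in zip(mu, points)) for a in range(n)]
    # Remainder x' = x - combo must be nonnegative.
    remainder = [x[a] - combo[a] for a in range(n)]
    if any(r < -tol for r in remainder):
        return None
    # The remainder is valued with the risk-free bound (1 - c) * xi_bar' x'.
    penalty_term = (1.0 - c) * sum(r * s for r, s in zip(xi_bar, remainder))
    return sum(m * v for m, v in zip(mu, values)) + penalty_term


points = [[0.0, 0.0], [0.6, 0.4]]   # zero point plus one stored trial point
values = [0.0, 1.05]                # illustrative evaluated bounds at those points
xi_bar = [1.0, 1.02]                # illustrative conditional mean gross returns
x, c = [0.6, 0.6], 0.01

loose = inner_bound_at([1.0, 0.0], points, values, x, xi_bar, c)
tight = inner_bound_at([0.0, 1.0], points, values, x, xi_bar, c)
infeasible = inner_bound_at([0.0, 1.0], points, values, [0.5, 0.6], xi_bar, c)
```

With \(\mu _0 = 1\) the function reproduces the bound of Proposition 7, whereas putting the weight on the stored point gives a larger value in this toy instance, which is why the maximization over \(\varvec{\mu }\) in (59) tightens the bound.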

5.4 Numerical results

To analyze how our approach behaves in practice, we test the model with realistic data. The data sets used in the experiments come from the Kenneth R. French data set (see Footnote 6). The stocks from NYSE, AMEX, and NASDAQ are represented by capitalization-weighted indexes for each industry sector. We use monthly data of five industry portfolios (“Cnsmr”, “Manuf”, “HiTec”, “Hlth” and “Other”). For simplicity, we use excess returns, i.e., the incremental return over the risk-free asset. This way, the risk-free asset has \(r_{0,t} = 0, \forall t = 1,\ldots ,T\).

The framework was implemented in Julia 0.6, using JuMP [11] and CPLEX to solve linear programming problems. All experiments were conducted on an Intel Xeon E5-2680 2.7 GHz machine with 128 GB RAM, and reported computational times are associated with single-core usage. The hmmlearn 0.2.0 library (see Footnote 7) was used to construct the return distributions assuming that, conditional on each Markov state, log (excess) returns follow multivariate Gaussian distributions.

5.4.1 Results for the predictive model

The training dataset comprises 444 months (see Footnote 8) prior to January 2007, while the dataset for historical simulation uses 96 months (from January 2007 to September 2014) to validate the proposed framework. Following [46], we select three Markov states (see Footnote 9) and, conditional on each state, 750 return realizations obtained using Latin Hypercube Sampling of multivariate Gaussian distributions to construct the Sample Average Approximations of the problem. All simulations start with $1 in the risk-free asset; therefore, if the strategy ends the simulation with $2, it implies an accumulated excess return of 100%. Figure 2 illustrates the posterior probability of each Markov state as in (1), and the solid line indicates the simulated wealth of the equal-weighted portfolio (as a proxy for the general behavior of the market) with cumulative return on the right axis.
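For intuition on the sampling step, the following sketch draws a Latin Hypercube sample from a univariate Gaussian by placing exactly one uniform draw in each of n equiprobable strata and mapping it through the inverse CDF. This is a simplification made for illustration: the experiments sample multivariate Gaussians, and the function name, seed and parameters below are assumptions rather than the authors' implementation.

```python
import random
from statistics import NormalDist


def latin_hypercube_normal(n, mean, std, rng):
    """Draw n values from N(mean, std^2), one per equiprobable stratum."""
    dist = NormalDist(mean, std)
    samples = []
    for i in range(n):
        # One uniform draw inside the stratum [i/n, (i+1)/n).
        u = (i + rng.random()) / n
        # Keep u strictly inside (0, 1), as required by inv_cdf.
        u = min(max(u, 1e-12), 1.0 - 1e-12)
        samples.append(dist.inv_cdf(u))
    # Shuffle so that the order of the draws carries no information.
    rng.shuffle(samples)
    return samples


rng = random.Random(0)
samples = latin_hypercube_normal(750, 0.0, 1.0, rng)
sample_mean = sum(samples) / len(samples)
```

Because every stratum is visited exactly once, the tails are always represented and the sample mean is much closer to the true mean than a plain Monte Carlo sample of the same size would typically give.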

Fig. 2 Markov states and equal-weight portfolio wealth

Table 1 Markov transition matrix in percentage
Table 2 Mean percentage (standard-deviation in parenthesis) of asset returns conditional to each Markov state

A closer look at the Markov transition matrix in Table 1 and at the individual asset returns and corresponding standard deviations in Table 2, in conjunction with Fig. 2, allows us to infer how the HMM classifies the historical data and how to interpret the states. State 1 has low positive returns and a low probability of transitioning to state 3. State 2 has a high probability of transitioning to state 1, higher returns than state 1, and more volatility. State 3 has negative returns, is almost absorbing with a 90% chance of transitioning to itself, and has almost no probability of transitioning to state 1. Therefore, states 2 and 3 can be seen as bull and bear states, respectively. It is more difficult to infer the role of state 1. It seems to be a less volatile regular state since it is the most probable state during the whole simulation (Fig. 2).

5.4.2 Results for the prescriptive model

As discussed in Sect. 4, we implemented the algorithm in a rolling-horizon fashion. The horizon (number of periods) in each problem is \(T=16\) months with monthly decisions. To illustrate the convergence of the deterministic lower and upper bounds established in Theorem 6 (with the lower bound calculated as in Proposition 7), Fig. 3 depicts the value of the bounds for an arbitrary run of the algorithm with \(\varDelta = 0.3\) and \(\gamma = 0.07\) for a maximum of 5000 iterations. In this example the final values of the deterministic lower and upper bounds were respectively 0.003388 and 0.003393, corresponding to an optimality gap of 0.1457%. In practice, we fixed a relative optimality gap of 1% as a stopping criterion. The SDDP algorithm took almost 3 hours to reach the 1% gap for each period (or month). However, by applying the accelerated rolling-horizon procedure described in Sect. 4, which uses value function approximations constructed in the previous steps, from the second rolling-horizon step onward the algorithm took less than 30 min per period.

Fig. 3 Deterministic lower and upper bounds for \(\varDelta = 0.3\) and \(\gamma = 0.07\) starting from iteration 100

For the out-of-sample evaluation described in Sect. 4, we start with an estimated HMM, an SAA of the original problem, and the output of the algorithm after running until convergence, comprising a first stage problem and a set of T future value functions. We denote the periods of the testing dataset by \(t \in \{t_1,\ldots ,t_H\}\) and define \(R_t = \frac{W_t - W_{t-1}}{W_{t-1}}\) as the portfolio percentage profit given by the proposed strategy at time t (recall that \(W_t:= \varvec{\xi }_t^\top \, \mathbf {x}_{t-1}^*\) is the corresponding wealth, prior to discounting transaction costs, cf. (58)). It is important to reiterate that the implemented decisions \(\mathbf {x}_t^*\) are obtained as the first stage solution of a T-stage problem given the current (inferred) Markov state, i.e., the most probable state given all information available up to t, cf. (2). In order to compare the out-of-sample performance of the different models, we use the “ex-post” average return of the portfolio strategy, \(\hat{R}_{EP}:=\frac{1}{H}\sum _{t \in \{t_1,\ldots ,t_H\}} R_{t}\), and the “ex-post” \({\text{CV}}@{\text{R}}\) of the returns, defined as \(- \hat{\phi }_{EP}\), where \(\hat{\phi }_{EP} = \max _{z\in \mathbb {R}} \left\{ z -\frac{1}{H\alpha }\sum _{t \in \{t_1,\ldots ,t_H\}} \,(z-R_{t})_+ \right\}\) following the expression in (4). Note that a comparison of the ex-post CV@R with the parameter \(\gamma\) can be interpreted as an out-of-sample evaluation of constraint (56), since as remarked earlier, that constraint can be written as \(\mathrm{CV@R}_{1-\alpha }\left[ W_t-W_{t+1}\right] \le \gamma W_t\), i.e., \(-\phi _{{\hat{\mathbf{p}}}_j}\left[ R_{t+1}\right] \le \gamma .\) Despite the differences with the ex-ante counterparts, ex-post metrics are widely used, especially within the context of financial markets.
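The two ex-post metrics can be computed directly from the realized returns, as sketched below. The maximization over z is carried out by evaluating the objective at the sample points only, which is sufficient because the objective is concave and piecewise linear with breakpoints at the \(R_t\). The function name, the return series and the value of \(\alpha\) are hypothetical.

```python
def ex_post_metrics(returns, alpha):
    """Return the ex-post average return and the ex-post CV@R (i.e., -phi_EP)."""
    horizon = len(returns)
    average_return = sum(returns) / horizon

    def objective(z):
        # z - (1 / (H * alpha)) * sum of (z - R_t)_+
        shortfall = sum(max(z - r, 0.0) for r in returns)
        return z - shortfall / (horizon * alpha)

    # Concave piecewise-linear objective: a maximizer lies at a breakpoint R_t.
    phi = max(objective(z) for z in returns)
    return average_return, -phi


# Illustrative monthly returns (hypothetical, not from the experiments).
monthly_returns = [0.02, -0.05, 0.01, 0.03, -0.01, 0.04, 0.00, -0.10, 0.02, 0.01]
average_return, cvar = ex_post_metrics(monthly_returns, alpha=0.2)
```

With ten observations and \(\alpha = 0.2\), the ex-post CV@R equals minus the average of the two worst returns, here 0.075, which would then be compared against \(\gamma\).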

For a better assessment of out-of-sample performance, several experiments were done with different combinations of \(\gamma\) and \(\varDelta\). This can be viewed as a cross-validation procedure. It is important to stress the difference between these parameters. The former quantifies the decision-maker's level of risk aversion (cf. (56)), whereas the latter establishes the confidence in the estimated distribution (cf. (12)). In this context, \(\gamma\) restricts the possible decisions; however, even if the risk restriction is met, the confidence (or lack thereof) in the estimated probabilities (\({\hat{\mathbf{p}}}\)) will still impact the optimal portfolio decision.

We illustrate the compound effect of the ambiguity aversion (\(\varDelta\)) and the risk aversion (\(\gamma\)) coefficients over the optimal allocation. The effect of these coefficients can be seen in Fig. 4, where we present the optimal portfolio on a particular date as a function of \(\varDelta\) for a few values of \(\gamma\). We see, for example, that for a low value of \(\gamma\) (i.e., a more risk-averse decision-maker) the optimal portfolio is less sensitive to variations in \(\varDelta\), as the optimal portfolio puts a high percentage on the risk-free asset regardless of the value of \(\varDelta\). For a slightly less risk-averse decision-maker (\(\gamma =0.05\)) the optimal portfolio is diversified, with the components changing according to \(\varDelta\). Note that for values of \(\varDelta\) larger than 0.35, the ambiguity set includes all distributions and so the min-max problem will always assume the worst possible state, so it is not surprising that the optimal portfolio from that value of \(\varDelta\) on puts everything on the risk-free asset. With \(\gamma =0.1\), we essentially have a risk-neutral decision-maker, and so for most values of \(\varDelta\), the optimal portfolio consists only of the asset with the largest expected return, though that asset changes based on the level of confidence in the parameters of the HMM given by \(\varDelta\).

Fig. 4 Allocation for different \(\varDelta\) values with \(\gamma\) equal to 0.01, 0.05 and 0.1

To further analyze this distinction between the parameters and how the Markov state impacts the final portfolio, we show the portfolio allocation during the simulation for specific \(\gamma\) and \(\varDelta\) values. This comparison is depicted in Fig. 5, where we present the allocation policy for two \(\varDelta\) values, 0.0 and 0.2, with \(\gamma = 0.07\) for the whole simulation period. We choose \(\gamma =0.07\) because it is the ex-post CV@R corresponding to the equal-weight portfolio. The left axis shows the allocation in percentage of each asset, and the right axis shows the wealth for our DRO model and the equal-weight portfolio for comparison purposes. Some observations can be made: first, notice that during the period between 2008 and 2009 (which corresponds to the subprime crisis) the optimal portfolios for both values of \(\varDelta\) learn from the HMM that the market is in a bear state (cf. Fig. 2) and thus allocate almost everything into the risk-free asset. Second, while the equal-weight portfolio clearly dominates the portfolio for the case \(\varDelta =0\), it is outperformed by the robustified portfolio with \(\varDelta =0.2\) as the latter strategy yields better protection during the “bear” times and provides good diversification and good returns during the remaining periods.

Fig. 5 Allocation during the simulation for \(\varDelta\) values 0.00 and 0.20 with \(\gamma = 0.07\)

The risk-return curves for different values of \(\gamma\) and \(\varDelta\) are shown in Fig. 6. Naturally, among portfolios with the same risk, those with higher returns are preferred, whereas among portfolios with the same returns, those with less risk are preferred. In the figure, each line corresponds to one value of \(\varDelta\), whereas each dot corresponds to one value of \(\gamma\). We see that the efficient frontier consists of portfolios corresponding to \(\varDelta\) around 0.25–0.3, regardless of \(\gamma\). Thus, it appears, based on these experiments, that the right choice for \(\varDelta\) ensures good performance regardless of the decision-maker's risk tolerance. The use of \(\gamma\), however, is still important for sensitivity purposes, as we can see that lower values of \(\varDelta\) combined with high values of \(\gamma\) can yield portfolios with inferior performance. Notice also that, as remarked earlier, values of \(\varDelta\) above 0.35 lead to excessive robustness, as there is no trust in the HMM parameters, and thus the optimal portfolio consists only of the risk-free asset. That is, it is important to collect sufficient data, so that one has some confidence in the HMM parameters, but it is better not to trust such parameters blindly.

Fig. 6 Out-of-sample monthly average return and CV@R
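The dominance relation used to read Fig. 6 can be made concrete with a small sketch. The function name `efficient_frontier` and the sample numbers below are hypothetical and are not taken from our experiments; the sketch only assumes that each portfolio is summarized by a pair (risk, return), with lower risk and higher return preferred.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (risk, average return); both values are illustrative


def efficient_frontier(points: List[Point]) -> List[Point]:
    """Return the points that are not dominated by any other point.

    A point j dominates a point i if it has no more risk and no less return,
    and is strictly better in at least one of the two coordinates.
    """
    frontier = []
    for i, (risk_i, ret_i) in enumerate(points):
        dominated = any(
            risk_j <= risk_i
            and ret_j >= ret_i
            and (risk_j < risk_i or ret_j > ret_i)
            for j, (risk_j, ret_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            frontier.append((risk_i, ret_i))
    return sorted(frontier)


if __name__ == "__main__":
    # Hypothetical (CV@R, monthly average return) pairs, for illustration only.
    sample = [(0.02, 0.004), (0.03, 0.003), (0.05, 0.009), (0.05, 0.007), (0.08, 0.009)]
    print(efficient_frontier(sample))
```

The quadratic pairwise comparison is adequate for the small grids of \(\varDelta\) and \(\gamma\) values considered in a study of this kind; a sort-based sweep would be preferable for large point sets.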

As shown by other authors [9, 16, 25], the equal-weight portfolio is a good benchmark strategy, as it has competitive out-of-sample performance, especially when the model faces extreme uncertainty or when the transaction costs are high. In Fig. 6, we see that the equal-weight portfolio is dominated by most strategies. This confirms the superior out-of-sample performance of our proposed model, except in the cases where the ex-post CV@R is high; that can only occur when the pre-specified risk tolerance \(\gamma\) is high, that is, when the decision-maker is nearly risk-neutral.

Finally, in order to assess the effect of \(\varDelta\) on the quality of the portfolio, we need to use a metric that summarizes risk and return. The ICV@R [15] was inspired by the Sharpe ratio in that it measures return per unit of risk. It is computed as the ratio between the average return \(\hat{R}_{EP}\) and the deviation between the average return and the average tail \(\hat{\phi }_{EP}\) [20], that is, \(ICV@R := \frac{\hat{R}_{EP}}{\hat{R}_{EP}- \hat{\phi }_{EP}}.\)
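As a minimal sketch of how this index could be computed from a series of ex-post returns, consider the following. The function name `icvar`, the tail level `alpha`, and the empirical tail estimator (the mean of the worst `alpha` fraction of observations) are assumptions made for illustration; the estimator of \(\hat{\phi }_{EP}\) used in the experiments may differ.

```python
import numpy as np


def icvar(returns, alpha=0.05):
    """Ratio of the average return to (average return - average tail).

    Assumption: the average tail is estimated as the mean of the worst
    ceil(alpha * n) observed returns.
    """
    r = np.sort(np.asarray(returns, dtype=float))  # ascending: worst returns first
    if r.size == 0:
        raise ValueError("returns must be non-empty")
    k = max(1, int(np.ceil(alpha * r.size)))       # number of tail observations
    avg_return = r.mean()                          # plays the role of R_EP
    avg_tail = r[:k].mean()                        # plays the role of phi_EP
    gap = avg_return - avg_tail
    if gap <= 0:
        raise ValueError("index undefined: average return equals average tail")
    return avg_return / gap


if __name__ == "__main__":
    # Hypothetical monthly returns, for illustration only.
    monthly = [0.02, -0.10, 0.03, 0.01, 0.04, -0.02, 0.05, 0.03, 0.02, 0.02]
    print(icvar(monthly, alpha=0.2))  # mean 0.01, tail -0.06, ratio about 0.1429
```

Because the denominator is the gap between the mean and the tail average, the index is undefined for a constant return series, which the sketch reports as an error rather than returning an arbitrary value.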

Figure 7 depicts the values of the ICV@R index for various values of \(\varDelta\) and \(\gamma\). We see that the ICV@R function is in most cases monotonically non-decreasing for \(\varDelta \le 0.325\) regardless of the value of \(\gamma\), which suggests again that in order to have better out-of-sample performance one should use higher (but not too high) \(\varDelta\) values. Moreover, we see again that values of \(\varDelta\) around 0.25–0.325 yield the highest values of ICV@R (and thus the best portfolios according to this criterion) regardless of the value of \(\gamma\). The figure also shows that the equal-weight portfolio is dominated by our distributionally robust approach for all cases with \(\varDelta \in [0.125,0.325]\).

Fig. 7 Out-of-sample ICV@R and \(\varDelta\) for each \(\gamma\)

5.5 Testing procedure

As a final step in our case study, we applied the algorithm to testing (hold-out) data using a suitable robustness level \(\varDelta =0.3\), chosen according to the out-of-sample performance in the validation procedure (see Fig. 6). The testing data comprises the period from May 2019 to April 2020. The choice of the testing period was driven by two goals: first, we wanted to leave some space between the validation data and the testing data to avoid any contamination; second, since we wanted to fully test the capacity of the model to react to adverse situations, we deliberately chose a period when the market suffered huge losses, as was the case in March 2020, which coincided with the outbreak of the COVID-19 pandemic. The risk aversion parameter was set to \(\gamma =0.07\), a value comparable to the equal-weight portfolio risk. Figure 8 depicts the optimal allocation strategy found by the algorithm for the model with \(\varDelta =0.3\). We see that the model invested in a mix of Health (“Hlth”) and risk-free assets for most of the year, except at the beginning and at the end of this testing period, where it anticipated a potential downturn and consequently moved the allocation to 100% risk-free assets.

Fig. 8 Out-of-sample May-2019 to April-2020 allocation for HMM-DRO policy with \(\varDelta =0.3\)

Figure 9 depicts two pieces of information related to the testing period. In the figure, the labels on the horizontal axis correspond to the situation at the end of each month. The shaded areas indicate the posterior probability of each Markov state as in (1), following the left vertical axis. We see that for most of the testing period the HMM classified the market overwhelmingly as a mix of “regular” and “bull” states, until March 2020, when it switched the classification to a “bear” state as the COVID-19 crisis expanded worldwide. Such behavior suggests that the HMM was effective in learning the states of the market directly from the data.

Fig. 9 Out-of-sample May-2019 to April-2020 performance comparison among policies and states of the HMM

The other piece of information displayed in Fig. 9 is a comparison among a number of policies: it shows the accumulated monthly return (following the right vertical axis) during the testing period for (1) the HMM-based policy with \(\varDelta =0\); (2) the HMM-DRO-based policy with \(\varDelta =0.3\); (3) the policy that is obtained using an SDDP model with no HMM; and (4) the passive equal-weight strategy. We see that the strategy given by our HMM-DRO approach outperformed the equal-weight strategy for most of the year, except for the last month, when the equal-weight strategy recovered more quickly from the downturn in March and benefited from the market rebound in April 2020. Moreover, the HMM-DRO policy outperformed the pure-HMM and the no-HMM ones. We see that although the pure-HMM and no-HMM policies performed reasonably well up to the point where the crisis started, after that point both policies suffered huge losses and never recovered. In contrast, the robust policy with \(\varDelta =0.3\) was able to weather the effects of the crisis much better. The graph also shows the merits of the HMM approach: indeed, the no-HMM policy had the worst performance of all for most of the period. Overall, the analysis suggests that, by using the HMM to learn about the different states of the market and by taking into account the estimation errors of the HMM, the proposed approach can indeed yield competitive performance during normal times and provide better protection during downturns.

6 Conclusions

The evolution of computing power, new theoretical results, and the development of specialized software tools have made stochastic dynamic optimization models widely applicable in recent years. In our opinion, however, the increase in the utilization of such models has not been accompanied by a similar development in the treatment of data. For most applications reported in the literature, some form for the underlying stochastic process \(\{\varvec{\xi }_t\}\) is assumed (after some study of available data), the problem is solved, and the optimal solution given by the model is implemented. Our goal in this paper is to bring this practice closer to a new reality of data-driven problems, where information can be inferred automatically from the data via some machine learning technique, and an independent validation procedure is applied in order to evaluate the decisions yielded by the model.

To accomplish our goal, we have presented a framework for data-driven distributionally robust dynamic decision models with a particular structure that is applicable in many different contexts. Our approach combines a Hidden Markov Model (HMM) as the predictive engine with a dynamic Distributionally Robust Optimization (DRO) model as the prescriptive methodology. Notwithstanding the flexibility of the HMM to approximately capture the dynamics of a variety of stochastic processes, it is subject to estimation errors as well as model misspecification. Therefore, a distributionally robust dynamic optimization model is a suitable choice to embody the uncertainty dynamics represented by the HMM and at the same time to robustify decisions against the uncertainty over the HMM parameters. We have provided a tractable reformulation of the optimization problem and shown that we can adapt the well-known Stochastic Dual Dynamic Programming (SDDP) algorithm to solve the proposed model. Along the way, we have developed a deterministic lower bound (for a maximization problem), which, although related to recent literature, is a novel result that can be generalized to other multi-stage problems. Moreover, the bound has practical appeal, since it allows for evaluations of simple user-defined policies to improve computational tractability and solution efficiency, especially when taken together with the deterministic upper bound provided by SDDP.

For a fixed robustness level, we have presented an evaluation framework to assess the out-of-sample performance of the optimal policies yielded by the model in a rolling horizon scheme. We have also introduced an acceleration scheme in case of computationally intensive problems, which is applicable when the problem structure does not depend on the time period. A robustness tuning procedure was proposed as a series of out-of-sample evaluation steps, whereby the robustness level with the best out-of-sample performance is selected. We have illustrated the power and flexibility of the proposed data-driven prescriptive analytics framework with a complete case study on dynamic asset allocation. The numerical results show superior out-of-sample performance against selected benchmarks on a hold-out testing dataset. The case study reiterates the practical importance and applicability of the proposed framework since it emulates the actual decision process of a dynamic asset allocation problem, extracting valuable information from data to obtain robust decisions with an empirical certificate of suitable out-of-sample performance.

While the case study focuses on one type of problem (dynamic asset allocation), we believe that the framework presented in the paper can be useful in other contexts as well. For example, in long-term energy planning—a type of problem for which SDDP has been extensively used—the stochastic input process (e.g., water inflows, solar radiation) could be inferred directly from the data, using machine learning techniques as described in this paper. We believe that the presented work raises important questions for future development on the integration of machine learning methods and dynamic optimization under uncertainty, and hope it will stimulate further research in this area.