1 Introduction

In stochastic programming, we typically need to discretize the distribution of uncertain parameters. In a two-stage setting, this means generating a set of scenarios that approximates an assumed ‘true’ distribution. In this paper, we consider the situation where the ‘true’ distribution is given by a set of historical data and the scenarios must be its subset. In other words, we want to form the scenarios by selecting points from the data set in such a way that the selection, with associated probabilities, is a good approximation of the empirical distribution of the whole data set.

The main motivation for this study is stochastic versions of the TIMES (Loulou et al. 2016; Loulou and Lettila 2016) and EMPIRE (Skar et al. 2016) energy-system models. There, we may have many years of hourly data for measurements of different types (such as wind-turbine and PV production) and from different locations/regions. The optimization model then requires a number of scenarios, each including values for one day, i.e., a sequence of 24 values for each measurement. While it is theoretically possible to model these sequences using some multi-variate stochastic process, it might not be easy to find a process that correctly captures all spatial and temporal dependencies between the measurements. Using actual observations, on the other hand, guarantees that values within each scenario are consistent and realistic, since they actually happened (Seljom and Tomasgard 2015). The same reasoning also applies when each observation is just a single value (not a sequence), in cases with complicated non-linear dependencies between the different parameters.

Another possible situation where we want to generate scenarios by selection is when the model’s users provide their own data set. In such a case, the users may have more confidence in the model and its results if they know that it is using their actual data, instead of ‘synthetic’ values.

Note that the assumption that the data set represents the ‘true’ distribution excludes situations where we need to generate scenarios for conditional distributions, such as operational models where we want scenarios for now, given the latest data, or scenarios for distributions within a multi-stage scenario tree, conditional on values in the preceding nodes of the tree. These situations require a different approach and are therefore left for future research.

The requirement of only using existing values limits the selection of scenario-generation methods, as some result in ‘synthetic’ values. However, the following methods are applicable, either directly or with a slight modification:

  • sampling from the data set. By itself, this is not a suitable method for small scenario sets, as the sample distribution can be, by chance, very different from the whole data set. However, Seljom and Tomasgard (2015) show how to improve the match by sampling multiple scenario sets and then choosing the one whose distribution is closest to the data. They use the sum of differences in the first four marginal moments, as well as correlations, as their distance measure.

  • k-means clustering (MacQueen 1967; Lloyd 1982), where we divide the data set into a specified number of clusters. In the scenario-generation context, one would then normally use the clusters’ means as scenarios (e.g., Munoz and Watson 2015). While this would result in synthetic values, we can easily avoid this by selecting the data points closest to the clusters’ means instead.

  • scenario-reduction methods (Dupačová et al. 2003; Heitsch and Römisch 2003), since the data can be interpreted as a large scenario set. Note that Rujeerapaiboon et al. (2018) call the data-selection approach ‘discrete scenario reduction problem’, which they contrast to ‘continuous scenario reduction problem’ where the values can be chosen freely.

In this paper, we present several alternative methods that fit into the described framework:

  • a new MIP formulation of the moment-matching problem, i.e., minimizing the difference in moments and correlations between the scenarios and the data.

  • a modification of the Wasserstein-distance-minimizing heuristic from Pflug and Pichler (2015), adjusted for the case of data selection.

  • a variant of the sampling-based approach from Seljom and Tomasgard (2015), using the Wasserstein distance as the metric.

For the sake of clarity, we describe and test all the methods on a single-period case, i.e., the case where data points consist of a single value for each parameter.

The rest of the paper is organized as follows: we start by formulating the one-period problem in Sect. 2, followed by presentation of the scenario-generation methods in Sect. 3. In Sect. 4, we test the methods on a portfolio-optimization problem with CVaR as a risk measure. Finally, Sect. 5 discusses extending the methods to the multi-period case, i.e., the case where data consist of sequences of values for each parameter.

2 One-period selection problem

The scenario-selection problem can be summarized as follows: we have a data set \(\mathcal {D}\) containing N data points for P parameters, i.e., \(D_n \in \mathbb {R}^P\) for \(n \in \mathcal {N} = \{1,\dots ,N\}\). From this set, we want to select S points such that the empirical distribution of the subset is as close to the empirical distribution of the whole set as possible. This raises two questions: how to measure the distance between distributions, and how to find a subset that minimizes the chosen metric.

For the distance measure, a natural choice seems to be the Kolmogorov–Smirnov statistic, i.e., the supremum of the absolute distance between the two distribution functions. Unfortunately, this distance does not scale well for multi-variate data, so it is not suitable for our purpose. Another choice is the Wasserstein (or Kantorovich–Rubinstein) metric, also known as the ‘earth mover’s distance’ in computer science. This metric has a known connection to scenario generation and stochastic programming, see for example Pflug (2001) and Pflug and Pichler (2011, 2014). Finally, we can follow Seljom and Tomasgard (2015) and measure the distance in terms of differences in marginal moments and correlations. This approach also has an established connection to scenario generation, starting with Høyland and Wallace (2001).

Once we have chosen a measure, we have to find a method for identifying a subset that minimizes it. One option is the aforementioned ‘sample and evaluate’ approach from Seljom and Tomasgard (2015), i.e., to randomly select a large number of candidate subsets, evaluate the given measure (they use moments and correlations) on all of them, and then select the one that minimizes it. While this approach works, it provides only statistical guarantees about the quality of the identified subset. Indeed, if there are only a few subsets that are significantly better than the rest, they might not be discovered by this approach. Moreover, random selection implies equiprobable scenarios, which limits the achievable match.

For this reason, we propose to use optimization, in particular mixed-integer linear programming (MIP), for the task. This is applicable both to the Wasserstein distance and, perhaps surprisingly, to the moment-based approach, since we are selecting from a set of known values. For the same reason, it is possible to have the output probabilities as variables in both MIP models, which should help achieve a better match. This is a potential advantage over the ‘sample and evaluate’ approach.

In addition, we will test the already-mentioned k-means algorithm, where we split the data set into S clusters and then from each cluster select the data point that is closest to the cluster’s mean. We use the relative cluster sizes as probabilities assigned to the selected data points.

3 Optimization methods for the one-period case

The optimization models share the binary selection variables \(x_n\) and the selection probabilities \(p_n\), with the following requirements:

$$\begin{aligned} \sum _n x_n&= S. \end{aligned}$$
(1)
$$\begin{aligned} p_{n}&\le P^\mathrm {max}\, x_n \qquad&n\in \mathcal {N} \end{aligned}$$
(2)
$$\begin{aligned} p_{n}&\ge P^\mathrm {min}\, x_n \qquad&n\in \mathcal {N} \end{aligned}$$
(3)
$$\begin{aligned} \sum _n p_{n}&= 1 \end{aligned}$$
(4)

Equation (1) ensures that we select exactly S data points, while the remaining constraints distribute the probabilities only among the selected data points. Note that the lower bound \(P^\mathrm {min}\) is required to avoid zero probabilities of selected points, which would in effect mean fewer than S scenarios. On the other hand, \(P^\mathrm {max}\) is optional and serves to enforce a more even distribution of probabilities. One possibility is to introduce a new parameter \(\lambda \) and define

$$\begin{aligned} P^\mathrm {min}&= \frac{1}{\sqrt{\lambda }\,S} \,, \qquad P^\mathrm {max}= \frac{\sqrt{\lambda }}{S} \,. \end{aligned}$$
(5)

This guarantees that the highest probability is at most \(\lambda \) times the smallest one, independently of S: for example, with \(\lambda = 4\) and \(S = 10\), we get \(P^\mathrm {min}= 0.05\) and \(P^\mathrm {max}= 0.2\).

If we instead want equiprobable scenarios, we can simply replace (2)–(4) with

$$\begin{aligned} p_{n}&= \frac{1}{S}\, x_{n} \qquad n\in \mathcal {N} \,. \end{aligned}$$
(6)

In this case, it is possible to substitute \(p_{n}\) out of the model.

It is important to realize that this basic structure includes N binary variables, one for each data point we can select. This limits the size of the data sets to which the approach is applicable.
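As an illustration, the common structure (1)–(4) could be set up in Pyomo (the modelling language used for our implementation) roughly as follows. This is a hedged sketch, not the actual code from the repository; the function name and parameters are ours, and the bounds come from (5).

```python
# Minimal Pyomo sketch of the common selection structure (1)-(4).
# All names (selection_base, N, S, lam) are illustrative.
import pyomo.environ as pyo

def selection_base(N, S, lam=10.0):
    m = pyo.ConcreteModel()
    m.N = pyo.RangeSet(1, N)
    p_min, p_max = 1.0 / (lam**0.5 * S), lam**0.5 / S    # bounds from (5)
    m.x = pyo.Var(m.N, within=pyo.Binary)                # selection indicators
    m.p = pyo.Var(m.N, bounds=(0.0, 1.0))                # selection probabilities
    m.card = pyo.Constraint(expr=sum(m.x[n] for n in m.N) == S)               # (1)
    m.p_ub = pyo.Constraint(m.N, rule=lambda m, n: m.p[n] <= p_max * m.x[n])  # (2)
    m.p_lb = pyo.Constraint(m.N, rule=lambda m, n: m.p[n] >= p_min * m.x[n])  # (3)
    m.p_sum = pyo.Constraint(expr=sum(m.p[n] for n in m.N) == 1.0)            # (4)
    return m
```

For equiprobable scenarios, constraints (2)–(4) would instead be replaced by the single equation (6).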

3.1 Minimizing the Wasserstein distance

The scenario-selection problem can be seen as distributing probabilities from the original empirical distribution to only the S selected points. To minimize the Wasserstein distance, we want to do this while minimizing the amount of moved probability, multiplied by the move distances. For this, we only need to add the following to (1)–(4) or (1), (6):

$$\begin{aligned}&\text {minimize } \sum _{i j} \Vert D_{i} - D_{j}\Vert ^r \cdot \pi _{i j} \end{aligned}$$
(7)
$$\begin{aligned}&\sum _{j} \pi _{i j} = P_i \qquad i\in \mathcal {N} \end{aligned}$$
(8)
$$\begin{aligned}&\sum _{i} \pi _{i j} = p_{j} \qquad j\in \mathcal {N} \,, \end{aligned}$$
(9)

where \(P_i\) is the probability assigned to data point i (usually 1/N), \(\pi _{i j}\) is the probability moved from i to j, \(\Vert D_{i} - D_{j}\Vert \) is an appropriate metric, and r is the order of the Wasserstein distance. Note that the distances can be pre-computed, so they are not limited to linear formulas. Also note that (4) may be dropped, as it is implied by (8) and (9). The optimization problem (1)–(4), (7)–(9) is referred to as the linear transportation problem in Heitsch and Römisch (2003) and as optimal quantization in Pflug and Pichler (2015) and Löhndorf (2016), suggesting its relation to different fields of science.
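The objective (7) and constraints (8)–(9) could be added on top of the base structure sketched above, for instance along the following lines. Again, this is an illustrative sketch: `dist` is assumed to hold the pre-computed values \(\Vert D_i - D_j\Vert ^r\), indexed consistently with the model, and `P` the input probabilities.

```python
# Hedged sketch: Wasserstein objective (7) and flow constraints (8)-(9)
# added to the base selection model; `dist[i, j]` and `P[i]` are assumed inputs.
import pyomo.environ as pyo

def add_wasserstein(m, dist, P):
    m.flow = pyo.Var(m.N, m.N, within=pyo.NonNegativeReals)   # pi_{ij}: moved probability
    m.obj = pyo.Objective(
        expr=sum(dist[i, j] * m.flow[i, j] for i in m.N for j in m.N),
        sense=pyo.minimize)                                                      # (7)
    m.out_flow = pyo.Constraint(
        m.N, rule=lambda m, i: sum(m.flow[i, j] for j in m.N) == P[i])           # (8)
    m.in_flow = pyo.Constraint(
        m.N, rule=lambda m, j: sum(m.flow[i, j] for i in m.N) == m.p[j])         # (9)
    return m
```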

While this model is simple to implement, it does not scale well because of the added variables \(\pi _{i j}\): the model has N binary and \(N^2 + N\) continuous variables, so it becomes intractable beyond approximately a thousand data points.

For this reason, one usually resorts to a heuristic that solves the problem approximately. There are two common approaches: the one discussed here uses a variant of Lloyd’s algorithm (Lloyd 1982), while the second leads to the scenario-reduction techniques discussed in Sect. 3.5.

In this section, we present a modified version of Algorithm 2 from Pflug and Pichler (2015), which can be summarized as follows:

  0. initialize with selection set \(\mathcal {Z} ^0 = \{z^0_1,\dots ,z^0_S\}\), \(D^0 = \infty \), and \(k=1\).

  1. find the Voronoi partition generated by \(\mathcal {Z} ^k\), i.e., sets \(\mathcal {V} ^k(z^k_s)\) for each \(s \in \{1,\dots ,S\}\).

  2. compute the Wasserstein distance \(D^k\) corresponding to \(\mathcal {Z} ^k\) and partitions \(\mathcal {V} ^k(z^k_s)\).

  3. stop if \(D^{k} \ge D^{k-1}\) (no more improvement).

  4. set \(\mathcal {Z} ^{k+1} = \{z^{k+1}_1,\dots ,z^{k+1}_S\}\), where \(z^{k+1}_s\) is the ‘centre of order r’ of partition \(\mathcal {V} ^k(z^k_s)\).

  5. set \(k=k+1\) and go to Step 1.

Since we operate on a discrete set of points, the Voronoi partitions consist of the original data points \(D_i\), \(i \in \mathcal {N} \), where each \(D_i\) is assigned to the partition of its closest \(z^k_s\). Step 1 of the algorithm therefore becomes:

$$\begin{aligned} D_i \in \mathcal {V} ^k\left( z_s^k\right) \text { if } s = \mathop {{\mathrm{arg}}\,{\mathrm{min}}}\limits _{s \in \mathcal {S} ^k} \Vert D_i - z^k_s\Vert \end{aligned}$$

and the Wasserstein distance in Step 2 becomes

$$\begin{aligned} D^k = \sum _s \sum _{D_i \in V^k\left( z^k_s\right) } P_i\, \Vert D_i - z^k_s\Vert ^r \,. \end{aligned}$$

Step 4, on the other hand, involves integration even in the discrete case. An important exception is the case where \(\Vert .\Vert \) is the Euclidean metric and \(r=2\): there, the partition centres coincide with the conditional means, which are easily computed in the discrete case. The algorithm then becomes the standard Lloyd’s algorithm for the k-means problem, which we treat separately in Sect. 3.3.

Here, we take another approach, where we also require the points \(z^k_s\) to be selected from the data set, i.e., \(z^k_s = D_{i^k_s}\) for some \(i^k_s \in \mathcal {N} \). In this case, Step 4 simplifies to

$$\begin{aligned} z_s^{k+1} = D_{i^{k+1}_s} = {\mathrm{arg}}\,{\mathrm{min}}_{D_i \in \mathcal {V} ^k\left( z^k_s\right) }\, \sum _{j \in \mathcal {V} ^k\left( z^k_s\right) } \Vert D_i - D_j\Vert ^r \,, \end{aligned}$$

making the whole algorithm easily computable. Moreover, all computed distances are between the original data points, so they can be pre-computed, speeding up the algorithm. When the algorithm stops, \(\mathcal {Z} ^k\) becomes the scenario set, with probabilities

$$\begin{aligned} P_{z^k_s} = \sum _{D_i \in \mathcal {V} ^k\left( z^k_s\right) } P_{D_i}, \end{aligned}$$

equal to \(|\mathcal {V} ^k(z^k_s)|\, / N\) if the data are equiprobable.

As for the initialization in Step 0, Pflug and Pichler (2015) recommend generating the initial set using Algorithm 1 from the same paper. However, in our case, this algorithm is slow compared to the rest. Therefore, we run Algorithm 2 from multiple randomly generated initial sets and use the selection with the smallest Wasserstein distance. This is significantly faster than using Algorithm 1 and our testing shows no difference in the final distance.

Note that the heuristic, unlike the MIP formulation, does not provide any control over the resulting probabilities. Hence, should our optimization problem require equiprobable scenarios, we would simply have to disregard the sizes of the Voronoi sets and assign the same probability to all the resulting scenarios. This could have a negative impact on the quality of the approximation, since an outlier data point could end up in a set of its own and then get the same probability as the other scenarios.
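For concreteness, one run of the heuristic could be sketched in NumPy as follows, assuming the Euclidean metric and equiprobable data points; the names are ours and this is not the paper’s actual implementation, which additionally restarts the procedure from several random initial selections and keeps the best result.

```python
# Hedged NumPy sketch of the selection heuristic of Sect. 3.1 (one start).
import numpy as np

def wd_select(D, S, r=2, rng=np.random.default_rng(0)):
    N = len(D)
    # pre-computed distances ||D_i - D_j||^r between all data points
    dist = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=-1) ** r
    sel = rng.choice(N, size=S, replace=False)        # Step 0: random initial selection
    best = np.inf
    while True:
        assign = dist[:, sel].argmin(axis=1)          # Step 1: Voronoi partition
        wd = dist[np.arange(N), sel[assign]].mean()   # Step 2: distance, with P_i = 1/N
        if wd >= best:                                # Step 3: no improvement -> stop
            break
        best = wd
        for s in range(S):                            # Step 4: 'centre' = medoid of each partition
            members = np.where(assign == s)[0]
            if members.size == 0:
                continue
            sel[s] = members[dist[np.ix_(members, members)].sum(axis=1).argmin()]
    probs = np.bincount(assign, minlength=S) / N      # scenario probabilities
    return sel, probs
```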

3.2 Minimizing difference in moments and correlations

Following Høyland and Wallace (2001), we use the first four moments and correlations, with the following definitions:

$$\begin{aligned}&\text {mean} \quad \mu _p \quad = \mathrm {E}\left[ D_{*,p}\right] \end{aligned}$$
(10)
$$\begin{aligned}&\text {variance} \quad \sigma _p^2 \quad = \mathrm {E}\left[ \left( D_{*,p} - \mu _p\right) ^2\right] \end{aligned}$$
(11)
$$\begin{aligned}&\text {skewness} \quad \gamma _p \quad = \mathrm {E}\left[ \left( \frac{D_{*,p} - \mu _p}{\sigma _p}\right) ^3\right] = \mathrm {E}\left[ \left( D_{*,p} - \mu _p\right) ^3\right] / \sigma _p^3 \end{aligned}$$
(12)
$$\begin{aligned}&\text {kurtosis} \quad \kappa _p \quad = \mathrm {E}\left[ \left( \frac{D_{*,p} - \mu _p}{\sigma _p}\right) ^4\right] = \mathrm {E}\left[ \left( D_{*,p} - \mu _p\right) ^4\right] / \sigma _p^4 \end{aligned}$$
(13)
$$\begin{aligned}&\text {correlation} \quad \rho _{pq} \quad = \mathrm {E}\left[ \left( D_{*,p} - \mu _p\right) \left( D_{*,q} - \mu _q\right) \right] / \left( \sigma _p \, \sigma _q\right) \end{aligned}$$
(14)

where \(D_{*,p}\) denotes the p-th component of all data points, i.e., all data for parameter \(p \in \mathcal {P} = \{1,\dots ,P\}\).

Formulas (12)–(14) are nonlinear due to the scaling by \(\sigma _p\), so we use unscaled versions instead. Note that this means that if the sample variance of parameter p differs from \(\sigma _p^2\), then the scaled versions will not be calculated correctly.

Since we have a discrete distribution, expected values simplify to sums over \(n \in \mathcal {N} \), with probabilities \(p_{n}\) from either (1)–(4) or (1), (6). In particular, the scenario-based expected value of the m-th power of parameter \(p \in \mathcal {P} \) is

$$\begin{aligned} r_{p,m} = \mathrm {E}\left[ D_{*,p}^m\right]&= \sum _n p_{n}\, D_{np}^m \end{aligned}$$
(15)

and, using the binomial expansion, the m-th central moment becomes

$$\begin{aligned} c_{p,m} = \mathrm {E}\bigl [\left( D_{*,p} - \mu _p\right) ^m\bigr ] = \sum _{k=0}^m \left( {\begin{array}{c}m\\ k\end{array}}\right) (-1)^{m-k}\, \mathrm {E}\bigl [D_{*,p}^k\bigr ] \mu _p^{m-k} \,. \end{aligned}$$
(16)

The first equation is linear in \(p_{n}\), since \(D_n\) is data. To make the second equation linear in \(\mathrm {E}\bigl [D_{*,p}^k\bigr ]\), and hence also in \(p_{n}\), we need \(\mu _p\) to be a constant. In other words, we have to replace the actual sample mean by its target value. Hence, the computed moments will be exact only if the sample mean is exactly equal to \(\mu _p\). Alternatively, we can avoid the approximation by matching the expected powers \(\mathrm {E}\bigl [D_{*,p}^m\bigr ]\) directly.
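As a small illustration of (15)–(16), the central moments stay linear in the probabilities once the target mean is treated as a constant. The helper below is our own sketch, not part of the optimization model itself.

```python
# Sketch of (15)-(16): m-th central moment of one parameter, linear in p
# because the *target* mean mu is used instead of the sample mean.
import numpy as np
from math import comb

def central_moment(Dp, p, mu, m):
    """Dp: data for one parameter (1-D array); p: probabilities; mu: target mean."""
    r = [np.sum(p * Dp**k) for k in range(m + 1)]   # raw moments E[D^k], eq. (15)
    return sum(comb(m, k) * (-1)**(m - k) * r[k] * mu**(m - k)
               for k in range(m + 1))               # binomial expansion, eq. (16)
```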

For correlations, we use the fact that

$$\begin{aligned} \mathrm {E}\left[ \left( D_{*,p} - \mu _p\right) (D_{*,q} - \mu _q)\right]&= \mathrm {E}\left[ D_{*,p}\,D_{*,q}\right] - \mu _p\,\mu _q \end{aligned}$$

and

$$\begin{aligned} \mathrm {E}\left[ D_{*,p}\,D_{*,q}\right]&= \sum _n p_{n}\, D_{np}\,D_{nq} \,, \end{aligned}$$

where the second equation is linear in \(p_{n}\) since D is data.

Just like for the moments, we have the choice of matching the correlations, with \(\mu _p\), \(\mu _q\), \(\sigma _p\), and \(\sigma _q\) replaced by their target values, or matching only \(\mathrm {E}\left[ D_{*,p}\,D_{*,q}\right] \).

Putting it all together, the complete model for the moment-based approach, in addition to (1)–(4) or (1), (6), is:

$$\begin{aligned} \text {minimize } \quad \sum _m W_m\, \sum _p \left( d_{pm}^+ + d_{pm}^-\right) + \sum _{p<q} W_{pq} \left( \delta _{pq}^+ + \delta _{pq}^-\right) \end{aligned}$$
(17)

subject to:

$$\begin{aligned} r_{pm}&= \sum _n p_{n}\, D_{pn}^m\ \quad p\in \mathcal {P} \!,\, m\in \mathcal {M} \end{aligned}$$
(18)
$$\begin{aligned} c_{pm}&= \sum _{k=0}^{m} \left( {\begin{array}{c}m\\ k\end{array}}\right) (-1)^{m-k}\, r_{pk}\, \mu _p^{m-k} \qquad p\in \mathcal {P} \!,\, m\in \mathcal {M} \end{aligned}$$
(19)
$$\begin{aligned} d_{pm}^+&\ge c_{pm} / \sigma _p^m - M_{pm} \qquad p\in \mathcal {P} \!,\, m\in \mathcal {M} \end{aligned}$$
(20)
$$\begin{aligned} d_{pm}^-&\ge M_{pm} - c_{pm} / \sigma _p^m \qquad p\in \mathcal {P} \!,\, m\in \mathcal {M} \end{aligned}$$
(21)
$$\begin{aligned} r_{pq}&= \sum _n p_{n}\, D_{np}\,D_{nq} \qquad p,q\in \mathcal {P} \!,\, p\!<\!q \end{aligned}$$
(22)
$$\begin{aligned} s_{pq}&= \left( r_{pq} - \mu _p\,\mu _q\right) / \left( \sigma _p\,\sigma _q\right) \qquad p,q\in \mathcal {P} \!,\, p\!<\!q \end{aligned}$$
(23)
$$\begin{aligned} \delta _{pq}^+&\ge s_{pq} - \Sigma _{pq} \qquad p,q\in \mathcal {P} \!,\, p\!<\!q \end{aligned}$$
(24)
$$\begin{aligned} \delta _{pq}^-&\ge \Sigma _{pq} - s_{pq} \qquad p,q\in \mathcal {P} \!,\, p\!<\!q \end{aligned}$$
(25)

where \(\mathcal {M} = \{1,2,3,4\}\) is the set of considered moments, \(M_{pm}\) are the target values of the central moments of parameter \(p \in \mathcal {P} \), \(\mu _p = M_{p1}\) and \(\sigma _p=\sqrt{M_{p2}}\). In addition, we define \(r_{p0} = 1\) in (19). Finally, \(\Sigma _{pq}\) is the target value of the correlation between parameters p and q.

Moreover, \(W_m\) and \(W_{pq}\) are weights for the distances in moments and correlations, respectively. Since we usually expect the sensitivity of an optimization problem to decrease with the order of the moment of the input data (see, e.g., Chopra and Ziemba 1993), it is natural to use a decreasing sequence of weights. Another argument for higher weights on the mean and variance is that, as we have already seen, a mismatch in those moments implies errors in the evaluation of the higher moments, as well as the correlations. In our test case, we have used \(W = \{10,5,2,1\}\) and \(W_{pq} = 3\) for all p, q.

Note that any of \(r_{pm}\), \(c_{pm}\), \(r_{pq}\), and \(s_{pq}\) can be easily substituted out of the model, reducing the number of variables but creating a denser LP matrix. Our testing with FICO™ Xpress indicates that this has some impact on solution times, but the exact choice of what to substitute is likely to differ between solvers and solution algorithms, so we do not present the results here.

3.3 k-means clustering

The k-means clustering algorithm (MacQueen 1967; Lloyd 1982) is used to divide a data set into a given number of clusters. As we have seen in Sect. 3.1, it is also a special case of the heuristic for minimizing the mass-transportation problem (1)–(4), (7)–(9), with the Euclidean distance and \(r=2\). Once the algorithm finishes, we need to select a representative scenario from each cluster. For this, we simply use the data point closest to the cluster’s mean. We then use the relative cluster sizes as the probabilities assigned to the scenarios.

The advantage of k-means is that it is a standard method with many available implementations and that the method is very fast. On the other hand, there is no guarantee that the resulting scenarios constitute a good approximation of the distribution.

Note that this method, just like the Wasserstein heuristic, does not provide control over the resulting cluster sizes and hence the scenario probabilities. However, there are alternative versions of the k-means method that provide some control over cluster sizes, such as the ‘constrained k-means clustering’ method from Bennett et al. (2000). The downside of this variant is longer run-time, compared to the standard k-means.
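A minimal sketch of the procedure using scikit-learn follows; the function name and default settings are illustrative, not taken from the paper’s code.

```python
# Illustrative k-means-based selection (Sect. 3.3): cluster, pick the data point
# closest to each centre, and use relative cluster sizes as probabilities.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_select(D, S, n_init=10, seed=0):
    km = KMeans(n_clusters=S, n_init=n_init, random_state=seed).fit(D)
    sel = np.array([np.linalg.norm(D - c, axis=1).argmin()
                    for c in km.cluster_centers_])       # closest data point per centre
    probs = np.bincount(km.labels_, minlength=S) / len(D)
    return sel, probs
```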

3.4 Sampling-based approaches

The ‘sample and evaluate’ approaches provide another way to generate scenarios. One obvious advantage is their simplicity: we randomly sample scenarios from the data, compute their distance from the target distribution, and keep track of the best scenario set. They are also flexible, as they work with any distance measure that we can implement and evaluate quickly enough.

For instance, this approach is well suited for use with moments and correlations, since these are both easy to implement and fast to evaluate (Seljom and Tomasgard 2015).

It is also possible to use it with the Wasserstein distance. With the scenario selection given by the sample, we can remove \(x_n\) and all \(j \in \mathcal {N} \setminus \mathcal {S} \) from (1)–(4), (7)–(9), which then simplifies to

$$\begin{aligned}&\text {minimize } \sum _{i \in \mathcal {N}, j \in S} \Vert D_{i} - D_{j}\Vert ^r \cdot \pi _{i j} \quad \quad (7') \\&\sum _{j \in \mathcal {S}} \pi _{i j} = P_i \qquad i\in \mathcal {N} \quad \quad \quad \qquad \quad \quad \quad (8') \\&\sum _{i \in \mathcal {N}} \pi _{i j} = p_{j} \qquad j\in \mathcal {S} \,,\quad \quad \qquad \quad \quad \quad (9') \\&p_{j} \ge P^\mathrm {min}\qquad j\in \mathcal {S} \quad \quad \quad \quad \qquad \qquad \quad (3') \end{aligned}$$

It turns out that this model can be solved analytically (Theorem 2 of Dupačová et al. 2003), by setting

$$\begin{aligned} \pi _{i j} = {\left\{ \begin{array}{ll} P_i &{} \text {if } j = {\mathrm{arg}}\,{\mathrm{min}}_{j' \in \mathcal {S}} \Vert D_{i} - D_{j'}\Vert \\ 0 &{} \text {otherwise} \,, \end{array}\right. } \end{aligned}$$

as long as we do not require (3’); that constraint is no longer needed to ensure \(p_{j} > 0\), since the construction guarantees \(p_{j} \ge \min _i\, P_i\). Since each \(\pi _{i j}\) is either zero or \(P_i\), this solution assigns every input point \(i\in \mathcal {N} \) to one of the selected points \(j\in \mathcal {S} \), in effect partitioning the data set into clusters. This provides a justification for Step 1 of the heuristic presented in Sect. 3.1.

Note that no such analytical solution exists if we require fixed probabilities. In that case, we would have no choice but to solve the LP (7’)–(9’), with \(p_{j} = 1/S\), in each iteration of the sampling algorithm.
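A possible NumPy sketch of the ‘sample and evaluate’ loop with the Wasserstein distance is given below, using the analytical nearest-point solution above and assuming equiprobable data (\(P_i = 1/N\)); all names are ours.

```python
# Hedged sketch of sampling-based selection with the Wasserstein distance (Sect. 3.4).
import numpy as np

def sample_wd(D, S, n_samples=500, r=2, rng=np.random.default_rng(0)):
    N = len(D)
    best_sel, best_probs, best_wd = None, None, np.inf
    for _ in range(n_samples):
        sel = rng.choice(N, size=S, replace=False)                     # candidate scenario set
        d = np.linalg.norm(D[:, None, :] - D[None, sel, :], axis=-1) ** r
        nearest = d.argmin(axis=1)                                     # analytical solution: nearest selected point
        wd = d[np.arange(N), nearest].mean()                           # distance (to the r-th power), P_i = 1/N
        if wd < best_wd:
            best_sel, best_wd = sel, wd
            best_probs = np.bincount(nearest, minlength=S) / N
    return best_sel, best_probs, best_wd
```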

3.5 Scenario-reduction techniques

Scenario-reduction techniques from Dupačová et al. (2003) and Heitsch and Römisch (2003) provide an alternative way to approximately solve the linear transportation problem (1)–(4), (7)–(9). They take advantage of the fact that the problem is easily solvable if we want to either select or remove a single point from the data, and come in two types of methods:

  • ‘reduction’ methods, where we start from the whole data set and remove one point at a time until we reach the required size.

  • ‘selection’ methods, where we start with an empty set and then add one data point at a time.

In both papers, testing shows that the selection methods generate better scenarios (smaller Wasserstein distance from the data set), but can be slow for large S. Heitsch and Römisch (2003) therefore recommend using the fast forward selection method as long as \(S \lessapprox N/4\). Since our tests will be well within this limit, we use this method (Algorithm 2.4 of Heitsch and Römisch 2003) in our tests.
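For illustration, a greedy forward selection in the spirit of these methods can be sketched as follows. This is a simplified NumPy version under our own naming, not the exact Algorithm 2.4.

```python
# Hedged sketch of greedy forward selection: repeatedly add the point that
# most reduces the total (r-th power) distance from all data to the selection.
import numpy as np

def forward_select(D, S, r=2):
    N = len(D)
    dist = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=-1) ** r
    sel = []
    min_d = np.full(N, np.inf)          # distance to the closest selected point so far
    for _ in range(S):
        # objective after adding candidate j: sum_i min(min_d[i], dist[i, j])
        scores = np.minimum(min_d[:, None], dist).sum(axis=0)
        j = int(np.argmin(scores))
        sel.append(j)
        min_d = np.minimum(min_d, dist[:, j])
    nearest = dist[:, sel].argmin(axis=1)
    probs = np.bincount(nearest, minlength=S) / N
    return np.array(sel), probs
```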

4 Test: portfolio-optimization problem with CVaR

In this section, we test all the approaches presented in the previous section on a simple portfolio-optimization problem, with CVaR as a risk measure (Rockafellar and Uryasev 2000; Uryasev 2000). We put CVaR into the objective, with risk weight \(W_R\), so the model is:

$$\begin{aligned}&\text {maximize } \left( 1 - W_R\right) \sum _{s \in \mathcal {S}} P_s\, y_s + W_R \, c \end{aligned}$$
(27)
$$\begin{aligned}&\sum _i x_i \le B \end{aligned}$$
(28)
$$\begin{aligned}&x_i \le W_I \, B\quad i \in \mathcal {I} \end{aligned}$$
(29)
$$\begin{aligned}&y_s = \sum _{i} x_i \, \left( 1 + R_{is}\right) - B \quad s \in \mathcal {S} \end{aligned}$$
(30)
$$\begin{aligned}&z_s \ge \zeta - y_s&s \in \mathcal {S} \end{aligned}$$
(31)
$$\begin{aligned}&c = \zeta - \frac{1}{\alpha }\sum _{s \in \mathcal {S}} P_s \, z_s \end{aligned}$$
(32)

There, \(x_i\) is the amount invested in instrument \(i \in \mathcal {I} \), limited by the budget B and by the maximum fraction \(W_I\) invested in any one instrument. \(R_{is}\) are the instrument returns in scenario \(s \in \mathcal {S} \), so \(y_s\) is the profit in scenario s. \(\zeta \) represents a threshold, equal to the value-at-risk (VaR) at the optimum. Therefore, \(z_s\) is the shortfall of the profit below VaR and c is the conditional value-at-risk (CVaR) with confidence level \(\alpha \). In our test, we use \(\alpha = 0.05\), \(B = 1000\), and \(W_I = 0.25\).
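A compact Pyomo sketch of the model (27)–(32) is shown below. The variable names mirror the text, but the function itself is illustrative and not the paper’s actual code; the non-negativity of \(z_s\), standard in the Rockafellar–Uryasev formulation, is an assumed domain restriction.

```python
# Hedged Pyomo sketch of the portfolio model (27)-(32).
import pyomo.environ as pyo

def cvar_portfolio(R, P, W_R=0.2, alpha=0.05, B=1000.0, W_I=0.25):
    """R[i][s]: return of instrument i in scenario s; P[s]: scenario probability."""
    I, S = range(len(R)), range(len(P))
    m = pyo.ConcreteModel()
    m.x = pyo.Var(I, within=pyo.NonNegativeReals)          # amount invested per instrument
    m.y = pyo.Var(S)                                       # profit per scenario
    m.z = pyo.Var(S, within=pyo.NonNegativeReals)          # shortfall below zeta (assumed z_s >= 0)
    m.zeta = pyo.Var()                                     # VaR at the optimum
    m.c = pyo.Var()                                        # CVaR
    m.obj = pyo.Objective(
        expr=(1 - W_R) * sum(P[s] * m.y[s] for s in S) + W_R * m.c,
        sense=pyo.maximize)                                                         # (27)
    m.budget = pyo.Constraint(expr=sum(m.x[i] for i in I) <= B)                     # (28)
    m.maxfrac = pyo.Constraint(I, rule=lambda m, i: m.x[i] <= W_I * B)              # (29)
    m.profit = pyo.Constraint(
        S, rule=lambda m, s: m.y[s] == sum(m.x[i] * (1 + R[i][s]) for i in I) - B)  # (30)
    m.short = pyo.Constraint(S, rule=lambda m, s: m.z[s] >= m.zeta - m.y[s])        # (31)
    m.cvar = pyo.Constraint(
        expr=m.c == m.zeta - (1 / alpha) * sum(P[s] * m.z[s] for s in S))           # (32)
    return m
```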

For our data, we use 5 years of daily close prices for the 25 largest stocks on the Oslo stock market, with data starting in September 2015. From this, we compute weekly price returns, resulting in 1248 data points. We consider three different problem dimensions, by using the first 10, 20, or 25 stocks, and test four different scenario-set sizes, \(S \in \{10, 20, 50, 100\}\).

Both Wasserstein-based approaches use the Euclidean norm in the calculation of distances, and \(r=2\). For the moment-based optimization model, we set the parameter \(\lambda \) from (5) to 10. Moreover, since Xpress struggles to close the optimality gap of the model (1)–(4), (17)–(25), we stop it after a specified time limit (see below).

The scenario-generation framework has been implemented in Python. The MIP models were written in Pyomo (Hart et al. 2011, 2017) and solved using the FICO™ Xpress solver. All source files are freely available from https://github.com/SINTEF/scengen-data-select.

We test the following scenario-generation methods and variants:

  • ‘WD heuristic’ is the heuristic described in Sect. 3.1. The method is initialized with 10 different random selections.

  • ‘opt. 300 s’ is the moment-based optimization with a time limit of 300 s.

  • ‘opt. 1800 s’ is the moment-based optimization with a time limit of 1800 s.

  • ‘sampling 1x’ uses just a single sample, to serve as a benchmark.

  • ‘sampl. MM 500x’ selects from 500 samples, using the moment-based distance.

  • ‘sampl. MM 10000x’ is the same with 10,000 samples.

  • ‘sampl. WD 500x’ selects from 500 samples, using the Wasserstein metric.

  • ‘sampl. WD 10000x’ is the same with 10,000 samples.

  • ‘k-means’ uses the Python implementation from scikit-learn (Pedregosa et al. 2011). This involves starting the method from 10 different random selections.

  • ‘scen. reduction’ uses our Python implementation of the fast forward selection method.

For each combination of scenario-generation method, dimension, and scenario-set size, we generate 25 trees, with the exception of the optimization and scenario-reduction methods, where we generate only one since all repetitions would be identical. This gives \((7 \times 25 + 3) \times 3 \times 4 = 2136\) scenario sets.

Fig. 1: Average scenario-generation times, in seconds

4.1 Generation times

The average generation times are reported in Fig. 1. We do not show the moment-based optimizations, because they run until their time limit, nor the sampling-based methods with 10,000 iterations, since these simply take \(20\times \) longer than 500 iterations. All reported times are from a laptop with a 2.8 GHz Intel® i7 processor and 16 GB of RAM.

We see that k-means is the fastest method, followed by the Wasserstein heuristic. However, it should be pointed out that the k-means method in scikit-learn is highly optimized, while our implementation of the Wasserstein heuristic is not. Of the sampling methods, the moment-based sampling method is significantly faster than the one using Wasserstein distance. Finally, scenario reduction is the slowest of the shown methods.

We can see that the runtime of k-means and of all Wasserstein-based methods, including scenario reduction, does not depend on the dimension. This is expected, since the dimension enters only into the calculation of distances and otherwise does not influence the algorithms. The generation time increases approximately linearly with the number of scenarios, i.e., the number of clusters in k-means and in the Wasserstein-distance heuristic.

The situation is reversed for the moment-based sampling, with generation times independent of the number of scenarios and increasing with the dimension. The latter is expected, as a higher dimension means more moments and correlations to evaluate and compare. Since the time to calculate moments and correlations must increase with the number of scenarios, the result suggests that this is not a significant factor in the overall run-time.

4.2 Out-of-sample evaluations

Next, we perform the out-of-sample evaluations of the trees, in the sense defined in Kaut and Wallace (2007). We do this for six different values of the risk weight, \(W_R \in \{0,0.1,\dots ,0.5\}\). For each \(W_R\), we solve the optimization problem (27)–(32) on all the generated trees. Next, we take the \(6 \times 2136 = 12\,816\) scenario-based solutions and evaluate them on the full data set, i.e., solve the problem using the data set as scenarios, with the investment variables x fixed to the values from those solutions. This gives us out-of-sample evaluations of the solutions, and hence a measure of the quality of the scenario sets they originated from.
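Schematically, one such evaluation could be performed as follows, reusing the cvar_portfolio() sketch above; the helper and the solver name are illustrative and not the paper’s actual evaluation code.

```python
# Hedged sketch of one out-of-sample evaluation: fix the first-stage investments
# from a scenario-based solution and re-solve on the full data set.
import pyomo.environ as pyo

def evaluate_solution(x_values, R_full, W_R):
    N = len(R_full[0])                                   # number of data points
    m = cvar_portfolio(R_full, [1.0 / N] * N, W_R=W_R)   # data set used as scenarios
    for i, v in enumerate(x_values):
        m.x[i].fix(v)                                    # keep the tested portfolio
    pyo.SolverFactory("xpress").solve(m)
    return pyo.value(m.obj)                              # out-of-sample objective value
```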

In addition, we solve the problem on the data set without any restrictions to get the ‘true’ optimal solution, for each value of \(W_R\).

4.2.1 Difference in objective value

Since we have, for each scenario set and \(W_R\), both the out-of-sample evaluation and the actual ‘true’ optimal objective value, we can subtract the two to get a measure of sub-optimality caused by the scenario set; following Pflug and Pichler (2011), we will refer to it as the discretization error, but it is also known as the approximation error (Pflug 2001), or policy error (Löhndorf 2016). Then we can compare distributions of the discretization errors across all the test cases, for scenario sets generated by different methods.

Fig. 2: Distribution of discretization errors across all the generated scenario sets. The inside boxes in each plot represent the inner quartiles, with the median denoted by the white dot. Moreover, the middle line shows the estimated range, limited to 1.5 times the inter-quartile range beyond the quartiles

Fig. 3: Zoomed-in version of Fig. 2

The results are presented in Figs. 2 and 3, where the former shows the whole distributions, while the latter zooms in for easier comparisons of medians and quartiles. We omit the sampling-based methods with 10,000 iterations, since there was no visible improvement compared to the variants with 500 iterations.

Unsurprisingly, using a random sample is the worst method. Comparing the two methods selecting from 500 samples, we can see that moments and correlations are a better selection criterion than the Wasserstein distance, for our optimization problem.

The Wasserstein-distance heuristic is slightly better than any of the sampling-based methods, but worse than both k-means and the moment-based optimization. Furthermore, for the optimization method, we see that giving the solver more time helps to shift the distribution downwards, even though the median remains unchanged.

Finally, scenario reduction is clearly the best method overall, with median discretization error almost half that of the next-best methods.

We have also tested splitting the test cases by dimension, number of scenarios, and values of the risk weight. From this, we could conclude that the above observations hold for most subsets. The biggest difference appeared when we considered only sets with 10 scenarios, where the moment-based optimization method with the 300-second limit performed almost as badly as random sampling. This is probably due to the fact that with 10 scenarios, it is difficult to get a good match in means and variances, causing errors in the evaluations of the other moments, as mentioned in Sect. 3.2.

On the other hand, the moment-based method with the 1800-second limit was better than scenario reduction for \(W_R \in \{0, 0.1\}\). For the risk-neutral case (\(W_R=0\)), its median discretization error is virtually zero, suggesting that it found the actual optimal value in (at least) 50% of cases.

Apart from that, we can observe that:

  • dimension matters: the discretization error is several times bigger for 25 assets, compared to 10;

  • more scenarios help: as expected, the discretization error decreases with the number of scenarios;

  • risk-aversion matters: the discretization error increases with \(W_R\).

Moreover, the nature of the optimization problem allows us to plot the solutions in a mean-risk plane and compare them to the ‘true’ efficient frontier. This way, we can get some additional insight into how the different methods work.

For example, comparing Figs. 4 and 5 shows that compared to k-means, the Wasserstein-based sampling results in portfolios with higher risk, i.e., with worse CVaR (solutions ‘to the left’ of the efficient frontier). This suggests that its scenarios do not represent the distributions’ tails as well as those based on k-means.

5 Extending the methods to multi-period sequences

So far, we have limited ourselves to the case of single-period data. However, as pointed out at the start of the paper, our motivation for the data-selection approach came from the case where we select whole sequences, such as hourly values for a day or a week. In this section, we show how to extend the methods from Sect. 3 to handle this.

The extension means that our data set \(\mathcal {D}\) now contains N data series for P parameters, with each series containing T values, i.e., \(D_{np} \in \mathbb {R}^T\) for each \(n \in \mathcal {N} \) and \(p \in \mathcal {P} \). However, these data can be interpreted as N data points for \(P \times T\) parameters, converting any temporal dependencies into spatial ones. This way, we can convert the problem into the previous one, at the cost of increasing the dimension by a factor of T. For the above-mentioned cases of hourly values for one day or week, this means a 24- or 168-fold increase in dimension, which could have significant implications for the scenario-generation methods.
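In code, this reinterpretation is just a reshape; a tiny illustration with made-up dimensions:

```python
# N series of T values for P parameters reinterpreted as N points in dimension P*T.
import numpy as np

N, P, T = 365, 3, 24                  # e.g., a year of daily profiles at hourly resolution
data = np.random.rand(N, P, T)        # placeholder for the actual measurements
flat = data.reshape(N, P * T)         # single-period data set with P*T 'parameters'
```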

Fortunately, both k-means and Wasserstein-distance-based methods are basically independent of dimension, as seen in Fig. 1. This is because they can operate on a pre-computed set of distances between the data points, so the dimension enters only into pre-processing.

For the methods based on moments and correlations, on the other hand, the number of matched quantities grows quadratically with the dimension because of the correlations. This will make the sampling with moment matching significantly slower, and the optimization model most likely unsolvable for most practical problems.

Fig. 4: Out-of-sample expected profit and CVaR for solutions based on scenarios generated using the moment-based ‘sampl. MM 500x’ method, with 25 assets. The point size corresponds to the size of the scenario sets, while colour denotes the risk-weight parameter \(W_R\). The crosses show the optimal solutions for all tested values of \(W_R\), so they lie on the mean-CVaR efficient frontier. Note the log-scale on the x-axis (CVaR)

Fig. 5: Version of Fig. 4 for the Wasserstein-based ‘sampl. WD 500x’ method

There is also another issue to consider: with such a significant increase in dimension, we cannot really expect to get a very good match, unless we increase the number of scenarios correspondingly. In many cases, there might be some natural way to aggregate the sequences in order to decrease the dimension and therefore improve the achievable match. For example, if we are selecting days from a data set with hourly inflows, matching the distribution of total daily inflows could be enough—while the fact that we use historical data would ensure that the intra-day profiles are realistic. Similarly, if we generate scenarios for wind- or solar-power capacity factor, then it could suffice to match the distribution of the daily average values (so we have a realistic distribution of good and bad days). We could also choose a middle way and do a partial aggregation, by using only a couple of values per day (day/night, peak/off-peak, etc).

In other words, while we can use the same methods as for the single-period case, they require some extra considerations. We will present a study describing the generation of multi-period scenarios for the TIMES energy-system model in a forthcoming paper.

6 Conclusions

In this paper, we have presented five different methods for selecting representative data points or sequences from historical data, in order to obtain a good approximation of the empirical distribution represented by the data. Such methods are needed in cases where the model users do not want to use synthetic values and therefore allow only historical data in scenarios for a given optimization model.

The methods range from the off-the-shelf k-means algorithm, through easy-to-implement scenario-reduction and iterative sampling methods with two different distance measures, to new optimization models and a Wasserstein-based heuristic.

We have then compared the methods using a simple portfolio-optimization model with CVaR as a risk measure. In our case, the ‘fast forward selection’ scenario-reduction method worked best, followed by k-means, which was the fastest. However, these results are likely to be problem-dependent. For example, the sampling-based methods generally produce equiprobable scenarios, so they cannot provide as good a match as methods with adjustable probabilities. On the other hand, k-means and the Wasserstein-based methods, including scenario reduction, cannot control the output probabilities: if we want equiprobable scenarios, we have to disregard the computed probabilities, which is likely to have an impact on the quality of the scenarios. Consequently, should we generate scenarios for an optimization model that requires equiprobable scenarios, such as the TIMES energy-system model, the sampling-based methods would lose their disadvantage while the other methods would not work optimally, so the conclusion could change.

The main contribution of the paper is therefore not the results of the particular test case, but rather the presentation of the methods, including the formulation of the moment-based optimization model as well as the Wasserstein-metric-based heuristic. In addition, the presented ‘sample and evaluate’ methods can be useful for comparing different distance measures and identifying the most relevant measure for a given optimization model.

There are two natural avenues for future research: the first is to test the presented methods on a large-scale model, such as TIMES, and the second is to develop methods that can handle conditional distributions. The latter would allow usage in operational models (conditional on the latest data and possibly a forecast), as well as in multi-stage models with dependencies between stages.