Abstract
In this paper, we present and compare several methods for generating scenarios for stochastic-programming models by direct selection from historical data. The methods range from standard sampling and k-means, through iterative sampling-based selection methods, to a new moment-based optimization approach. We compare the methods on a simple portfolio-optimization model and show how to use them in a situation where we select whole sequences from the data, instead of single data points.
Introduction
In stochastic programming, we typically need to discretize the distribution of uncertain parameters. In a two-stage setting, this means generating a set of scenarios that approximates an assumed ‘true’ distribution. In this paper, we consider the situation where the ‘true’ distribution is given by a set of historical data and the scenarios must be its subset. In other words, we want to form the scenarios by selecting points from the data set in such a way that the selection, with associated probabilities, is a good approximation of the empirical distribution of the whole data set.
The main motivation for this study is stochastic versions of the TIMES (Loulou et al. 2016; Loulou and Lettila 2016) and EMPIRE (Skar et al. 2016) energy-system models. There, we may have many years of hourly data for measurements of different types (such as wind-turbine and PV production) and from different locations/regions. The optimization model then requires a number of scenarios, each including values for one day, i.e., a sequence of 24 values for each measurement. While it is theoretically possible to model these sequences using some multivariate stochastic process, it might not be easy to find a process that correctly captures all spatial and temporal dependencies between the measurements. Using actual observations, on the other hand, guarantees that values within each scenario are consistent and realistic, since they actually happened (Seljom and Tomasgard 2015). The same reasoning applies also when each observation is just a single value (not a sequence), in cases with complicated nonlinear dependencies between the different parameters.
Another possible situation where we want to generate scenarios by selection is when the model’s users provide their own data set. In such a case, the users may have more confidence in the model and its results if they know that it is using their actual data, instead of ‘synthetic’ values.
Note that the assumption that the data set represents the ‘true’ distribution excludes situations where we need to generate scenarios for conditional distributions, such as operational models where we want scenarios for now, given the latest data, or scenarios for distributions within a multistage scenario tree, conditional on values in the preceding nodes of the tree. These situations require a different approach and are therefore left for future research.
The requirement of only using existing values limits the selection of scenario-generation methods, as some of them produce ‘synthetic’ values. However, the following methods are applicable, either directly or with a slight modification:

sampling from the data set. By itself, this is not a suitable method for small scenario sets, as the sample distribution can be, by chance, very different from the whole data set. However, Seljom and Tomasgard (2015) show how to improve the match by sampling multiple scenario sets and then choosing the one whose distribution is closest to the data. They use the sum of differences in the first four marginal moments, as well as correlations, as their distance measure.

k-means clustering (MacQueen 1967; Lloyd 1982), where we divide the data set into a specified number of clusters. In the scenario-generation context, one would then normally use the clusters’ means as scenarios (e.g., Munoz and Watson 2015). While this would result in synthetic values, we can easily avoid this by instead selecting the data points closest to the clusters’ means.

scenario-reduction methods (Dupačová et al. 2003; Heitsch and Römisch 2003), since the data can be interpreted as a large scenario set. Note that Rujeerapaiboon et al. (2018) call the data-selection approach the ‘discrete scenario reduction problem’, which they contrast with the ‘continuous scenario reduction problem’, where the values can be chosen freely.
In this paper, we present several alternative methods that fit into the described framework:

a new MIP formulation of the moment-matching problem, i.e., minimizing the difference in moments and correlations between the scenarios and the data.

a modification of the Wasserstein-distance-minimizing heuristic from Pflug and Pichler (2015), adjusted for the case of data selection.

a variant of the sampling-based approach from Seljom and Tomasgard (2015), using the Wasserstein distance as a metric.
For the sake of clarity, we describe and test all the methods on a single-period case, i.e., the case where data points consist of a single value for each parameter.
The rest of the paper is organized as follows: we start by formulating the one-period problem in Sect. 2, followed by a presentation of the scenario-generation methods in Sect. 3. In Sect. 4, we test the methods on a portfolio-optimization problem with CVaR as a risk measure. Finally, Sect. 5 discusses extending the methods to the multi-period case, i.e., the case where data consist of sequences of values for each parameter.
One-period selection problem
The scenario-selection problem can be summarized as follows: we have a data set \(\mathcal {D}\) containing N data points for P parameters, i.e., \(D_n \in \mathbb {R}^P\) for \(n \in \mathcal {N} = \{1,\dots ,N\}\). From this set, we want to select S points such that the empirical distribution of the subset is as close to the empirical distribution of the whole set as possible. This raises two questions: how to measure the distance between distributions, and how to find a subset that minimizes the chosen metric.
For the distance measure, a natural choice seems to be the Kolmogorov–Smirnov statistic, i.e., the supremum of the absolute distance between the two distribution functions. Unfortunately, this distance does not scale well for multivariate data, so it is not suitable for our purpose. Another choice is the Wasserstein (or Kantorovich–Rubinstein) metric, also known as the ‘earth mover’s distance’ in computer science. This metric has a known connection to scenario generation and stochastic programming; see for example Pflug (2001) and Pflug and Pichler (2011, 2014). Finally, we can follow Seljom and Tomasgard (2015) and measure the distance in terms of differences in marginal moments and correlations. This approach also has an established connection to scenario generation, starting with Høyland and Wallace (2001).
Once we have chosen a measure, we have to find a method for identifying a subset that minimizes it. One option is the aforementioned ‘sample and evaluate’ approach from Seljom and Tomasgard (2015), i.e., to randomly select a large number of candidate subsets, evaluate the given measure (they use moments and correlations) on all of them, and then select the one that minimizes it. While this approach works, it provides only statistical guarantees about the quality of the identified subset. Indeed, if there are only a few subsets that are significantly better than the rest, this approach might not discover them. Moreover, random selection implies equiprobable scenarios, which limits the achievable match.
For this reason, we propose to use optimization, in particular mixed-integer linear programming (MIP), for the task. This is applicable both to the Wasserstein distance and, perhaps surprisingly, to the moment-based approach, since we are selecting from a set of known values. For the same reason, it is possible to have the output probabilities as variables in both MIP models, which should help to achieve a better match. This is a potential advantage over the ‘sample and evaluate’ approach.
In addition, we will test the already-mentioned k-means algorithm, where we split the data set into S clusters and then, from each cluster, select the data point closest to the cluster’s mean. We use the relative cluster sizes as the probabilities assigned to the selected data points.
Optimization methods for the one-period case
The optimization models share the binary selection variables \(x_n\) and selection probabilities \(p_n\), with the following requirements:
Equation (1) ensures that we select exactly S data points, while the remaining constraints distribute the probabilities only among the selected data points. Note that the lower bound \(P^\mathrm {min}\) is required to avoid zero probabilities of selected points, which would in effect mean fewer than S scenarios. On the other hand, \(P^\mathrm {max}\) is optional and serves to enforce a more even distribution of probabilities. One possibility is to introduce a new parameter \(\lambda \) and define
This guarantees that the highest probability is at most \(\lambda \) times larger than the smallest one, independently of S.
If we instead want equiprobable scenarios, we can simply replace (2)–(4) with
In this case, it is possible to substitute \(p_{n}\) out of the model.
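The displayed constraints are not reproduced above. Based on the surrounding description, one plausible reconstruction of (1)–(6) is the following; the exact formulation in the original may differ:

```latex
\begin{align}
  & \sum_{n \in \mathcal{N}} x_n = S, \qquad x_n \in \{0,1\}          \tag{1} \\
  & p_n \le P^{\mathrm{max}}\, x_n \quad \forall\, n \in \mathcal{N}  \tag{2} \\
  & p_n \ge P^{\mathrm{min}}\, x_n \quad \forall\, n \in \mathcal{N}  \tag{3} \\
  & \sum_{n \in \mathcal{N}} p_n = 1                                  \tag{4} \\
  & P^{\mathrm{max}} = \lambda\, P^{\mathrm{min}}                     \tag{5} \\
  & p_n = x_n / S \quad \forall\, n \in \mathcal{N}                   \tag{6}
\end{align}
```

In this reading, (5) bounds the ratio of the largest to the smallest probability by \(\lambda\) (feasibility requires \(P^{\mathrm{min}} \le 1/S \le P^{\mathrm{max}}\)), and (6) is the equiprobable variant that replaces (2)–(4).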
It is important to realize that this basic structure includes N binary variables, one for each data point we can select. This limits the size of the data sets to which the approach is applicable.
Minimizing the Wasserstein distance
The scenarioselection problem can be seen as distributing probabilities from the original empirical distribution to only the S selected points. To minimize the Wasserstein distance, we want to do this while minimizing the amount of moved probability, multiplied by the move distances. For this, we only need to add the following to (1)–(4) or (1), (6):
where \(P_i\) is the probability assigned to data point i (usually 1/N), \(\pi _{i j}\) is the probability moved from i to j, \(\Vert D_{i} - D_{j}\Vert \) is an appropriate metric, and r is the order of the Wasserstein distance. Note that the distances can be precomputed, so they are not limited to linear formulas. Also note that (4) may be dropped, as it is implied by (8) and (9). The optimization problem (1)–(4), (7)–(9) is referred to as the linear transportation problem in Heitsch and Römisch (2003) and as optimal quantization in Pflug and Pichler (2015) and Löhndorf (2016), illustrating its connections to several fields.
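The displayed objective and constraints are not reproduced above; a plausible reconstruction of (7)–(9), consistent with the surrounding description, is (the original display may differ):

```latex
\begin{align}
  \min\;      & \sum_{i \in \mathcal{N}} \sum_{j \in \mathcal{N}}
                \Vert D_i - D_j \Vert^r\, \pi_{ij}                    \tag{7} \\
  \text{s.t.}\; & \sum_{j \in \mathcal{N}} \pi_{ij} = P_i
                \quad \forall\, i \in \mathcal{N}                     \tag{8} \\
              & \sum_{i \in \mathcal{N}} \pi_{ij} = p_j
                \quad \forall\, j \in \mathcal{N}                     \tag{9}
\end{align}
```

In this form, summing (9) over j and substituting (8) gives \(\sum_j p_j = \sum_i P_i = 1\), which is consistent with the remark that (4) becomes redundant.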
While this model is simple to implement, it does not scale well because of the added variables \(\pi _{i j}\): the model has N binary and \(N^2 + N\) continuous variables, so it becomes intractable beyond circa a thousand data points.
For this reason, one usually resorts to a heuristic to solve the problem approximately. There are two common approaches: the first, discussed here, uses a variant of Lloyd’s algorithm (Lloyd 1982), while the second leads to the scenario-reduction techniques discussed in Sect. 3.5.
In this section, we present a modified version of Algorithm 2 from Pflug and Pichler (2015), which can be summarized as follows:

0.
initialize with selection set \(\mathcal {Z} ^0 = \{z^0_1,\dots ,z^0_S\}\), \(D^0 = \infty \), and \(k=1\).

1.
find Voronoi partition generated by \(\mathcal {Z} ^k\), i.e., sets \(\mathcal {V} ^k(z^k_s)\) for each \(s \in \{1,\dots ,S\}\).

2.
compute the Wasserstein distance \(D^k\) corresponding to \(\mathcal {Z} ^k\) and partitions \(\mathcal {V} ^k(z^k_s)\).

3.
stop if \(D^{k} \ge D^{k-1}\) (no more improvement)

4.
\(\mathcal {Z} ^{k+1} = \{z^{k+1}_1,\dots ,z^{k+1}_S\}\), where \(z^{k+1}_s\) is the ‘centre of order r’ of partition \(\mathcal {V} ^k(z^k_s)\).

5.
set \(k=k+1\) and goto 1.
Since we operate on a discrete set of points, the Voronoi partitions consist of the original data points \(D_i\), \(i \in \mathcal {N} \), where each \(D_i\) is assigned to the partition of its closest \(z^k_s\). Step 1 of the algorithm therefore becomes:
and the Wasserstein distance in Step 2 becomes
Step 4, on the other hand, involves integration even in the discrete case. An important exception is the case where \(\Vert .\Vert \) is the Euclidean metric and \(r=2\): there, the partition centres coincide with the conditional means, which are easily computed in the discrete case. In this case, the algorithm becomes the standard Lloyd’s algorithm for the k-means problem, which we treat separately in Sect. 3.4.
Here, we try another approach, where we require also the points \(z^k_s\) to be selected from the data set \(\mathcal {N} \), i.e., \(z^k_s = D_{i^k_s}\). In this case, Step 4 simplifies to
making the whole algorithm easily calculable.^{Footnote 1} Moreover, all computed distances are between the original data points, so they can be precomputed, speeding up the algorithm. When the algorithm stops, \(\mathcal {Z} ^k\) becomes the scenario set, with probabilities
equal to \(|\mathcal {V} ^k(z^k_s)| / N\) if the data are equiprobable.
As for the initialization in Step 0, Pflug and Pichler (2015) recommend generating the initial set using Algorithm 1 from the same paper. However, in our case, this algorithm is slow compared to the rest. Therefore, we run Algorithm 2 from multiple randomly generated initial sets and use the selection with the smallest Wasserstein distance. This is significantly faster than using Algorithm 1 and our testing shows no difference in the final distance.
Note that the heuristic, unlike the MIP formulation, does not provide any control over the resulting probabilities. Hence, should our optimization problem require equiprobable scenarios, we would simply have to disregard the sizes of the Voronoi sets and assign the same probability to all resulting scenarios. This could have a negative impact on the quality of the approximation, since an outlier data point could end up in a set of its own and then get the same probability as the other scenarios.
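The discrete version of the heuristic can be sketched as follows. This is a simplified reading of Steps 0–5 above, assuming equiprobable data (\(P_i = 1/N\)); the function name and implementation details are ours:

```python
import numpy as np

def wd_select(data, S, r=2, n_starts=10, seed=0):
    """Sketch of the discrete Lloyd-type heuristic for Wasserstein-based
    selection (modified Algorithm 2 of Pflug and Pichler (2015), with the
    centres restricted to the data points themselves)."""
    N = len(data)
    # all pairwise r-th-power distances can be precomputed
    dist_r = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2) ** r

    def assign(sel):
        part = np.argmin(dist_r[:, sel], axis=1)    # Step 1: Voronoi sets
        d = dist_r[np.arange(N), sel[part]].mean()  # Step 2: WD^r with P_i = 1/N
        return part, d

    rng = np.random.default_rng(seed)
    best = (np.inf, None, None)
    for _ in range(n_starts):
        sel = rng.choice(N, size=S, replace=False)  # Step 0: random init
        part, d = assign(sel)
        while True:
            new_sel = sel.copy()
            for s in range(S):                      # Step 4: centre of order r
                members = np.flatnonzero(part == s)
                if members.size:
                    within = dist_r[np.ix_(members, members)].sum(axis=0)
                    new_sel[s] = members[np.argmin(within)]
            new_part, new_d = assign(new_sel)
            if new_d >= d:                          # Step 3: no improvement
                break
            sel, part, d = new_sel, new_part, new_d
        if d < best[0]:
            probs = np.bincount(part, minlength=S) / N  # |V(z_s)| / N
            best = (d, sel, probs)
    return best  # (distance^r, selected indices, probabilities)
```

Restricting the new centre of each partition to its own members keeps Step 4 a simple minimization over precomputed distances, which is exactly what makes the discrete variant fast.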
Minimizing difference in moments and correlations
Following Høyland and Wallace (2001), we use the first four moments and correlations, with the following definitions:
where \(D_{*,p}\) denotes the pth component of all data points, i.e., all data for parameter \(p \in \mathcal {P} = \{1,\dots ,P\}\).
Formulas (13)–(14) are nonlinear due to the scaling by \(\sigma _p\), so we use unscaled versions instead. Note that this means that if the sample variance of p differs from \(\sigma _p^2\), then the scaled versions will not be calculated correctly.
Since we have a discrete distribution, expected values simplify to sums over \(n \in \mathcal {N} \), with probabilities \(p_{n}\) from either (1)–(4) or (1), (6). In particular, the scenariobased expected value of the mth power of parameter \(p \in \mathcal {P} \) is
and, using the binomial expansion, the mth central moment becomes
The first equation is linear in \(p_{n}\), since \(D_n\) is data. To make the second equation linear in \(\mathrm {E}\bigl [D_{*,p}^k\bigr ]\), and hence also in \(p_{n}\), we need \(\mu \) to be a constant. In other words, we have to replace the actual sample mean by its target value. Hence, the computed moments will be exact only if the sample mean is exactly equal to \(\mu _p\). Alternatively, we can avoid the approximation by directly matching the expected powers \(\mathrm {E}\bigl [D_{*,p}^m\bigr ]\).^{Footnote 2}
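Written out (a reconstruction of the two displayed equations from the surrounding description), the scenario-based expected power and the binomial expansion of the m-th central moment are:

```latex
\mathrm{E}\bigl[D_{*,p}^m\bigr] = \sum_{n \in \mathcal{N}} p_n\, D_{n,p}^m,
\qquad
\mathrm{E}\bigl[(D_{*,p} - \mu_p)^m\bigr]
  = \sum_{k=0}^{m} \binom{m}{k}\, (-\mu_p)^{m-k}\, \mathrm{E}\bigl[D_{*,p}^k\bigr].
```

The first expression is linear in \(p_n\); the second is linear in the expected powers \(\mathrm{E}\bigl[D_{*,p}^k\bigr]\) exactly when \(\mu_p\) is treated as a constant.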
For correlations, we use the fact that
and
where the second equation is linear in \(p_{n}\) since D is data.
Just like for the moments, we have the choice of matching the correlations, with \(\mu _i\) and \(\sigma _i\) replaced by their target values, or matching only \(\mathrm {E}\left[ D_i\,D_j\right] \).
Putting it all together, the complete model for the moment-based approach, in addition to (1)–(4) or (1), (6), is:
subject to:
where \(\mathcal {M} = \{1,2,3,4\}\) is the set of considered moments, \(M_{pm}\) are the target values of the central moments of parameter \(p \in \mathcal {P} \), \(\mu _p = M_{p1}\) and \(\sigma _p=\sqrt{M_{p2}}\). In addition, we define \(r_{p0} = 1\) in (19). Finally, \(\Sigma _{pq}\) is the target value of the correlation between parameters p and q.
Moreover, \(W_m\) and \(W_{pq}\) are weights for distances in moments and correlations, respectively. Since we usually expect the sensitivity of an optimization problem to decrease with the order of the moment of the input data (see, e.g., Chopra and Ziemba 1993), it is natural to use a decreasing sequence of weights. Another argument for higher weights on mean and variance is that, as we have already seen, mismatch in those moments implies error in evaluation of the higher moments, as well as correlations. In our test case, we have used \(W = \{10,5,2,1\}\) and \(W_{pq} = 3\) for all p, q.
Note that any of \(r_{pm}\), \(c_{pm}\), \(r_{pq}\), and \(s_{pq}\) can be easily substituted out of the model, reducing the number of variables but creating a denser LP matrix. Our testing with FICO™ Xpress indicates that this has some impact on solution times, but the exact choice of what to substitute is likely to differ between solvers and solution algorithms, so we do not present the results here.
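For reference, the distance that the model minimizes can also be evaluated directly for a given selection, e.g., inside a sample-and-evaluate loop. The following sketch assumes absolute differences weighted by \(W_m\) and \(W_{pq}\), with targets computed from the full data set; the exact objective of the MIP model may differ in details:

```python
import numpy as np

def moment_distance(data, sel, probs, W_m=(10, 5, 2, 1), W_corr=3.0):
    """Sketch: weighted sum of absolute differences between the
    scenario-based and data-based first four central moments and
    correlations (an assumed form of the moment-based objective)."""
    def central_moments(X, p):
        mu = p @ X                     # means, per parameter
        dev = X - mu
        var = p @ dev**2
        # non-scaled 3rd and 4th central moments, cf. the linearization above
        return np.array([mu, var, p @ dev**3, p @ dev**4]), np.sqrt(var)

    def corr(X, p, sigma):
        dev = X - p @ X
        cov = dev.T @ (dev * p[:, None])   # probability-weighted covariance
        return cov / np.outer(sigma, sigma)

    N = len(data)
    P_full = np.full(N, 1 / N)
    M_tgt, sig_tgt = central_moments(data, P_full)
    M_sel, sig_sel = central_moments(data[sel], probs)
    dist = sum(W_m[m] * np.abs(M_tgt[m] - M_sel[m]).sum() for m in range(4))
    # correlation matrices are symmetric, so halve the summed difference
    dist += W_corr * np.abs(corr(data, P_full, sig_tgt)
                            - corr(data[sel], probs, sig_sel)).sum() / 2
    return dist
```

Selecting the whole data set with uniform probabilities gives a distance of zero, which is a convenient sanity check.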
k-means clustering
The k-means clustering algorithm (MacQueen 1967; Lloyd 1982) is used to divide a data set into a given number of clusters. As we have seen in Sect. 3.1, it is also a special case of the heuristic for minimizing the mass-transportation problem (1)–(4), (7)–(9), with the Euclidean distance and \(r=2\). Once the algorithm finishes, we need to select a representative scenario from each cluster. For this, we simply use the data point closest to the cluster’s mean. We then use the relative cluster sizes as the probabilities assigned to the scenarios.
The advantage of k-means is that it is a standard method with many available implementations, and that it is very fast. On the other hand, there is no guarantee that the resulting scenarios constitute a good approximation of the distribution.
Note that this method, just like the Wasserstein heuristic, does not provide control over the resulting cluster sizes and hence the scenario probabilities. However, there are alternative versions of the k-means method that provide some control over cluster sizes, such as the ‘constrained k-means clustering’ method from Bennett et al. (2000). The downside of this variant is a longer runtime, compared to the standard k-means.
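The whole k-means-based procedure can be sketched as follows. In practice one would use an optimized library implementation (such as scikit-learn) for the clustering step; a basic Lloyd iteration is inlined here only to keep the example self-contained:

```python
import numpy as np

def kmeans_select(data, S, n_starts=10, max_iter=100, seed=0):
    """Sketch: run Lloyd's k-means, then replace each cluster mean by the
    closest actual data point and use relative cluster sizes as
    probabilities."""
    rng = np.random.default_rng(seed)
    N = len(data)
    best = (np.inf, None, None)
    for _ in range(n_starts):
        centres = data[rng.choice(N, size=S, replace=False)]
        for _ in range(max_iter):
            labels = np.argmin(
                ((data[:, None, :] - centres[None, :, :]) ** 2).sum(-1), axis=1)
            new = np.array([data[labels == s].mean(axis=0)
                            if np.any(labels == s) else centres[s]
                            for s in range(S)])
            if np.allclose(new, centres):
                break
            centres = new
        labels = np.argmin(
            ((data[:, None, :] - centres[None, :, :]) ** 2).sum(-1), axis=1)
        inertia = ((data - centres[labels]) ** 2).sum()
        if inertia < best[0]:
            best = (inertia, labels, centres)
    _, labels, centres = best
    # replace each cluster mean by the closest actual data point
    sel = np.array([int(np.argmin(((data - centres[s]) ** 2).sum(-1)))
                    for s in range(S)])
    probs = np.bincount(labels, minlength=S) / N
    return sel, probs
```

The only difference from standard k-means scenario generation is the final step, which maps each synthetic cluster mean back to an existing data point.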
Sampling-based approaches
The ‘sample and evaluate’ approaches provide another way to generate scenarios. One obvious advantage is their simplicity: we randomly sample scenarios from the data, compute their distance from the target distribution, and keep track of the best scenario set. They are also flexible, as they work with any distance measure that we can implement and evaluate quickly enough.
For instance, this approach is well suited for use with moments and correlations, since these are both easy to implement and fast to evaluate (Seljom and Tomasgard 2015).
It is also possible to use it with the Wasserstein distance. With the scenario selection given by the sample, we can remove \(x_n\) and all \(j \in \mathcal {N} \setminus \mathcal {S} \) from (1)–(4), (7)–(9), which then simplifies to
It turns out that this model can be solved analytically (Theorem 2 of Dupačová et al. 2003), by setting
as long as we do not require (3’); this constraint is no longer needed to ensure \(p_{j} > 0\), as the construction guarantees \(p_{j} \ge \min _i\, P_i\). Since the \(\pi _{i j}\)’s are either zero or one, this solution assigns every input \(i\in \mathcal {N} \) to one of the selected points \(j\in \mathcal {S} \), in effect partitioning the set into clusters. This provides a justification for Step 1 of the heuristic presented in Sect. 3.1.
Note that no such analytical solution exists if we require fixed probabilities. In that case, we would have no choice but to solve the LP (7’)–(9’), with \(p_{j} = 1/S\), in each iteration of the sampling algorithm.
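The analytical evaluation above can be sketched as follows, under the assumption of equiprobable data points (\(P_i = 1/N\)); the function name is ours:

```python
import numpy as np

def sample_wd(data, sel, r=2):
    """Sketch of the analytical solution of (7')-(9'): with the selection
    fixed, each data point sends all of its probability to the nearest
    selected point, which yields both the r-th power of the Wasserstein
    distance and the scenario probabilities."""
    N = len(data)
    dist = np.linalg.norm(data[:, None, :] - data[sel][None, :, :], axis=2)
    nearest = np.argmin(dist, axis=1)                  # cluster assignment
    wd_r = (dist[np.arange(N), nearest] ** r).mean()   # P_i = 1/N
    probs = np.bincount(nearest, minlength=len(sel)) / N
    return wd_r, probs
```

A sample-and-evaluate loop then simply draws many random selections, calls this function for each, and keeps the selection with the smallest distance.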
Scenario-reduction techniques
Scenario-reduction techniques from Dupačová et al. (2003) and Heitsch and Römisch (2003) provide an alternative way to approximately solve the linear transportation problem (1)–(4), (7)–(9). They take advantage of the fact that the problem is easily solvable if we want to either select or remove a single point from the data, and come in two types:

‘reduction’ methods, where we start from the whole data set and remove one point at a time until we reach the required size.

‘selection’ methods, where we start with an empty set and then add one point from the data at a time.
In both papers, testing shows that the selection methods generate better scenarios (smaller Wasserstein distance from the data set), but can be slow for large S. Heitsch and Römisch (2003) therefore recommend using the fast forward selection method as long as \(S \lessapprox N/4\). Since our tests will be well within this limit, we use this method (Algorithm 2.4 of Heitsch and Römisch 2003) in our tests.
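The greedy scheme behind fast forward selection can be sketched as follows; this is our reading of the method (each step adds the point that most reduces the Wasserstein distance of the selection to the full data set), not a verbatim transcription of Algorithm 2.4:

```python
import numpy as np

def fast_forward_selection(data, S, r=2):
    """Sketch of fast forward selection, in the spirit of Algorithm 2.4 of
    Heitsch and Römisch (2003), with equiprobable data points."""
    N = len(data)
    P = np.full(N, 1.0 / N)
    c = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2) ** r
    # first point: smallest expected distance to all other points
    z = int(np.argmin(P @ c))
    selected = [z]
    min_c = c[:, z].copy()   # distance of each point to the current selection
    for _ in range(S - 1):
        # expected cost if candidate u were added: every point may switch to u
        cand = P @ np.minimum(min_c[:, None], c)
        cand[selected] = np.inf          # do not re-select a point
        z = int(np.argmin(cand))
        selected.append(z)
        min_c = np.minimum(min_c, c[:, z])
    # redistribute probabilities: each point goes to its nearest selected point
    assign = np.argmin(c[:, selected], axis=1)
    probs = np.bincount(assign, minlength=S) / N
    return np.array(selected), probs
```

The key to its speed is that `min_c` caches each point's distance to the current selection, so every greedy step costs only one pass over the precomputed distance matrix.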
Test: portfolio-optimization problem with CVaR
In this section, we test all the approaches presented in the previous section on a simple portfolio-optimization problem, with CVaR as a risk measure (Rockafellar and Uryasev 2000; Uryasev 2000). We put CVaR into the objective, with risk weight \(W_R\), so the model is:
There, \(x_i\) is the amount invested in instrument \(i \in \mathcal {I} \), limited by a budget B and a maximum fraction \(W_I\) invested in any one instrument. \(R_{is}\) are the instrument returns in scenario \(s \in \mathcal {S} \), so \(y_s\) is the profit in scenario s. \(\zeta \) represents a threshold, equal to the value-at-risk (VaR) at the optimum. Therefore, \(z_s\) is the profit below VaR and c the conditional value-at-risk (CVaR) with confidence level \(\alpha \). In our test, we use \(\alpha = 0.05\), \(B = 1000\), and \(W_I = 0.25\).
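As a concrete illustration of how such an objective is evaluated for a fixed portfolio, consider the following sketch. The function name, the sign convention (profit-based CVaR added to the expected profit with weight \(W_R\)), and the tail approximation are our assumptions, not the exact model (27)–(32):

```python
import numpy as np

def cvar_objective(x, R, alpha=0.05, W_R=0.3):
    """Sketch: evaluate a fixed portfolio x on a scenario set R (scenarios
    in rows), approximating CVaR_alpha of profit by the mean of the worst
    ceil(alpha*S) equiprobable scenarios."""
    y = R @ x                          # profit in each scenario
    S = len(y)
    k = max(1, int(np.ceil(alpha * S)))
    tail = np.sort(y)[:k]              # the worst scenarios
    cvar = tail.mean()                 # expected tail profit
    return y.mean() + W_R * cvar       # risk measure enters the objective
```

Because the portfolio is fixed, no optimization is needed here; this is essentially what the out-of-sample evaluation in Sect. 4 does with the full data set as the scenario set.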
For our data, we use 5 years of daily close prices for the 25 largest stocks on the Oslo stock market, with data starting in September 2015. From this, we compute weekly price returns, resulting in 1248 data points. We consider three different problem dimensions, by using the first 10, 20, or 25 stocks, and test four different scenario-set sizes, \(S \in \{10, 20, 50, 100\}\).
Both Wasserstein-based approaches use the Euclidean norm in the calculation of distances, and \(r=2\). For the moment-based optimization model, we set the parameter \(\lambda \) from (5) to 10. Moreover, since Xpress struggles to close the optimality gap of model (1)–(4), (17)–(25), we stop it after a specified time limit (see below).
The scenario-generation framework has been implemented in Python. The MIP models were written in Pyomo (Hart et al. 2011, 2017) and solved using the FICO™ Xpress solver. All source files are freely available from https://github.com/SINTEF/scengendataselect.
We test the following scenario-generation methods and variants:

‘WD heuristic’ is the heuristic described in Sect. 3.1. The method is initialized with 10 different random selections.

‘opt. 300 s’ is the moment-based optimization with a time limit of 300 s.

‘opt. 1800 s’ is the moment-based optimization with a time limit of 1800 s.

‘sampling 1x’ is using just a single sample, to serve as a benchmark.

‘sampl. MM 500x’ is selecting from 500 samples, using the moment-based distance.

‘sampl. MM 10000x’ is the same with 10,000 samples.

‘sampl. WD 500x’ is selecting from 500 samples, using the Wasserstein metric.

‘sampl. WD 10000x’ is the same with 10,000 samples.

‘k-means’, using the Python implementation from scikit-learn (Pedregosa et al. 2011). This involves starting the method from 10 different random selections.

‘scen. reduction’, using our Python implementation of the fast forward selection method.
For each combination of scenario-generation method, dimension, and scenario-set size, we generated 25 trees, with the exception of optimization and scenario reduction, where we generated only one, since they would all be equal. This gives \((7 \times 25 + 3) \times 3 \times 4 = 2136\) scenario sets.
Generation times
The average generation times are reported in Fig. 1. We do not show the moment-based optimizations, because they run until their time limit, nor the sampling-based methods with 10,000 iterations, since these simply take \(20\times \) longer than 500 iterations. All reported times are from a laptop with a 2.8 GHz Intel® i7 processor and 16 GB of RAM.
We see that k-means is the fastest method, followed by the Wasserstein heuristic. However, it should be pointed out that the k-means method in scikit-learn is highly optimized, while our implementation of the Wasserstein heuristic is not. Of the sampling methods, the moment-based one is significantly faster than the one using the Wasserstein distance. Finally, scenario reduction is the slowest of the shown methods.
We can see that the runtime of k-means and of all Wasserstein-based methods, including scenario reduction, does not depend on the dimension. This is expected, since the dimension enters only into the calculation of distances and otherwise does not influence the algorithms. Otherwise, the generation time increases approximately linearly with the number of scenarios, i.e., the number of clusters in k-means and the Wasserstein-distance heuristic.
The situation is reversed for the moment-based sampling, with generation times independent of the number of scenarios and increasing with the dimension. The latter is expected, as a higher dimension means more moments and correlations to evaluate and compare. Since the time to calculate moments and correlations must increase with the number of scenarios, the result suggests that this is not a significant factor in the overall runtime.
Out-of-sample evaluations
Next, we perform out-of-sample evaluations of the trees, in the sense defined in Kaut and Wallace (2007). We do this for six different values of the risk weight, \(W_R \in \{0,0.1,\dots ,0.5\}\). For each \(W_R\), we solve the optimization problem (27)–(32) on all the generated trees. Next, we take the \(6 \times 2136 = 12\,816\) scenario-based solutions and evaluate them on the full data set, i.e., we solve the problem using the data set as scenarios, with the investment variables x fixed to the values from those solutions. This gives us out-of-sample evaluations of the solutions, and hence a measure of the quality of the scenario sets they originated from.
In addition, we solve the problem on the data set without any restrictions to get the ‘true’ optimal solution, for each value of \(W_R\).
Difference in objective value
Since we have, for each scenario set and \(W_R\), both the outofsample evaluation and the actual ‘true’ optimal objective value, we can subtract the two to get a measure of suboptimality caused by the scenario set; following Pflug and Pichler (2011), we will refer to it as the discretization error, but it is also known as the approximation error (Pflug 2001), or policy error (Löhndorf 2016). Then we can compare distributions of the discretization errors across all the test cases, for scenario sets generated by different methods.
The results are presented in Figs. 2 and 3, where the former shows the whole distributions, while the latter zooms in for easier comparisons of medians and quartiles. We omit the samplingbased methods with 10,000 iterations, since there was no visible improvement compared to the variants with 500 iterations.
Unsurprisingly, using a random sample is the worst method. Comparing the two methods selecting from 500 samples, we can see that moments and correlations are a better selection criterion than the Wasserstein distance, for our optimization problem.
The Wasserstein-distance heuristic is slightly better than any of the sampling-based methods, but worse than both k-means and the moment-based optimization. Furthermore, for the optimization method, we see that giving the solver more time helps to shift the distribution downwards, even though the median remains unchanged.
Finally, scenario reduction is clearly the best method overall, with median discretization error almost half that of the nextbest methods.
We have also tested splitting the test cases by dimension, number of scenarios, and values of the risk weight. From this, we could conclude that the above observations hold for most subsets. The biggest difference was when we considered only sets with 10 scenarios, where the moment-based optimization method with the 300-second limit performed almost as badly as random sampling. This is probably due to the fact that with 10 scenarios, it is difficult to get a good match in means and variances, causing errors in the evaluation of the other moments, as mentioned in Sect. 3.2.
On the other hand, the moment-based method with the 1800-second limit was better than scenario reduction for \(W_R \in \{0, 0.1\}\). For the risk-neutral case (\(W_R=0\)), its median discretization error is virtually zero, suggesting that it found the actual optimal value in (at least) 50% of the cases.
Apart from that, we can observe that:

dimension matters: the discretization error is several times bigger for 25 assets, compared to 10;

more scenarios help: as expected, the discretization error decreases with the number of scenarios;

risk-aversion matters: the discretization error increases with \(W_R\).^{Footnote 3}
Moreover, the nature of the optimization problem allows us to plot the solutions in a mean–risk plane and compare them to the ‘true’ efficient frontier. This way, we can get some additional insights into the way the different methods work.
For example, comparing Figs. 4 and 5 shows that, compared to k-means, the Wasserstein-based sampling results in portfolios with higher risk, i.e., with worse CVaR (solutions ‘to the left’ of the efficient frontier). This suggests that its scenarios do not represent the distributions’ tails as well as those based on k-means.
Extending the methods to multi-period sequences
So far, we have limited ourselves to the case of single-period data. However, as we pointed out at the start of the paper, our motivation for the data-selection approach came from the case where we select whole sequences, such as hourly values for a day or a week. In this section, we show how to extend the methods from Sect. 3 to handle this.
The extension means that our data set \(\mathcal {D}\) now contains N data series for P parameters, with each series containing T values, i.e., \(D_{np} \in \mathbb {R}^T\) for each \(n \in \mathcal {N} \) and \(p \in \mathcal {P} \). However, these data can be interpreted as N data points for \(P \times T\) parameters, converting any temporal dependencies into spatial ones. This way, we can convert the problem into the previous one, at the cost of increasing the dimension T times. For the above-mentioned cases of hourly values for one day or week, this means a 24- or 168-fold increase in dimension, which could have significant implications for the scenario-generation methods.
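The reinterpretation itself is just a reshape; a minimal sketch, with assumed dimensions:

```python
import numpy as np

# Sketch: a data set of N series, P parameters, T periods (shape N x P x T)
# is flattened into N points with P*T 'parameters', after which the
# single-period selection methods apply unchanged.
N, P, T = 1000, 4, 24
rng = np.random.default_rng(0)
series = rng.normal(size=(N, P, T))     # stand-in for real hourly data
flat = series.reshape(N, P * T)         # row n holds all P*T values of series n
```

Each row of `flat` still corresponds to one historical day, so any point selected by the single-period methods can be mapped back to a complete, internally consistent sequence.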
Fortunately, both k-means and the Wasserstein-distance-based methods are basically independent of the dimension, as seen in Fig. 1. This is because they can operate on a precomputed set of distances between the data points, so the dimension enters only into the preprocessing.
For the methods based on moments and correlations, on the other hand, the number of evaluated quantities grows quadratically with the dimension, because of the correlations. This will make the sampling with moment matching significantly slower, and the optimization model most likely unsolvable for most practical problems.
There is also another issue to consider: with such a significant increase in dimension, we cannot really expect to get a very good match, unless we increase the number of scenarios correspondingly. In many cases, there might be some natural way to aggregate the sequences in order to decrease the dimension and therefore improve the achievable match. For example, if we are selecting days from a data set with hourly inflows, matching the distribution of total daily inflows could be enough, while the fact that we use historical data would ensure that the intra-day profiles are realistic. Similarly, if we generate scenarios for wind- or solar-power capacity factors, then it could suffice to match the distribution of the daily average values (so we have a realistic distribution of good and bad days). We could also choose a middle way and do a partial aggregation, using only a couple of values per day (day/night, peak/off-peak, etc.).
In other words, while we can use the same methods as for the single-period case, they require some extra considerations. We will present a study describing generation of multi-period scenarios for the TIMES energy-system model in a forthcoming paper.
Conclusions
In this paper, we have presented five different methods for selecting representative data points or sequences from historical data, in order to obtain a good approximation of the empirical distribution represented by the data. Such methods are needed in cases where the model users do not want to use synthetic values and therefore allow only historical data in the scenarios for a given optimization model.
The methods range from the off-the-shelf k-means algorithm, through easy-to-implement scenario-reduction and iterative sampling methods with two different distance measures, to new optimization models and a Wasserstein-based heuristic.
We have then compared the methods using a simple portfolio-optimization model with CVaR as a risk measure. In our case, the 'fast forward selection' scenario-reduction method worked best, followed by k-means, which was the fastest. However, these results are likely to be problem-dependent. For example, the sampling-based methods generally produce equiprobable scenarios, so they cannot provide as good a match as methods with adjustable probabilities. On the other hand, k-means and the Wasserstein-based methods, including scenario reduction, cannot control the output probabilities—if we want equiprobable scenarios, we have to disregard the computed probabilities, which is likely to have an impact on the quality of the scenarios. Consequently, should we generate scenarios for an optimization model that requires equiprobable scenarios, such as the TIMES energy-system model, the sampling-based methods would lose their disadvantage while the other methods would not work optimally, so the conclusion could change.
The main contribution of the paper is therefore not the results of the particular test case, but rather the presentation of the methods, including the formulation of the moment-based optimization model as well as the Wasserstein-metric-based heuristic. In addition, the presented 'sample and evaluate' methods can be useful for comparing different distance measures and identifying the most relevant measure for a given optimization model.
There are two natural avenues for future research: the first is to test the presented methods on a large-scale model, such as TIMES, and the second is to develop methods that can handle conditional distributions. The latter would allow usage for operational models (conditional on the latest data and possibly a forecast), as well as multistage models with dependencies between stages.
Notes
Another way of avoiding the approximation is adding \(r_{p1} = \mu _p\) as a constraint to the model. This, however, could make the model infeasible, especially in cases with small S.
This is also expected, since scenarios for the risk-neutral case only need to match the means in order to produce correct results. CVaR, on the other hand, requires a good representation of the tails as well, so increasing its weight in the objective makes the scenario-generation process more demanding.
References
Bennett K, Bradley P, Demiriz A (2000) Constrained k-means clustering. Tech rep MSR-TR-2000-65, Microsoft Research. https://www.microsoft.com/enus/research/publication/constrainedkmeansclustering/
Chopra V, Ziemba W (1993) The effects of errors in means, variances, and covariances on optimal portfolio choice. J Portfolio Manag 19(2):6–11
Dupačová J, GröweKuska N, Römisch W (2003) Scenario reduction in stochastic programming: An approach using probability metrics. Math Program 95(3):493–511. https://doi.org/10.1007/s1010700203310
Hart WE, Watson JP, Woodruff DL (2011) Pyomo: modeling and solving mathematical programs in Python. Math Program Comput 3(3):219–260. https://doi.org/10.1007/s1253201100268
Hart WE, Laird CD, Watson JP, Woodruff DL, Hackebeil GA, Nicholson BL, Siirola JD (2017) Pyomo—Optimization modeling in python. Springer optimization and its applications, vol 67, 2nd edn. Springer. https://doi.org/10.1007/9783319588216
Heitsch H, Römisch W (2003) Scenario reduction algorithms in stochastic programming. Comput Optim Appl 24(2–3):187–206. https://doi.org/10.1023/A:1021805924152
Høyland K, Wallace SW (2001) Generating scenario trees for multistage decision problems. Manag Sci 47(2):295–307
Kaut M, Wallace SW (2007) Evaluation of scenario generation methods for stochastic programming. Pac J Optim 3:257–271
Lloyd SP (1982) Least squares quantization in PCM (prev. published as a Technical Report RR5497, Bell Labs, 1957). IEEE Trans Inf Theory 28(2):129–137. https://doi.org/10.1109/tit.1982.1056489
Löhndorf N (2016) An empirical analysis of scenario generation methods for stochastic optimization. Eur J Oper Res 255(1):121–132. https://doi.org/10.1016/j.ejor.2016.05.021
Loulou R, Lettila A (2016) Stochastic programming and tradeoff analysis in TIMES. Tech rep, IEA-ETSAP. https://ieaetsap.org/index.php/documentation
Loulou R, Goldstein G, Kanudia A, Lettila A, Remme U (2016) Documentation for the TIMES model—part I. Tech rep, IEA-ETSAP. https://ieaetsap.org/index.php/documentation
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Cam LML, Neyman J (eds) Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, vol 1. University of California Press, pp 281–297
Maranzana FE (1963) On the location of supply points to minimize transportation costs. IBM Syst J 2(2):129–135. https://doi.org/10.1147/sj.22.0129
Munoz FD, Watson JP (2015) A scalable solution framework for stochastic transmission and generation planning problems. Comput Manag Sci 12(4):491–518. https://doi.org/10.1007/s102870150229y
Park HS, Jun CH (2009) A simple and fast algorithm for K-medoids clustering. Expert Syst Appl 36(2):3336–3341. https://doi.org/10.1016/j.eswa.2008.01.039
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikitlearn: machine learning in Python. J Mach Learn Res 12:2825–2830
Pflug GC (2001) Scenario tree generation for multiperiod financial optimization by optimal discretization. Math Program 89(2):251–271. https://doi.org/10.1007/PL00011398
Pflug GC, Pichler A (2011) Approximations for probability distributions and stochastic optimization problems. In: Bertocchi M, Consigli G, Dempster MAH (eds) Stochastic optimization methods in finance and energy, International series in operations research & management science, vol 163. Springer, chap 15, pp 343–387. https://doi.org/10.1007/9781441995865_15
Pflug GC, Pichler A (2014) Multistage stochastic optimization. Springer series in operations research and financial engineering. Springer. https://doi.org/10.1007/9783319088433
Pflug GC, Pichler A (2015) Dynamic generation of scenario trees. Comput Optim Appl 62(3):641–668. https://doi.org/10.1007/s1058901597580
Rockafellar R, Uryasev S (2000) Optimization of conditional valueatrisk. J Risk 2:21–42. https://doi.org/10.21314/JOR.2000.038
Rujeerapaiboon N, Schindler K, Kuhn D, Wiesemann W (2018) Scenario reduction revisited: fundamental limits and guarantees. Math Program. https://doi.org/10.1007/s1010701812691
Seljom P, Tomasgard A (2015) Shortterm uncertainty in longterm energy system models—a case study of wind power in Denmark. Energy Economics 49:157–167. https://doi.org/10.1016/j.eneco.2015.02.004
Skar C, Doorman G, PérezValdés G, Tomasgard A (2016) A multi-horizon stochastic programming model for the European power system. Tech rep 2/2016, CenSES. https://www.ntnu.no/censes/workingpapers
Uryasev S (2000) Conditional valueatrisk: optimization algorithms and applications. Financ Eng News 14:1–5. https://doi.org/10.1109/CIFER.2000.844598
Acknowledgements
This work has been done as part of the research project "Assessment of the Value of Flexibility Services from the Norwegian Energy System" (ASSETS), Norwegian Research Council project no. 268097, see https://prosjektbanken.forskningsradet.no/#project/NFR/268097.
Funding
Open access funding provided by SINTEF AS.
Cite this article
Kaut, M. Scenario generation by selection from historical data. Comput Manag Sci 18, 411–429 (2021). https://doi.org/10.1007/s10287021003994
Keywords
 Stochastic programming
 Scenario generation