1 Introduction

Using a discrete approximation of a continuous random variable (rv) is a procedure that is often adopted in many problems where uncertainty is present and needs to be taken into account. Substituting a continuous probability density function (pdf) with an approximating probability mass function (pmf), supported on a (possibly) finite number of points, can heavily reduce the computational burden required for determining a numerical solution for the problem at hand, and can produce an approximate solution whose degree of accuracy is still acceptable. This is particularly true when a problem involves several quantities that should be modelled as continuous rvs and the solution depends on some complex function thereof. The exact solution in this case could be obtained by applying some multivariate numerical integration technique whose computational cost may be severe and dramatically increasing with the number of rvs involved. Approximating each continuous rv through a properly chosen discrete rv allows the researcher to avoid numerical integration and resort to enumeration, which is much easier to manage (Luceno 1999).

An interesting application of discrete approximation of continuous distributions can be found in the field of insurance. If one is required to determine the distribution of the total claim amount \(S=\sum _{i=1}^N X_i\) corresponding to a random number N of i.i.d. claims whose size \(X_i\) is assumed to follow some continuous probability distribution, rather than resorting to integral convolution to determine the exact distribution of S, one can approximate the distribution of the claim size \(X_i\) through discretization and then apply Panjer’s formula (Panjer 1981), a recursive formula that exactly calculates the distribution of the total when the claim size is arithmetic with span size \(h>0\) and the distribution of the number of claims N belongs to the (a, b, 0) class.

In the field of quantitative finance, a similar application is the following. Let \(\pmb {L} = (L_1, \dots , L_d)\) denote a vector of possibly dependent rvs, each one representing a loss on a particular trading desk, portfolio or operating unit within a firm, over a fixed time period. Sometimes we need to aggregate these losses into a single rv, typically the sum \(L^+=\sum _{i=1}^d L_i\), on which one can calculate a measure of the aggregate risk, for example, the Value-at-Risk, which is nothing else than the quantile at a prespecified level \(0<\alpha <1\) of the distribution of \(L^+\). However, determining this measure of risk, when the joint distribution of \(\pmb {L}\) is fully specified, requires the computation of the distribution of \(L^+\), which is not straightforward to derive even if all the rvs are mutually independent. A possible answer is represented by the discretization of the \(L_i\) and the construction of a joint pmf approximating the joint distribution of \(\pmb {L}\), on which the evaluation of the VaR of the aggregating function is much more straightforward (see, for example, Jamshidian and Zhu (1996), where the authors consider the multivariate Gaussian distribution).

In reliability engineering, the stress-strength model describes a system with random strength which is subject to a random stress during its functioning, so that the system works only when the strength is greater than the stress. The probability that a system works correctly is termed reliability (Johnson 1988). Evaluating the reliability of a system thus requires knowledge of the probability distributions of both stress and strength; when these depend on several stochastic factors, the probability distributions (and hence the reliability) are often not analytically tractable and some form of approximation is needed. Given a known functional relationship between stress (or strength) and its random subfactors, and assuming the subfactors of the stress (or strength) are independent, one feasible approach to approximating the probability distribution of stress (or strength) is through discretization of the subfactors (see, e.g., English et al. 1996).

For an assigned continuous probability distribution and a fixed number of approximating points k, the matter is how to build an “optimal” discrete approximation. Several criteria have been used so far; here is a rough classification into four main categories:

  • moment equalization or moment matching: the discrete approximation is the one preserving as many moments as possible of the original distribution. Moment matching is carried out through a procedure known as Gaussian quadrature (Golub and Welsch 1969); in its more authentic form, the support points of the discrete approximation and their probabilities are derived simultaneously. This is by far the oldest and most popular discretization technique; as stated by Miller and Rice (1983), “Few people would accept an approximation that did not have roughly the same mean, variance, and skew as the original distribution”. However, its application requires that the first integer moments of the continuous rv be finite and available in closed form: many random distributions commonly used in quantitative finance, for example, do not possess even lower-order moments. Several variants of Gaussian quadrature have been proposed; for example, Tanaka and Toda (2013) suggested that the k support points of the approximating distribution be chosen “a priori” and the probabilities be derived by maximizing the relative entropy to an assigned “reference distribution”, under the constraint that it matches as many moments as possible. Convergence properties of this maximum entropy method and applications to stochastic processes were later presented in Tanaka and Toda (2015), Farmer and Toda (2017);

  • preservation of the distribution function: the discrete approximation preserves, at each support point, the value of the cdf or, alternatively, of the survival function (Roy and Dasgupta 2001). Actually, this technique has been employed for constructing discrete counterparts of continuous distributions, by defining the pmf of the discrete counterpart as \(p(i)=F(i)-F(i-1)\), with i integer; such a construction automatically supplies a valid pmf, which preserves the cdf of the continuous distribution at any integer value i (Chakraborty 2015). Straightforward modifications have been proposed in order to handle finite supports consisting of possibly non-integer points;

  • minimizing the mean squared error between the assigned continuous rv \(X\sim F\) and its approximation \({\hat{X}}\), i.e., minimizing \({\mathbb {E}}_F(X-{\hat{X}})^2\), where the expected value is computed with respect to the distribution of X; this yields the so-called optimal quantization, which is a well-known technique in signal theory (Gray and Neuhoff 1998). The minimization problem can be solved iteratively, by alternately solving two sets of conditions, one expressing each support point (also named “quantum”) as the center of mass of the intervals of a partition of the support of the continuous rv, and the other expressing the endpoint of each of these intervals as the midpoint of the segments between successive quanta (Lloyd 1982);

  • minimizing a distance between the two distribution functions: the discrete approximation is obtained as the distribution minimizing some statistical distance between the two (continuous and step-wise) cdfs. In Kennan (2006), an analytical solution is found for the optimal k-point discrete approximation when employing a particular class of distances.

In this work, we will concentrate on the last class of discrete approximations. More precisely, we propose the use of three distances: the Cramér-von Mises, the Anderson-Darling, and the Cramér distance between the cdf of the assigned continuous rv and the step-wise cdf of its k-point discrete approximation. They can be all regarded as weighted Cramér-von Mises distances with a proper choice of the weighting function. We will derive for these three cases the optimal solution (i.e., the k-point discrete approximation leading to the minimum distance) either analytically or computationally, providing in the latter case some details about the numerical procedure to be implemented. To the best of our knowledge, these distances have not been used regularly in the existing literature for the discrete approximation of a continuous rv. The aim of this paper is to shed some light on their use and explain their pros and cons.

The rest of the paper is structured as follows. In the next section, we will discuss distances between cdfs in general and then present and solve the main research question, that is, finding an optimal k-point discrete approximation to a continuous random distribution through minimization of a statistical distance. The three statistical distances mentioned above will be considered and the corresponding solutions to the problem will be described and compared, with practical reference to some known parametric families of continuous distributions. Section 3 illustrates two applications of discretization to the (approximate) determination of the distribution or of some parameter of an assigned function of independent rvs, which is carried out by using each of the discrete approximations previously described and also other existing techniques. A software implementation in the R programming environment is presented in Sect. 4. Some comments and remarks are provided in the last section.

2 Optimal k-point approximation based on minimization of a distance between distribution functions

2.1 Statement of the problem

Let us consider two cdfs F(x) and G(x), and assume that F(x) possesses a density, say f(x), so that we can write \(F(x)=\int _{-\infty }^x f(u)\text {d}u\) for any real x. Several statistical distances between such two cdfs have been proposed. The most popular is the Kolmogorov-Smirnov (KS) distance, defined as \(\sup _{x\in {\mathbb {R}}} |F(x)-G(x)|\), i.e., the maximum (or, better, supremum) absolute difference between the two cdfs. The KS distance is commonly employed when one has to test whether an i.i.d. sample \((x_1,x_2,\dots ,x_n)\) is consistent with some known cdf F; in this case, the distance to be computed is between the assigned F and the empirical cdf \({\tilde{F}}(x)\), defined as \({\tilde{F}}(x)=\frac{1}{n}\sum _{i=1}^n \mathbbm {1}(x_i\le x)\).

Another wide family of distances between cdfs is the following:

$$\begin{aligned} d_2(F,G) = \int _{-\infty }^{+\infty } [F(x)-G(x)]^2 w(x) \text {d}x, \end{aligned}$$

where w(x) is some non-negative function on \({\mathbb {R}}\).

If we set \(w(x)\equiv 1\), then we obtain the second-order squared Cramér distance (Cramér 1928), which is closely related to the so-called energy distance (Rizzo and Székely 2016). If we set \(w(x)\equiv f(x)\), then we obtain the so-called Cramér-von Mises distance, which can be also conveniently written as \(\int _0^1 [u-G(F^{-1}(u))]^2\text {d}u\), provided that F(x) is invertible. If we set \(w(x)=f(x)[F(x)(1-F(x))]^{-1}\), then we obtain the Anderson-Darling distance; with respect to Cramér-von Mises, this distance puts more weight in the tails of the distribution, i.e., where F(x) is close to either zero or one.
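The unit-interval form of the Cramér-von Mises distance follows from a change of variable. As a brief sketch, assuming F is strictly increasing so that \(x=F^{-1}(u)\) is well defined, substituting \(u=F(x)\), \(\text {d}u=f(x)\text {d}x\) in (1) with \(w(x)\equiv f(x)\) gives

$$\begin{aligned} \int _{-\infty }^{+\infty } [F(x)-G(x)]^2 f(x)\text {d}x = \int _0^1 [u-G(F^{-1}(u))]^2\text {d}u, \end{aligned}$$

and the same substitution applied to the Anderson-Darling weight yields \(\int _0^1 [u-G(F^{-1}(u))]^2[u(1-u)]^{-1}\text {d}u\).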

Hanebeck and Klumpp (2008), also referring to the less explored multivariate case, remark “how statistical distances are used for both analysis and synthesis purposes. Analysis is concerned with assessing whether a given sample stems from a given continuous distribution. Synthesis is concerned with both density estimation, i.e., calculating a suitable continuous approximation of a given sample, and density discretization, i.e., approximation of a given continuous random vector by a discrete one”. In this work, we focus on the latter aspect of research.

If one is interested in building an optimal k-point discrete approximation of a given continuous cdf F, i.e., a discrete probability distribution somehow resembling the original one, then it can be found as the discrete distribution minimizing, over all the k-point discrete distributions, the distance (1), computed between the cdf F and the cdf \({\hat{F}}\) of the discrete approximation, for a given choice of the weighting function w(x).

The problem can be more formally stated as follows. Suppose we want to approximate the continuous probability distribution F by a discrete probability distribution \({\hat{F}}\), consisting of \(k>1\) points \(x_1<x_2<\dots<x_{k-1}<x_k\), with probabilities \(p_i\), \(i=1,\dots ,k\) (obviously, \(p_i\ge 0\) for \(i=1,\dots ,k\) and \(\sum _{i=1}^k p_i=1\)). Let us gather the support points and the probabilities of the discrete distribution in a vector \(\pmb {\eta }=(x_1,\dots ,x_k,p_1,\dots ,p_k)\). Then, the optimal discrete distribution can be defined as the one (uniquely identified by \(\hat{\pmb {\eta }}\)) minimizing \(d(F(x),{\hat{F}}(x;\pmb {\eta }))\): \(\hat{\pmb {\eta }} = \arg \min d(F,{\hat{F}};\pmb {\eta })\). For the case \(w(x)\equiv f(x)\), corresponding to the Cramér-von Mises distance, an analytical solution is available (Kennan 2006). When w(x) is a piece-wise constant function, the problem was solved numerically by Schremp et al. (2006), through an iterative procedure based on a homotopy continuation approach. For any other possible choice of the weighting function w(x), we can solve the minimization problem (at least) numerically.

We start our analysis from the Cramér-von Mises distance.

2.2 Cramér-von Mises

The Cramér-von Mises distance between continuous cdfs is one of the distinguished measures of deviation between distributions (Cramér 1928; von Mises 1931). For a probabilistic interpretation, see Baringhaus and Henze (2017). It is obtained by setting \(w(x)\equiv f(x)\) in (1), so that the distance between F and \({\hat{F}}\) becomes

$$\begin{aligned} d_{CvM}(F,{\hat{F}})&=\int _{{\mathbb {R}}} |F(x)-{\hat{F}}(x)|^2 \text {d}F(x)=\int _0^1|t-{\hat{F}}(F^{-1}(t))|^2 \text {d}t, \end{aligned}$$

where \(F^{-1}\) denotes the mathematical inverse of the cdf F of X, which we assume to be strictly increasing over the support of X; the distance in (2) is a particular case of

$$\begin{aligned} d_r(F,{\hat{F}})&=\int _{{\mathbb {R}}} |F(x)-{\hat{F}}(x)|^r \text {d}F(x)=\int _0^1|t-{\hat{F}}(F^{-1}(t))|^r \text {d}t, r>0. \end{aligned}$$

The best k-point discrete approximation (that is, the one yielding the minimum value of the distance \(d_r\)) has been proved (Kennan 2006) to have, for any \(r>0\), equally-weighted support points \(x_i\) (i.e., with probability 1/k each) such that \(F(x_i)=\frac{2i-1}{2k}\) or, equivalently,

$$\begin{aligned} x_i=F^{-1}\left( \frac{2i-1}{2k}\right) ,\quad i=1,\dots ,k. \end{aligned}$$

Here we report the proof, fixing and adding some points left out in Kennan (2006).

Let us introduce the quantities \(q_i\) and \(Q_i\), defined in the following way: \(q_i=F(x_i)\), \(i=1,\dots ,k\), setting \(q_0=0\) and \(q_{k+1}=1\); \(Q_i={\hat{F}}(x_i)\), \(i=1,\dots ,k\), setting \(Q_0=0\); it follows that \(Q_k=1\). Therefore, the \(q_i\) represent the values of the cdf of the continuous rv at the support points \(x_i\) of its discrete approximation; the \(Q_i\) represent the values of the cdf of the discrete approximating rv at its support points \(x_i=F^{-1}(q_i)\), so that \(p_i=Q_i-Q_{i-1}\), \(i=1,\dots ,k\) (see Fig. 1).

Fig. 1 Meaning of \(q_i\) and \(Q_i\) when constructing a k-point discrete approximation based on the minimization of the Cramér-von Mises distance (here \(k=4\)). Since \(q_i=F(x_i)\) and \(Q_i={\hat{F}}(x_i)\), it is possible to alternatively represent any k-point discrete approximation as a set of k points \((q_i,Q_i)\) belonging to the unit square

The distance (3) can be thus rewritten as

$$\begin{aligned} d_r(F,{\hat{F}}) = \sum _{i=0}^k \int _{q_i}^{q_{i+1}} |t-Q_i|^r\text {d}t , \end{aligned}$$

and the first-order condition on the \(Q_i\) gives

$$\begin{aligned} |q_i-Q_i|^r - |q_{i+1}-Q_i|^r=0, \end{aligned}$$

from which \(Q_i=(q_i+q_{i+1})/2\); the first-order condition on the \(q_i\) provides

$$\begin{aligned} -|q_i-Q_i|^r + |q_i-Q_{i-1}|^r = 0, \end{aligned}$$

from which \(q_i=(Q_{i-1}+Q_i)/2\). Combining these two last results together, we obtain

$$\begin{aligned} Q_i =\frac{1}{2}\left( \frac{Q_{i-1}+Q_i}{2} + \frac{Q_i+Q_{i+1}}{2} \right) \end{aligned}$$

that is, \(Q_i -Q_{i-1}=Q_{i+1}-Q_i\), and since \(Q_0=0\) and \(Q_k=1\), we derive \(Q_i=i/k\) and hence \(p_i=Q_i-Q_{i-1}=1/k\), for \(i=1,\dots ,k\). By substituting this expression in that for the \(q_i\), we obtain

$$\begin{aligned} q_i=\frac{1}{2}\left( \frac{i-1}{k}+\frac{i}{k}\right) =\frac{2i-1}{2k}, \end{aligned}$$

and then \(x_i=F^{-1}((2i-1)/(2k))\) for \(i=1,\dots ,k\). Note the particular form of the optimal solution: the specific continuous distribution enters the equation of the support points only through the inverse of its cdf F, which is applied to the \(q_i\), which are “distribution-free” quantities; the probabilities \(p_i\) are independent of F as well, being all constant.
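As an illustration of the closed-form solution just derived, the following minimal Python sketch builds the k-point approximation \(x_i=F^{-1}((2i-1)/(2k))\), \(p_i=1/k\) from a quantile function. The function names (`cvm_approximation`, `mean_and_variance`, `exp_quantile`) are hypothetical helpers introduced here for illustration only; they are not part of the software described in Sect. 4.

```python
import math
from statistics import NormalDist


def cvm_approximation(quantile, k):
    """k-point approximation minimizing the Cramer-von Mises distance.

    `quantile` is the inverse cdf F^{-1}; the support points are
    x_i = F^{-1}((2i-1)/(2k)) and every probability equals 1/k.
    """
    points = [quantile((2 * i - 1) / (2 * k)) for i in range(1, k + 1)]
    probs = [1.0 / k] * k
    return points, probs


def mean_and_variance(points, probs):
    """Expectation and variance of a discrete distribution."""
    mean = sum(p * x for x, p in zip(points, probs))
    var = sum(p * (x - mean) ** 2 for x, p in zip(points, probs))
    return mean, var


def exp_quantile(u, rate=1.0):
    """Quantile function of the exponential distribution."""
    return -math.log(1.0 - u) / rate


# 7-point approximations of a standard normal and a unit-rate exponential
normal_points, normal_probs = cvm_approximation(NormalDist().inv_cdf, 7)
exp_points, exp_probs = cvm_approximation(exp_quantile, 7)
exp_mean, exp_var = mean_and_variance(exp_points, exp_probs)
```

Only the quantile function changes from one distribution to another, which reflects the distribution-free nature of the \(q_i\) and \(p_i\).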

Another interesting property of this solution can be pointed out. Letting \({\mathcal {F}}_k\) be the set of discrete distributions with k support points, we have just proved that for each \(r>1\) the k-point discrete approximation \({\hat{F}}\in {\mathcal {F}}_k\) that satisfies

$$\begin{aligned} \inf _{G\in {\mathcal {F}}_k} d_r^{1/r}(F,G)&= \inf _{G\in {\mathcal {F}}_k} \left( \int _{{\mathbb {R}}} |F(x)-G(x)|^r\text {d}F(x)\right) ^{1/r} \\&=\inf _{G\in {\mathcal {F}}_k} \left( \int _0^1 |t-G(F^{-1}(t))|^r\text {d}t\right) ^{1/r}\\&=\left( \int _0^1 |t-{\hat{F}}(F^{-1}(t))|^r\text {d}t\right) ^{1/r} =\left( \int |F(x)-{\hat{F}}(x)|^r\text {d}F(x)\right) ^{1/r}\\&=d_r^{1/r}(F,{\hat{F}}) \end{aligned}$$

is the discrete uniform distribution on the set of points \(x_i=F^{-1}((2i-1)/(2k))\), \(i=1,\dots ,k\). Since, for each \(G\in {\mathcal {F}}_k\),

$$\begin{aligned} \lim _{r\rightarrow \infty } \left( \int _0^1 |t-G(F^{-1}(t))|^r\text {d}t\right) ^{1/r}&= \sup _{0<t<1} |t-G(F^{-1}(t))| = \sup _{x\in {\mathbb {R}}} |F(x)-G(x)|\\&=d_{KS}(F,G), \end{aligned}$$

recalling the relationship between the \(L^r\) norm and the supremum norm (see e.g. Stein and Shakarchi 2011), one obtains

$$\begin{aligned} d_{KS}(F,G) = \lim _{r\rightarrow \infty } d_r^{1/r}(F,G) \ge \lim _{r\rightarrow \infty } d_r^{1/r}(F,{\hat{F}}) = d_{KS}(F,{\hat{F}}) \end{aligned}$$

and deduces from this that

$$\begin{aligned} \inf _{G\in {\mathcal {F}}_k} d_{KS}(F,G) = d_{KS} (F,{\hat{F}}), \end{aligned}$$

that is, for a fixed integer k, the best approximating distribution obtained by minimizing \(d_r\) is the same we would obtain by minimizing \(d_{KS}\).

Another peculiarity of the optimal solution or, better, of the statistical distance used as a criterion for finding an optimal solution, is its all-encompassing applicability, since it does not require the existence and finiteness of any integer moment of the original continuous distribution: the distance (3), since it can be rewritten as an integral over the unit interval of a quantity which is finite, can always be computed and always possesses a global minimum, corresponding to the solution derived above. Therefore, it is possible to derive such a discrete approximation for Student’s t, say, with any value of the degree-of-freedom parameter, and for other heavy-tailed distributions, which would not be possible in general if using the moment-matching or quantization techniques (including the method by Drezner and Zerom (2016), which can be considered as a compromise between the two techniques): the finiteness of the first \(2k-1\) moments is required by the former, the finiteness of the first two moments by the latter. It can also be shown (Barbiero and Hitaj 2022) that \(\lim _{k\rightarrow \infty } {\hat{F}}_k(x) = F(x)\) \(\forall x\in {\mathbb {R}}\), i.e., the approximating k-point discrete rv converges in distribution to the original rv.

The distance (3) can be further rewritten as the sum of \(k+1\) integrals,

$$\begin{aligned} d_r(F,{\hat{F}})&=\int _0^{q_1} t^r\text {d}t + \sum _{i=1}^{k-1} \int _{q_i}^{q_{i+1}} |t-Q_i|^r\text {d}t + \int _{q_k}^1 |1-t|^r\text {d}t\\&=\frac{q_1^{r+1}}{r+1} + \sum _{i=1}^{k-1}\left( \frac{|q_{i+1}-Q_i|^{r+1}}{r+1}+\frac{|Q_i-q_i|^{r+1}}{r+1}\right) + \frac{|1-q_k|^{r+1}}{r+1}, \end{aligned}$$

and its minimum value is thus equal to

$$\begin{aligned} \min d_{r}&=\left( \frac{1}{2k}\right) ^{r+1}\frac{1}{r+1} + 2(k-1)\left( \frac{1}{2k}\right) ^{r+1}\frac{1}{r+1} + \left( \frac{1}{2k}\right) ^{r+1}\frac{1}{r+1}\nonumber \\&=\frac{(2k)^{-r}}{r+1}, \end{aligned}$$

since for the optimal solution \(q_1=q_{i+1}-Q_i=Q_i-q_i=1-q_k=1/(2k)\), for each \(i=1,\dots ,k-1\). Furthermore, as one can expect, the minimum distance is a decreasing function of k for any fixed \(r>0\): by increasing the number of points, the optimal approximating discrete distribution gets closer (in terms of Cramér-von Mises distance) to the continuous one.
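The closed-form expressions above can be checked numerically. The sketch below evaluates the sum of integrals for arbitrary \(q_i\) and \(Q_i\) (under the assumption \(q_i\le Q_i\le q_{i+1}\), with \(Q_0=0\) and \(Q_k=1\) implied) and compares it with the minimum value \((2k)^{-r}/(r+1)\); the names `d_r`, `optimal_solution` and `min_d_r` are hypothetical and chosen only for this example.

```python
def d_r(q, Q_inner, r):
    """Distance d_r written as a sum of k+1 integrals in closed form.

    q       : [q_1, ..., q_k]
    Q_inner : [Q_1, ..., Q_{k-1}]  (Q_0 = 0 and Q_k = 1 are implied)
    Assumes q_i <= Q_i <= q_{i+1} for every inner index i.
    """
    k = len(q)
    # first and last integrals (on [0, q_1] and on [q_k, 1])
    total = q[0] ** (r + 1) + (1.0 - q[-1]) ** (r + 1)
    for i in range(1, k):  # inner integrals, i = 1, ..., k-1
        Q_i = Q_inner[i - 1]
        total += abs(q[i] - Q_i) ** (r + 1) + abs(Q_i - q[i - 1]) ** (r + 1)
    return total / (r + 1)


def optimal_solution(k):
    """Optimal q_i = (2i-1)/(2k) and inner Q_i = i/k."""
    q = [(2 * i - 1) / (2 * k) for i in range(1, k + 1)]
    Q_inner = [i / k for i in range(1, k)]
    return q, Q_inner


def min_d_r(k, r):
    """Minimum value of the distance, (2k)^(-r) / (r+1)."""
    return (2 * k) ** (-r) / (r + 1)


q_opt, Q_opt = optimal_solution(5)
value_at_optimum = d_r(q_opt, Q_opt, 2)   # to be compared with min_d_r(5, 2)

# moving one q_i away from its optimal value increases the distance
q_shifted = list(q_opt)
q_shifted[2] += 0.02
value_shifted = d_r(q_shifted, Q_opt, 2)

# the r-th root of the minimum approaches 1/(2k) as r grows
roots = [min_d_r(5, r) ** (1.0 / r) for r in (1, 10, 100)]
```

The last line illustrates the link with the Kolmogorov-Smirnov distance discussed above: for \(k=5\), the r-th root of the minimum distance approaches \(1/(2k)=0.1\) as r increases.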

Table 1 displays, just for illustrative purposes, k-point approximations (\(k=5;6;7\)) of a standard normal and an exponential distribution (with unit rate parameter). For each k and for both distributions, values of expectation and variance (simply indicated as \(\mu \) and \(\sigma ^2\)) are reported, in order to be easily compared with the analogous values of the continuous distribution.

Table 1 k-point discrete approximations of two continuous random distributions based on the minimization of the Cramér-von Mises distance; \(\mu \) and \(\sigma ^2\) indicate expectation and variance of the approximation

For the standard normal distribution, variance and kurtosis of the k-point discrete approximation monotonically converge to the corresponding value of the original continuous distribution rather slowly: kurtosis, in particular, when \(k=100\), is 2.834. This result could have been expected: although the cdf of this discrete approximation converges to the cdf of the original continuous rv, this does not mean that for a finite k the two functions are very similar: for the normal case, recall that the pdf of the continuous rv is the classical bell-shaped curve, whereas the pmf of the discrete approximation has uniform probabilities, and this overall translates into a mismatch of (even) moments. Recall that moment equalization would produce discrete distributions with a very large range and very small probabilities assigned to the extreme support points (Barbiero and Hitaj 2022).

For the discrete approximation of the exponential distribution, expected value, variance, skewness, and kurtosis tend monotonically and asymptotically to the value of the parent distribution (1, 1, 2, and 9, respectively), but for finite k, it underestimates kurtosis to a large extent; when \(k=7\), the value of kurtosis for the discretized exponential distribution is 2.779 (its expected value is 0.951, its variance 0.686 and its skewness 0.963); when \(k=100\), it is 6.662 (its expected value is 0.997, its variance 0.960 and its skewness 1.759).

2.3 Anderson-Darling

What happens if we consider a weighted Cramér-von Mises distance, for example the Anderson-Darling distance, characterized by the weighting function \(w(x)\equiv f(x)/[F(x)(1-F(x))]\)? We may expect that the best approximating distribution has now unequal probabilities or that the support points are no longer quantiles of equally-spaced orders of the original distribution, as it occurs with Cramér-von Mises distance.

Minimizing the Anderson-Darling distance is equivalent to minimizing the following quantity, which consists of the sum of \(k+1\) contributions:

$$\begin{aligned} d_{AD}(F,{\hat{F}})&=\int _{{\mathbb {R}}} |F(x)-{\hat{F}}(x)|^2 [F(x)(1-F(x))]^{-1}\text {d}F(x)\\&=\int _0^1\frac{|t-{\hat{F}}(F^{-1}(t))|^2}{t(1-t)} \text {d}t=\sum _ {i=0}^k \int _{q_{i}}^{q_{i+1}} \frac{(t-Q_i)^2}{{t(1-t)}}\text {d}t, \end{aligned}$$

with respect to the \(q_i\) and \(Q_i\). The first-order condition for \(Q_i\) is

$$\begin{aligned} \int _{q_{i}}^{q_{i+1}} \frac{d}{dQ_i} \frac{(t-Q_i)^2}{{t(1-t)}}\text {d}t&= \int _{q_{i}}^{q_{i+1}} -\frac{2(t-Q_i)}{t(1-t)} \text {d}t\\&=\int _{q_{i}}^{q_{i+1}} -\frac{2}{1-t}+\frac{2Q_i}{t(1-t)}\text {d}t \\&= \left. 2\log (1-t) + 2Q_i\log \left( \frac{t}{1-t} \right) \right| _{q_i}^{q_{i+1}}\\&= 2\log \left( \frac{1-q_{i+1}}{1-q_i}\right) \\&\quad + 2Q_i\log \left( \frac{q_{i+1}(1-q_i)}{q_i(1-q_{i+1})} \right) =0, \end{aligned}$$

for \(i=1,\dots ,k\), from which

$$\begin{aligned} Q_i = \log \left( \frac{1-q_{i}}{1-q_{i+1}}\right) / \log \left( \frac{q_{i+1}(1-q_i)}{q_i(1-q_{i+1})} \right) . \end{aligned}$$

The first-order condition for \(q_i\) implies

$$\begin{aligned} \frac{|q_i-Q_{i-1}|^2}{{q_i(1-q_i)}} - \frac{|q_{i}-Q_{i}|^2}{{q_{i}(1-q_{i})}} = 0,\quad i=1,\dots ,k, \end{aligned}$$

from which we obtain again

$$\begin{aligned} q_i= (Q_{i-1} + Q_i)/2, i=1,\dots ,k. \end{aligned}$$

The solution derived from Eqs. (6) and (7) cannot be expressed in an analytic closed form for each \(q_i\) and \(Q_i\). However, a simple iterative algorithm can be implemented with the aim of recovering their values numerically; as initial guess values for \(q_i\) and \(Q_i\), we adopt their optimal values found by minimizing the Cramér-von Mises distance. In a similar fashion as done in Pavlikov and Uryasev (2018), one can think of alternately updating the values of the probabilities \(p_i\) and the values of the discrete points \(x_i\) (or, better, their probability transforms \(q_i\)) until convergence. Note that in Pavlikov and Uryasev (2018) all the \(p_i\) are initially set equal to 1/k, as for the analytical solution based on the Cramér-von Mises distance. The algorithm works as follows:

  1. Set \(t=0\); for \(i=1,\dots ,k\), set \(p_i^{(0)}=1/k\), \(Q_i^{(0)}=i/k\)

  2. Set \(\epsilon ^{(0)}=1\) (or any large positive value) and \(\epsilon _{\max }=10^{-6}\) (or any arbitrarily small positive value, to be used for checking convergence of the solution)

  3. While \(\epsilon ^{(t)}>\epsilon _{\max }\):

    (a) Update the iteration index \(t\leftarrow t+1\)

    (b) Update the \(q_i\) according to (7): \(q_i^{(t)}=\frac{Q_{i-1}^{(t-1)} + Q_i^{(t-1)}}{2}\)

    (c) Update the \(Q_i\) according to (6): \(Q_i^{(t)} = \log \left( \frac{1-q_{i}^{(t)}}{1-q_{i+1}^{(t)}}\right) / \log \left( \frac{q_{i+1}^{(t)}(1-q_i^{(t)})}{q_i^{(t)}(1-q_{i+1}^{(t)})} \right) \)

    (d) Derive the updated probabilities for the discrete rv: \(p_i^{(t)}=Q_i^{(t)}-Q_{i-1}^{(t)}\)

    (e) Calculate the maximum absolute deviation between two consecutive iterations in terms of \(p_i\): \(\epsilon ^{(t)}=\max _{i=1}^k |p_i^{(t)}-p^{(t-1)}_i|\). Alternatively, other distances can be used to compare the probability vectors obtained in two consecutive iterations, such as the Euclidean distance \(\sqrt{\sum _{i=1}^k(p_i^{(t)}-p^{(t-1)}_i)^2}\)

  4. Return \(q_i^{(t)}\), \(Q_i^{(t)}\) and \(p_i^{(t)}\)

The number of iterations required clearly depends on the threshold \(\epsilon _{\max }\) and on the number of points k: diminishing \(\epsilon _{\max }\) or increasing k leads to a larger number of iterations; for plausible values of \(\epsilon _{\max }\) (say \(10^{-8}\)) and k (say smaller than one hundred), the computation times are in the order of fractions of a second.
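The iterative scheme above can be sketched in a few lines of Python. This is only an illustrative transcription, not the R implementation of Sect. 4: the function name `anderson_darling_approximation`, the default tolerance and the `max_iter` safeguard are assumptions made here. Since \(Q_0=0\) and \(Q_k=1\) are fixed by construction, the sketch applies update (6) only for \(i=1,\dots ,k-1\).

```python
import math


def anderson_darling_approximation(k, eps_max=1e-10, max_iter=100_000):
    """Alternating updates of q_i (Eq. 7) and Q_i (Eq. 6).

    Returns ([q_1..q_k], [Q_1..Q_k], [p_1..p_k]); the support points of
    the approximation are then x_i = F^{-1}(q_i) for the chosen cdf F.
    """
    # Q[0] = 0 and Q[k] = 1 stay fixed; start from the Cramer-von Mises optimum
    Q = [i / k for i in range(k + 1)]
    p = [1.0 / k] * k
    q = [0.0] * (k + 2)      # q[0] = 0 and q[k+1] = 1 by convention
    q[k + 1] = 1.0
    for _ in range(max_iter):
        # step (b): each q_i is the midpoint of Q_{i-1} and Q_i
        for i in range(1, k + 1):
            q[i] = 0.5 * (Q[i - 1] + Q[i])
        # step (c): update the inner Q_i from the new q_i
        for i in range(1, k):
            num = math.log((1 - q[i]) / (1 - q[i + 1]))
            den = math.log(q[i + 1] * (1 - q[i]) / (q[i] * (1 - q[i + 1])))
            Q[i] = num / den
        # steps (d)-(e): new probabilities and convergence check
        new_p = [Q[i] - Q[i - 1] for i in range(1, k + 1)]
        eps = max(abs(a - b) for a, b in zip(new_p, p))
        p = new_p
        if eps <= eps_max:
            break
    return q[1:k + 1], Q[1:], p


q3, Q3, p3 = anderson_darling_approximation(3)
```

For instance, applying `NormalDist().inv_cdf` from the standard `statistics` module to the returned `q` values would give the support points for a standard normal rv.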

By numerical inspection, the Anderson-Darling weighting function leads to an inverted U-shaped trend for the \(p_i\), which turn out to be symmetrical around the central value or values, according to whether k is odd or even, i.e., \(p_j=p_{k-j+1}\), \(j=1,\dots ,k\); see Table 2. As for the optimal \(q_i\) values, they present a form of symmetry around the central value(s), such that a continuous symmetrical distribution remains symmetrical after discretization (see Tables 2 and 3a, referring to the discretization of a standard normal rv).

If we consider the exponential distribution with unit rate parameter and we derive the k-point discrete approximation by minimizing the Anderson-Darling distance with \(k=7\) (see Table 3b), its expected value turns out to be equal to 0.961, its variance 0.743, its skewness 1.147, and its kurtosis 3.424. Although not so close to the values of the underlying continuous distribution, the discrepancies are smaller than those resulting from the Cramér-von Mises approximation (see the previous subsection).

Table 2 5-, 6- and 7-point discrete approximations based on the minimization of the Anderson-Darling distance
Table 3 k-point discrete approximation of two continuous random distributions based on the minimization of the Anderson-Darling distance

We note that using a distance \(d(F,{\hat{F}})=\int _{{\mathbb {R}}}(F(x)-{\hat{F}}(x))^2w(F(x))\text {d}F(x)\), with any other possible positive weighting function w(x), even asymmetrical, supplies (although in general only numerically) values of \(q_i\) and \(Q_i\) that do not depend on the specific cdf F(x); the original distance can in fact be rewritten as \(\sum _{i=0}^k \int _{q_i}^{q_{i+1}}(t-Q_i)^2w(t)\text {d}t\) and the first-order condition on the \(Q_i\) becomes \(\int _{q_i}^{q_{i+1}}(t-Q_i)w(t)\text {d}t=0\) and thus the \(Q_i\) that solve this condition can be expressed as a function, dependent on w(x), of the \(q_i\) and \(q_{i+1}\) only. The simplest and most apparent case is provided by the “unweighted” Cramér-von Mises distance discussed in Sect. 2.2, where such values are available through simple analytical formulas involving just the index i and the number of support points k.

2.4 Cramér distance

We now consider the Cramér distance between the cdf of a continuous rv and that of its discrete approximation:

$$\begin{aligned} d_C(F,{\hat{F}})&=\int _{-\infty }^{+\infty } (F(x)-{\hat{F}}(x))^2\text {d}x = \sum _{i=0}^k \int _{F^{-1}(q_i)}^{F^{-1}(q_{i+1})} (F(x)-Q_i)^2\text {d}x, \end{aligned}$$

which can be also rewritten as

$$\begin{aligned} d_C(F,{\hat{F}})&=\int _0^1 \frac{(t-{\hat{F}}(F^{-1}(t)))^2}{F'(F^{-1}(t))}\text {d}t =\sum _{i=0}^k \int _{q_i}^{q_{i+1}} \frac{(t-Q_i)^2}{F'(F^{-1}(t))}\text {d}t. \end{aligned}$$

This last expression highlights how, differently from the two distances previously examined, the Cramér distance cannot always be computed: for some cdf F, the first and/or the last term of the sum may not be finite. Although the two corresponding integrals are computed over a limited interval, the integrand may be unbounded: when t tends to 0 or t tends to 1, the denominator \(F'(F^{-1}(t))\) may tend to zero; in the former case, for example, it may tend to zero as fast as \(t^3\) or faster, and then the improper integral would not converge. We will see an example later.

Assuming that the Cramér distance can be computed, the first-order condition on the \(q_i\) leads again to

$$\begin{aligned} q_i=\frac{Q_i+Q_{i-1}}{2}, i=1,\dots ,k; \end{aligned}$$

while the first-order condition on the \(Q_i\) leads to

$$\begin{aligned} \int _{q_i}^{q_{i+1}}\frac{\text {d}}{\text {d}Q_i}\frac{(t-Q_i)^2}{f(F^{-1}(t))}\text {d}t&=\int _{q_i} ^{q_{i+1}}\frac{-2(t-Q_i)}{f(F^{-1}(t))}\text {d}t\\&=-2\int _{F^{-1}(q_i)}^{F^{-1}(q_{i+1})}(F(x)-Q_i)\text {d}x=0 \end{aligned}$$

which becomes

$$\begin{aligned} \int _{F^{-1}(q_i)}^{F^{-1}(q_{i+1})}F(x)\text {d}x = Q_i[F^{-1}(q_{i+1}) - F^{-1}(q_i)], \end{aligned}$$

for \(i=1,\dots ,k\). This condition therefore depends on the specific distribution F of the continuous rv X.

We will now analytically develop condition (9) for several choices of F, by examining some well-known parametric families. We will see that in general Eq. (9), combined with the condition on the \(q_i\), does not lead to a closed-form solution for the discrete approximation, which must therefore be computed numerically, for instance by resorting to the iterative procedure sketched out in Sect. 2.3.

2.4.1 Normal

In case of a standard normal rv, with cdf \(F(x)=\varPhi (x)=\int _{-\infty }^x \frac{1}{\sqrt{2\pi }}e^{-t^2/2}\text {d}t\), for which the following equality holds (see, for example, Owen (1980), formula 1000):

$$\begin{aligned} \int \varPhi (x)\text {d}x=x\varPhi (x)+\phi (x)+\text {constant}, \end{aligned}$$

where \(\phi (x)=\varPhi '(x)\), the first-order condition on the \(Q_i\) becomes

$$\begin{aligned} Q_i = \frac{\varPhi ^{-1}(q_{i+1})\cdot q_{i+1} + \phi (\varPhi ^{-1}(q_{i+1})) -\varPhi ^{-1}(q_i)\cdot q_i -\phi (\varPhi ^{-1}(q_i))}{\varPhi ^{-1}(q_{i+1}) - \varPhi ^{-1}(q_i)},\quad i=1,\dots ,k, \end{aligned}$$

which, along with the first-order condition on the \(q_i\), allows us to determine the optimal solution numerically, following analogous steps to those sketched out in Sect. 2.3.
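The numerical solution can be sketched in a few lines. The Python code below is only an illustration, not the authors' R implementation: it assumes that the iterative procedure of Sect. 2.3 alternates the two first-order conditions starting from equally spaced \(Q_i\), and the names `normal_Q` and `cramer_normal` are hypothetical. Only the free values \(Q_1,\dots ,Q_{k-1}\) are updated, since \(Q_0=0\) and \(Q_k=1\) are fixed.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal; cdf, pdf and inv_cdf are stdlib methods


def normal_Q(q_lo, q_hi):
    """Closed-form first-order condition on Q_i for the standard normal."""
    a, b = STD_NORMAL.inv_cdf(q_lo), STD_NORMAL.inv_cdf(q_hi)
    num = b * q_hi + STD_NORMAL.pdf(b) - a * q_lo - STD_NORMAL.pdf(a)
    return num / (b - a)


def cramer_normal(k, tol=1e-12, max_iter=100_000):
    """Alternate the two first-order conditions until the Q_i stop moving.

    Assumption: this alternating scheme stands in for the procedure of Sect. 2.3.
    """
    Q = [i / k for i in range(k + 1)]  # starting guess with Q_0 = 0 and Q_k = 1
    for _ in range(max_iter):
        # condition on the q_i: midpoints of consecutive cumulative probabilities
        q = [(Q[i] + Q[i - 1]) / 2 for i in range(1, k + 1)]
        # condition on the Q_i: closed form, for the free indices 1..k-1 only
        new_Q = [0.0] + [normal_Q(q[i - 1], q[i]) for i in range(1, k)] + [1.0]
        shift = max(abs(u - v) for u, v in zip(Q, new_Q))
        Q = new_Q
        if shift < tol:
            break
    q = [(Q[i] + Q[i - 1]) / 2 for i in range(1, k + 1)]
    x = [STD_NORMAL.inv_cdf(t) for t in q]          # support points
    p = [Q[i] - Q[i - 1] for i in range(1, k + 1)]  # probabilities
    return x, p, q, Q


if __name__ == "__main__":
    x, p, _, _ = cramer_normal(5)
    for xi, pi in zip(x, p):
        print(f"{xi: .4f}  {pi:.4f}")
```

Each update minimizes the distance with respect to one block of unknowns while the other is held fixed, so the distance cannot increase from one sweep to the next; the stopping tolerance and the starting guess are arbitrary choices.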

If instead of considering a standard normal rv, we focus on a generic normal rv with parameters \(\mu \) and \(\sigma ^2\), with cdf \(F(x)=\varPhi (\frac{x-\mu }{\sigma })\), the first-order condition on the \(Q_i\) would not change; both the left and right members of (9), in fact, are simply multiplied by \(\sigma \).

Table 4 displays the values \((x_i,p_i)\) of the best k-point approximation of a standard normal rv, for \(k=5, 6, 7\). We note that, as for the discrete approximation based on the minimization of the Anderson-Darling distance, the support points are symmetric around zero and the probabilities satisfy \(p_j=p_{k-j+1}\), \(j=1,\dots ,k\).

Table 4 5-, 6- and 7-point discrete approximation of a standard normal based on the minimization of the Cramér distance

2.4.2 Exponential

In the case of an exponential rv, with cdf \(F(x)=1-e^{-\lambda x}\), \(x>0\), and quantile function \(F^{-1}(u)=-\frac{\log (1-u)}{\lambda }\), \(0<u<1\), the first-order condition on the \(Q_i\) can be written as

$$\begin{aligned}&\int _{F^{-1}(q_i)}^{F^{-1}(q_{i+1})}(1-e^{-\lambda x})\text {d}x=Q_i\left[ -\frac{\log (1-q_{i+1})}{\lambda }+\frac{\log (1-q_i)}{\lambda }\right] \Longrightarrow \\&\left[ x+\frac{e^{-\lambda x}}{\lambda }\right] _{-\frac{\log (1-q_i)}{\lambda }}^{-\frac{\log (1-q_{i+1})}{\lambda }}=\frac{Q_i}{\lambda }\log \frac{1-q_{i}}{1-q_{i+1}}\Longrightarrow \\&\frac{1}{\lambda }[\log (1-q_i)-\log (1-q_{i+1})+1-q_{i+1}-(1-q_i)]=\frac{Q_i}{\lambda }\log \frac{1-q_{i}}{1-q_{i+1}}, \end{aligned}$$

from which

$$\begin{aligned} Q_i = 1-\frac{q_{i+1}-q_i}{\log \frac{1-q_i}{1-q_{i+1}}}. \end{aligned}$$

Table 5 displays the values \((x_i,p_i)\) of the best k-point approximation of an exponential rv with unit rate parameter for some values of k. Empirical inspection shows that this approximation has a decreasing pmf, which thus mirrors the decreasing trend of the pdf; by contrast, the discrete approximation derived from the minimization of the Anderson-Darling distance has an increasing-then-decreasing pmf, and the one based on the minimization of the Cramér-von Mises distance has a constant pmf. For the exponential distribution with unit rate parameter, the optimal 7-point discrete approximation has an expected value equal to 0.972, a variance equal to 0.804, a skewness equal to 1.313, and a kurtosis equal to 4.089. These values are closer to the corresponding ones for the exponential distribution than those of the discrete approximations obtained from the minimization of the Cramér-von Mises and Anderson-Darling distances.
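As an illustration of how the closed-form condition for the exponential case can be used, the following Python sketch alternates it with the midpoint condition on the \(q_i\). It is not the authors' R code: the alternating scheme is assumed to mirror the procedure of Sect. 2.3, and the function names `exp_Q` and `cramer_exponential` are hypothetical.

```python
import math


def exp_Q(q_lo, q_hi):
    """Q_i = 1 - (q_{i+1} - q_i) / log((1 - q_i) / (1 - q_{i+1}))."""
    return 1.0 - (q_hi - q_lo) / math.log((1.0 - q_lo) / (1.0 - q_hi))


def cramer_exponential(k, rate=1.0, tol=1e-12, max_iter=100_000):
    """k-point approximation of an exponential rv (assumed alternating scheme)."""
    Q = [i / k for i in range(k + 1)]  # Q_0 = 0 and Q_k = 1 stay fixed
    for _ in range(max_iter):
        q = [(Q[i] + Q[i - 1]) / 2 for i in range(1, k + 1)]
        new_Q = [0.0] + [exp_Q(q[i - 1], q[i]) for i in range(1, k)] + [1.0]
        shift = max(abs(u - v) for u, v in zip(Q, new_Q))
        Q = new_Q
        if shift < tol:
            break
    q = [(Q[i] + Q[i - 1]) / 2 for i in range(1, k + 1)]
    x = [-math.log(1.0 - t) / rate for t in q]      # quantile transformation
    p = [Q[i] - Q[i - 1] for i in range(1, k + 1)]
    return x, p, q, Q


if __name__ == "__main__":
    x, p, _, _ = cramer_exponential(7)
    mean = sum(pi * xi for xi, pi in zip(x, p))
    var = sum(pi * (xi - mean) ** 2 for xi, pi in zip(x, p))
    print("pmf:", [round(pi, 4) for pi in p])
    print("mean:", round(mean, 3), "variance:", round(var, 3))
```

Note that the rate \(\lambda \) enters only through the quantile transformation: the condition on the \(Q_i\) does not depend on it. The printed pmf, mean and variance can be compared with the shape and the values reported in the text.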

Table 5 5-, 6- and 7-point optimal discrete approximations of an exponential with unit rate parameter based on the minimization of Cramér distance

2.4.3 Cauchy

In case of a standard Cauchy rv, with cdf \(F(x)=\frac{1}{\pi }\arctan x+\frac{1}{2}\), \(x\in {\mathbb {R}}\), we have that the first-order condition on the \(Q_i\) provides

$$\begin{aligned}&\int _{\tan [\pi (q_i-1/2)]}^{\tan [\pi (q_{i+1}-1/2)]}\left( \frac{1}{\pi }\arctan x+\frac{1}{2}\right) \text {d}x\\&= Q_i\left\{ \tan [\pi (q_{i+1}-1/2)] - \tan [\pi (q_i-1/2)]\right\} \end{aligned}$$

and then

$$\begin{aligned}&\left[ \frac{1}{\pi }\left( x\arctan x-\frac{1}{2}\log (1+x^2)\right) +\frac{x}{2}\right] _{\tan [\pi (q_i-1/2)]}^{\tan [\pi (q_{i+1}-1/2)]}\\&\quad =Q_i\left\{ \tan [\pi (q_{i+1}-1/2)] - \tan [\pi (q_i-1/2)]\right\} , \end{aligned}$$

from which

$$\begin{aligned}&Q_i = \frac{q_{i+1}\tan [\pi (q_{i+1}-1/2)] - q_{i}\tan [\pi (q_{i}-1/2)] -\frac{1}{2\pi }\log \frac{1+\tan ^2 [\pi (q_{i+1}-1/2)]}{1+\tan ^2 [\pi (q_i-1/2)]}}{\tan [\pi (q_{i+1}-1/2)] - \tan [\pi (q_i-1/2)]}. \end{aligned}$$

We highlight that it is possible to find a k-point discrete approximation for the Cauchy distribution by minimizing the Cramér distance (as well as the Cramér-von Mises or Anderson-Darling distance), although this distribution does not possess any finite positive integer moment. Hence any discretization technique based on some form of moment equalization (among others, moment matching, which requires equalization of the first \(2k-1\) moments, but also the technique presented by Drezner and Zerom (2016)) would not be applicable at all.

2.4.4 Logistic

If we consider the standard logistic distribution with cdf \(F(x)=(1+e^{-x})^{-1}\), \(x\in {\mathbb {R}}\), then its inverse cdf is \(F^{-1}(u)=\log \frac{u}{1-u}\), \(0<u<1\), and the first-order condition on the \(Q_i\) can be written as:

$$\begin{aligned}&\int _{\log \frac{q_i}{1-q_i}}^{\log \frac{q_{i+1}}{1-q_{i+1}}} \frac{1}{1+e^{-x}}\text {d}x = Q_i\left[ \log \frac{q_{i+1}}{1-q_{i+1}} - \log \frac{q_i}{1-q_i} \right] \end{aligned}$$

from which

$$\begin{aligned} \Big [ \log (1+e^x)\Big ]_{\log \frac{q_i}{1-q_i}}^{\log \frac{q_{i+1}}{1-q_{i+1}}}&= \log \frac{1}{1-q_{i+1}} - \log \frac{1}{1-q_i} \\&= Q_i\left[ \log \frac{q_{i+1}}{1-q_{i+1}} - \log \frac{q_i}{1-q_i} \right] \end{aligned}$$

and then

$$\begin{aligned} Q_i&=\frac{\log \frac{1-q_i}{1-q_{i+1}}}{\log \frac{q_{i+1}(1-q_i)}{(1-q_{i+1})q_i}}=\frac{\log \frac{1-q_i}{1-q_{i+1}}}{\log \frac{q_{i+1}}{q_i}+\log \frac{1-q_i}{1-q_{i+1}}}. \end{aligned}$$

We note that the latter condition is the same as condition (6), obtained for the optimal k-point approximation based on the minimization of Anderson-Darling distance. This occurs since for the logistic distribution the equality \(f(x)[F(x)(1-F(x))]^{-1}=1\) holds for any \(x\in {\mathbb {R}}\), and hence the Cramér distance coincides with the Anderson-Darling distance.
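A quick numerical check of the closed form can be reassuring. The Python sketch below (hypothetical helper names, composite Simpson rule chosen only for simplicity) compares the closed-form \(Q_i\) for the logistic distribution with the value obtained by integrating the cdf numerically in the first-order condition.

```python
import math


def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))


def logistic_quantile(u):
    return math.log(u / (1.0 - u))


def logistic_Q(q_lo, q_hi):
    """Closed-form Q_i derived above for the standard logistic distribution."""
    num = math.log((1.0 - q_lo) / (1.0 - q_hi))
    return num / (math.log(q_hi / q_lo) + num)


def simpson(f, a, b, n=2000):
    """Composite Simpson rule with an even number n of sub-intervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    total += 4 * sum(f(a + (2 * j - 1) * h) for j in range(1, n // 2 + 1))
    total += 2 * sum(f(a + 2 * j * h) for j in range(1, n // 2))
    return total * h / 3


def Q_by_quadrature(q_lo, q_hi):
    """Solve the first-order condition for Q_i by numerical integration of F."""
    a, b = logistic_quantile(q_lo), logistic_quantile(q_hi)
    return simpson(logistic_cdf, a, b) / (b - a)


if __name__ == "__main__":
    print(logistic_Q(0.2, 0.7), Q_by_quadrature(0.2, 0.7))
```

The same comparison can be repeated for any other family by swapping in its cdf and quantile function, which is a cheap way to validate a newly derived closed form.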

Similarly to what happens with the normal distribution, if we consider a non-standard logistic distribution with location parameter \(\mu \) and scale parameter \(\sigma \), with cdf \(F(x;\mu ,\sigma )=\frac{1}{1+e^{-\left( \frac{x-\mu }{\sigma }\right) }}\) and quantile function \(F^{-1}(u;\mu ,\sigma )=\mu +\sigma \log \frac{u}{1-u}\), it is straightforward to see that the first-order condition on \(Q_i\) would remain the same.

2.4.5 Lomax or Pareto

For the Lomax distribution with scale parameter \(\lambda >0\) and shape parameter \(\alpha >0\), the expression of the cdf is \(F(x) = 1- \left( 1+\frac{x}{\lambda } \right) ^{-\alpha }\), \(x>0\); the quantile function is \(F^{-1}(u)=\lambda [(1-u)^{-1/\alpha }-1]\). The first-order condition on \(Q_i\), if \(\alpha \ne 1\), is

$$\begin{aligned}&\left[ x-\frac{\lambda }{1-\alpha } \left( 1+\frac{x}{\lambda }\right) ^{1-\alpha } \right] _{\lambda [(1-q_i)^{-1/\alpha }-1]}^{\lambda [(1-q_{i+1})^{-1/\alpha }-1]}\\&\quad =Q_i\left\{ \lambda [(1-q_{i+1})^{-1/\alpha }-1]-\lambda [(1-q_i)^{-1/\alpha }-1]\right\} , \end{aligned}$$

from which

$$\begin{aligned} Q_i = 1 - \frac{1}{1-\alpha }\frac{(1-q_{i+1})^{(\alpha -1)/\alpha } - (1-q_{i})^{(\alpha -1)/\alpha }}{(1-q_{i+1})^{-1/\alpha } - (1-q_{i})^{-1/\alpha }}. \end{aligned}\tag{12}$$

In particular, if \(\alpha =2\), formula (12) reduces to the following expression

$$\begin{aligned} Q_i = 1 - \sqrt{(1-q_i)(1-q_{i+1})}. \end{aligned}$$

If \(\alpha =1\), the first-order condition on the \(Q_i\) can be written as

$$\begin{aligned}&\lambda Q_i[(1-q_{i+1})^{-1} - (1-q_i)^{-1}] =\left[ x-\lambda \log \left( 1+\frac{x}{\lambda } \right) \right] _{\lambda [(1-q_i)^{-1}-1]}^{\lambda [(1-q_{i+1})^{-1}-1]}, \end{aligned}$$

from which

$$\begin{aligned} Q_i = 1 - \frac{\log \frac{1-q_i}{1-q_{i+1}}}{(1-q_{i+1})^{-1}-(1-q_i)^{-1}} \end{aligned}$$
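The two Lomax cases can be gathered in a single helper. The following Python sketch (the name `lomax_Q` is hypothetical) implements the expressions above; the scale parameter \(\lambda \) does not appear because it cancels out of the first-order condition.

```python
import math


def lomax_Q(q_lo, q_hi, alpha):
    """First-order condition on Q_i for the Lomax distribution (any scale)."""
    u, v = 1.0 - q_lo, 1.0 - q_hi  # survival probabilities, with u > v
    if alpha == 1.0:
        # logarithmic case derived for alpha = 1
        return 1.0 - math.log(u / v) / (1.0 / v - 1.0 / u)
    e = (alpha - 1.0) / alpha
    num = v ** e - u ** e
    den = v ** (-1.0 / alpha) - u ** (-1.0 / alpha)
    return 1.0 - num / ((1.0 - alpha) * den)


if __name__ == "__main__":
    # for alpha = 2 the general formula should agree with 1 - sqrt((1-q_i)(1-q_{i+1}))
    print(lomax_Q(0.2, 0.6, 2.0), 1.0 - math.sqrt(0.8 * 0.4))
    # values of alpha close to 1 should approach the logarithmic case
    print(lomax_Q(0.2, 0.6, 1.0 + 1e-6), lomax_Q(0.2, 0.6, 1.0))
```

The function only evaluates the condition on \(Q_i\); it does not check whether the Cramér distance itself is finite for the chosen \(\alpha \).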

If \(\alpha \le 1/2\), it can be proved that the Cramér distance cannot be computed for any feasible value of \(\lambda \) and k. In fact, the integrand of the last integral in (8), \(\int _{q_k}^1 (1-t)^2/f(F^{-1}(t))\text {d}t\), turns out to be proportional to \((1-t)^{1-1/\alpha }\); since \(1-1/\alpha \le -1\) when \(\alpha \le 1/2\), the (improper) integral is not finite for any feasible value of \(q_k\). We highlight that the Lomax distribution does not possess any finite integer moment for \(\alpha \le 1\); for \(1/2<\alpha \le 1\), the proposed procedure is therefore still able to produce a k-point discrete approximation even though the distribution does not possess a finite expectation.

We would obtain the same results if we considered the Pareto rv with cdf \(F(x)=1-\left( \frac{\lambda }{x}\right) ^\alpha \), \(x>\lambda \), by simply substituting x with \(\lambda +x\).

In general, it is not possible to obtain a closed-form optimal solution, except for small values of k. For example, if \(\lambda =1\) and \(\alpha =2\), for \(k=2\), the conditions on the \(q_i\) are \(q_1=Q_1/2\) and \(q_2=(1+Q_1)/2\); the condition on \(Q_1\) is \(Q_1=1-\sqrt{(1-q_1)(1-q_2)}\). Substituting the first two expressions for \(q_1\) and \(q_2\) into the last condition, we obtain, after a few simple steps, \(1-Q_1=\sqrt{\frac{1}{2}+\frac{Q_1^2}{4}-\frac{3Q_1}{4}}\), and hence a quadratic equation in \(Q_1\) whose unique feasible solution is \(Q_1=\frac{2}{3}\). Consequently, \(q_1=\frac{1}{3}\) and \(q_2=\frac{5}{6}\); the best discrete approximation consists of the points \(x_1=\sqrt{\frac{3}{2}}-1\) and \(x_2=\sqrt{6}-1\) with probabilities \(p_1=2/3\) and \(p_2=1/3\).
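For completeness, the algebra leading to the second-order equation in \(Q_1\) can be spelled out as follows; this is only a worked check of the expressions stated above. Squaring both sides and multiplying by 4 gives

$$\begin{aligned} (1-Q_1)^2&=\frac{1}{2}+\frac{Q_1^2}{4}-\frac{3Q_1}{4},\\ 4-8Q_1+4Q_1^2&=2+Q_1^2-3Q_1,\\ 3Q_1^2-5Q_1+2&=(3Q_1-2)(Q_1-1)=0. \end{aligned}$$

The root \(Q_1=1\) would leave no probability for the second point, so \(Q_1=\frac{2}{3}\) is the only feasible root.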

2.4.6 Power function

The cdf of a power function rv is \(F(x)=(x/b)^c\), \(0<x<b\), \(b>0\), \(c>0\). The corresponding quantile function is \(F^{-1}(u)=bu^{1/c}\), \(0<u<1\).

The first-order condition on \(Q_i\) turns into

$$\begin{aligned} \left[ \frac{x^{c+1}}{b^c(c+1)}\right] _{bq_i^{1/c}}^{bq_{i+1}^{1/c}}=Q_i\left[ bq_{i+1}^{1/c} - bq_i^{1/c}\right] \end{aligned}$$

from which

$$\begin{aligned} Q_i = \frac{1}{c+1}\cdot \frac{q_{i+1}^{(c+1)/c} - q_i^{(c+1)/c}}{q_{i+1}^{1/c}-q_i^{1/c}}. \end{aligned}$$

If \(c=1\) (corresponding to the case of a uniform distribution) then \(Q_i=(q_i+q_ {i+1})/2\) and we obviously obtain the same solution as in Sect. 2.2. If \(c=2\), then we have

$$\begin{aligned} Q_i&= \frac{1}{3}\frac{q_{i+1}^{3/2}-q_i^{3/2}}{q_{i+1}^{1/2}-q_i^{1/2}}=\frac{1}{3}\frac{(q_{i+1}^{3/2} -q_i^{3/2})(q_{i+1}^{1/2}+q_i^{1/2})}{q_{i+1}-q_i}\\&=\frac{1}{3}\frac{(q_{i+1}-q_i)(q_i+q_{i+1}+\sqrt{q_iq_{i+1}})}{q_{i+1}-q_i}=\frac{1}{3}(q_i+q_{i+1}+\sqrt{q_iq_{i+1}}); \end{aligned}$$

thus the cumulative probability \(Q_i\) of the optimal solution is the arithmetic mean of three quantities: the two consecutive probabilities \(q_i\) and \(q_{i+1}\), and their geometric mean \(\sqrt{q_iq_{i+1}}\).

It can be numerically shown that for values of c larger than 1 (when the pdf is increasing), the k-point discrete approximation has an increasing pmf; on the contrary, for \(0<c<1\) (when the pdf is decreasing), the k-point discrete approximation has a decreasing pmf.

In general, it is not possible to derive the optimal solution in closed form, except for small values of k. For example, if \(b=1\) and \(c=2\), for \(k=2\), the conditions on the \(q_i\) are \(q_1=Q_1/2\) and \(q_2=(1+Q_1)/2\); the condition on \(Q_1\) is \(Q_1=\frac{1}{3}(q_1+q_2+\sqrt{q_1q_2})\). Substituting the first two expressions for \(q_1\) and \(q_2\) into the last equation, we obtain, after a few simple steps, \(4Q_1-1=\sqrt{Q_1(1+Q_1)}\), and hence the quadratic equation \(15Q_1^2-9Q_1+1=0\), whose unique feasible solution is \(Q_1=\frac{9+\sqrt{21}}{30}\). Consequently, \(q_1=\frac{9+\sqrt{21}}{60}\) and \(q_2=\frac{39+\sqrt{21}}{60}\); the best discrete approximation consists of the points \(x_1=\sqrt{\frac{9+\sqrt{21}}{60}}\) and \(x_2=\sqrt{\frac{39+\sqrt{21}}{60}}\) with probabilities \(p_1=\frac{9+\sqrt{21}}{30}\) and \(p_2=\frac{21-\sqrt{21}}{30}\).
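The two-point closed form can also be recovered numerically. The Python sketch below (hypothetical names `power_Q` and `two_point_power`) iterates the condition on \(Q_1\) after substituting the conditions on \(q_1\) and \(q_2\); the fixed-point scheme and its starting value are assumptions made for illustration.

```python
import math


def power_Q(q_lo, q_hi, c):
    """First-order condition on Q_i for the power function distribution."""
    num = q_hi ** ((c + 1) / c) - q_lo ** ((c + 1) / c)
    den = q_hi ** (1 / c) - q_lo ** (1 / c)
    return num / ((c + 1) * den)


def two_point_power(c, b=1.0, tol=1e-13, max_iter=10_000):
    """Fixed-point iteration on Q_1 for k = 2 (illustrative scheme)."""
    Q1 = 0.5  # arbitrary starting value
    for _ in range(max_iter):
        q1, q2 = Q1 / 2, (1 + Q1) / 2  # conditions on the q_i with Q_0 = 0, Q_2 = 1
        new_Q1 = power_Q(q1, q2, c)
        converged = abs(new_Q1 - Q1) < tol
        Q1 = new_Q1
        if converged:
            break
    q1, q2 = Q1 / 2, (1 + Q1) / 2
    points = (b * q1 ** (1 / c), b * q2 ** (1 / c))  # quantile transformation
    probs = (Q1, 1 - Q1)
    return points, probs


if __name__ == "__main__":
    points, probs = two_point_power(2.0)
    print(points, probs)
    print("closed form for Q_1:", (9 + math.sqrt(21)) / 30)
```

For \(c=1\) the same function returns the uniform-case solution, with \(Q_1=1/2\) and support points 1/4 and 3/4 when \(b=1\).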

2.5 Remarks

2.5.1 The case \(k=1\)

So far we have implicitly assumed that the number of points k by which one approximates the assigned continuous distribution is greater than 1. It is immediate to see that, if one wants to approximate the distribution of the rv X through a single value, then, except for the cases where the selected statistical distance cannot be computed (see Sect. 2.4.5), the “optimal” approximating value \(x_1\) is the median of F, \(F^{-1}(0.5)\): the only effective condition to be satisfied (for all three distances examined) is \(q_1=(Q_0+Q_1)/2=1/2\), with \(Q_0=0\) and \(Q_1=1\). We note that optimal quantization (Lloyd 1982) would instead return the expected value \({\mathbb {E}}_F(X)\), provided that it exists and is finite, as the optimal approximating value of F.

2.5.2 Location-scale transformation

Let us consider a continuous rv \(X\sim F_X\) and its optimal k-point approximation derived from the minimization of the Cramér-von Mises or Anderson-Darling distance, \(x_i=F_X^{-1}(q_i^*)\), \(p_i^*=Q_i^*-Q_{i-1}^*\). We know that the optimal \(q_i^*\) and \(Q_i^*\) are functions only of k and of the selected distance, and do not depend on the specific \(F_X\). Now consider the location-scale transformation \(Y=a+bX\), \(a\in {\mathbb {R}}\), \(b>0\). Since for such a transformation we have \(F_Y(y)=F_X(\frac{y-a}{b})\) for any real y and \(F^{-1}_Y(u)=a+bF_X^{-1}(u)\), \(0<u<1\), the optimal k-point approximation of Y (derived by using the same distance as for X) is given by \(y_i=F_Y^{-1}(q_i^*)=a+bF_X^{-1}(q_i^*)=a+bx_i\) with the same \(p_i^*\) as before, which means that the location-scale transformation carries over to the discretized rv.

This property does not hold in general for the discrete approximation based on the minimization of the Cramér distance: the optimal values \(q_i^*\) and \(Q_i^*\) now depend on the specific form of the cdf, and \(F_Y\) may not belong to the same family as \(F_X\). The exception is when \(F_X\) belongs to a location-scale family of distributions: in this case, the property still holds, since condition (9) does not change under a location-scale transformation of the cdf (recall the examples with the normal and logistic distributions in Sect. 2.4).

3 Example of application

Often a researcher in the statistical field is required to determine the distribution (or parameters) of some complex function of several (independent and continuous) rvs \(T=t(X_1,X_2,\dots ,X_d)\). Multidimensional integration techniques should be employed, but often, due to the complexity of t and to the high dimensionality d, they are either cumbersome to apply or not applicable at all. One can then resort to Monte Carlo simulation, i.e., simulating (independently) a huge number N of pseudo-random values from \(X_1,X_2,\dots ,X_d\) and then calculating the corresponding N values of the transformation t, which can be regarded as a random sample drawn from T. An alternative is to find an approximate solution via approximation-by-discretization and enumeration. This consists of substituting each \(X_i\) with a discrete approximation \({\hat{X}}_i\) and then determining the corresponding pmf of \({\hat{T}}=t({\hat{X}}_1,{\hat{X}}_2,\dots ,{\hat{X}}_d)\) by “enumeration”, based on the joint pmf of \(({\hat{X}}_1,{\hat{X}}_2,\dots ,{\hat{X}}_d)\) over the Cartesian product of the single supports: \({\mathcal {S}}({\hat{X}}_1)\times \dots \times {\mathcal {S}}({\hat{X}}_d)\).
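The enumeration step can be sketched as follows in Python; the function name `enumerate_pmf` and its interface are hypothetical, and the sketch assumes independent components, so that the joint pmf is the product of the marginal pmfs.

```python
from collections import defaultdict
from itertools import product
from math import prod


def enumerate_pmf(t, marginals):
    """pmf of t(X_1-hat, ..., X_d-hat) for independent discrete components.

    marginals: list of (points, probabilities) pairs, one per component.
    """
    pmf = defaultdict(float)
    supports = [list(zip(points, probs)) for points, probs in marginals]
    # loop over the Cartesian product of the single supports
    for combo in product(*supports):
        value = t(*(x for x, _ in combo))
        pmf[value] += prod(p for _, p in combo)  # joint probability under independence
    return dict(sorted(pmf.items()))


if __name__ == "__main__":
    coin = ([0, 1], [0.5, 0.5])
    print(enumerate_pmf(lambda a, b: a + b, [coin, coin]))
```

With k points per component the loop visits \(k^d\) combinations, which is the cost to be weighed against Monte Carlo simulation. Values of t are used directly as dictionary keys, so results that differ only by floating-point rounding are kept as separate atoms.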

In order to illustrate and compare the discretization techniques proposed in Sect. 2, in the following subsection we will consider a case where it is possible to derive the exact distribution of T analytically, which allows us to evaluate the statistical performance (in terms of degree of approximation, to be measured through some index) of each technique; in Sect. 3.2, we will analyze a more complicated case where the exact distribution of T is not analytically computable, but a parameter of interest can be recovered numerically.

3.1 Sum of exponential random variables

Let \(X_1\sim \text {Gamma}(\alpha _1,\lambda )\), \(X_2\sim \text {Gamma}(\alpha _2,\lambda )\), ..., \(X_d\sim \text {Gamma}(\alpha _d,\lambda )\) be independent of each other; it is well known that the sum \(S=\sum _{i=1}^d X_i\) is \(\text {Gamma}(\sum _{i=1}^d \alpha _i,\lambda )\). Nevertheless, let us approximate each \(X_i\) through one of the discrete approximations we illustrated, and then reconstruct the approximate distribution of the sum, \({\hat{F}}_S\). We can then evaluate the degree of approximation by using a measure of discrepancy between \(F_S\) and \({\hat{F}}_S\); here we use the KS distance, rather than one of the statistical distances employed for the univariate approximation. We consider the sum of \(d=3\) exponential rvs, with \(\lambda _i=1/2\), \(\alpha _i=1\), \(i=1,2,3\), for which we know that \(S\sim \text {Gamma}(3,1/2)\). Table 6 reports, for different values of k (from 5 to 11), the KS distance between the exact and the approximate cdf of S, obtained according to the three univariate discrete approximations for the \(X_i\). The KS distance decreases with k for all the methods: by increasing the number of approximating points for the \(X_i\), it is legitimate to expect that the discrepancy between the true and the approximate cdf decreases, independently of the measure used. Moreover, we notice that the discrete approximation based on the Cramér distance outperforms the other two approximations, and its relative level of accuracy quickly improves with k: for \(k=11\), the KS distance provided by the approximation based on the minimization of the Cramér distance is much less than one half of the KS distance provided by the approximation based on the minimization of the Cramér-von Mises distance. As we already remarked, the discrete approximation of the exponential distribution based on the minimization of the Cramér distance seems to resemble the shape of the continuous pdf better than the discrete approximations based on the Cramér-von Mises and Anderson-Darling distances, and this is also reflected positively in the approximate distribution of the sum of three i.i.d. exponential rvs.
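The comparison for the Cramér-based approximation can be reproduced in outline with the Python sketch below. It is an illustration under stated assumptions, not the authors' code: the univariate approximation is obtained by alternating the two first-order conditions (assumed to mirror Sect. 2.3), the Gamma(3, 1/2) cdf is written in its closed form for integer shape, and all function names are hypothetical.

```python
import math
from itertools import product


def exp_Q(q_lo, q_hi):
    """Closed-form condition on Q_i for the exponential distribution."""
    return 1.0 - (q_hi - q_lo) / math.log((1.0 - q_lo) / (1.0 - q_hi))


def cramer_exponential(k, rate, tol=1e-12, max_iter=100_000):
    """Support points and probabilities of the k-point approximation."""
    Q = [i / k for i in range(k + 1)]
    for _ in range(max_iter):
        q = [(Q[i] + Q[i - 1]) / 2 for i in range(1, k + 1)]
        new_Q = [0.0] + [exp_Q(q[i - 1], q[i]) for i in range(1, k)] + [1.0]
        shift = max(abs(u - v) for u, v in zip(Q, new_Q))
        Q = new_Q
        if shift < tol:
            break
    q = [(Q[i] + Q[i - 1]) / 2 for i in range(1, k + 1)]
    x = [-math.log(1.0 - t) / rate for t in q]
    p = [Q[i] - Q[i - 1] for i in range(1, k + 1)]
    return x, p


def gamma3_cdf(s, rate):
    """cdf of a Gamma rv with shape 3: 1 - exp(-z)(1 + z + z^2/2), z = rate * s."""
    z = rate * s
    return 1.0 - math.exp(-z) * (1.0 + z + z * z / 2.0)


def ks_sum_of_three(k, rate=0.5):
    """KS distance between the exact cdf of S and its enumeration-based cdf."""
    x, p = cramer_exponential(k, rate)
    atoms = sorted(
        (x[i] + x[j] + x[l], p[i] * p[j] * p[l])
        for i, j, l in product(range(k), repeat=3)
    )
    ks, cum = 0.0, 0.0
    for s, w in atoms:
        F = gamma3_cdf(s, rate)
        ks = max(ks, abs(F - cum))  # gap just before the jump at s
        cum += w
        ks = max(ks, abs(F - cum))  # gap just after the jump at s
    return ks


if __name__ == "__main__":
    for k in (5, 7, 9, 11):
        print(k, round(ks_sum_of_three(k), 4))
```

Because the exact cdf is continuous and increasing while the approximate cdf is a step function, the supremum of the absolute difference is attained at a jump point, either just before or just after the jump; this is why only the atoms need to be scanned.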

The graph in Fig. 2 displays the curve of the true cdf of the sum, and the step-wise cdfs of the three discrete approximations. It is quite apparent that the latter struggle to follow the continuous curve in its right tail: it is visible to the naked eye that, for relatively high values of x (say, between 10 and 15), the three approximations show their maximum absolute error (which corresponds to the value of the KS distance).

Table 6 KS distance between the true and the approximated distribution function of the sum of three independent exponential rvs
Fig. 2 Comparison between the true distribution function of the sum of exponential rvs and its three approximations (based on 7-point univariate discrete approximations); in the right panel, we zoom in on the right tail of the distribution

3.2 Reliability parameter for a hollow rectangular tube

In Sect. 1, we mentioned the problem of recovering the so-called reliability parameter \(R=P(X>Y)\) for a stress-strength model, where X and Y are the strength and stress rvs, respectively, possibly depending on several stochastic sub-factors. Let us consider the following example. The functional form of the shear stress of a hollow rectangular tube is \(Y=\frac{M}{2t(W-t)(H-t)}\), where M is the applied torque, t is the wall thickness, W is the width, and H is the height of the tube. Let X, M, t, W, and H be mutually independent rvs. Consider the following parametric set-up: \(M\sim {\mathcal {N}}(\mu _M=1500,\sigma _M=150)\), \(t\sim {\mathcal {N}}(\mu _t=0.2,\sigma _t=0.005)\), \(W\sim {\mathcal {N}}(\mu _W=2,\sigma _W=0.02)\), \(H\sim {\mathcal {N}}(\mu _H=3,\sigma _H=0.03)\) (with \({\mathcal {N}}\) denoting the normal distribution), which accords with the set-up in Roy and Dasgupta (2001). Let the standard deviation of the normal strength X be 60, and assume that the mean of X can vary from 520 to 970 units by steps of 10, thus generating an array of \(n=46\) scenarios.

Although the exact evaluation of R is unfeasible here, its value can be obtained numerically, as suggested by a referee, in the following way. Denoting by \(\phi _{\mu ,\sigma }\) and \(\varPhi _{\mu ,\sigma }\) the pdf and cdf, respectively, of a normal rv with mean \(\mu \) and standard deviation \(\sigma \), putting

$$\begin{aligned} f(t,m,w,h)&= \left[ 1-\varPhi _{\mu _X,\sigma _X}\left( \frac{m}{2t(w-t)(h-t)}\right) \right] \\&\cdot \phi _{\mu _t,\sigma _t}(t)\phi _{\mu _M,\sigma _M}(m)\phi _{\mu _W,\sigma _W}(w)\phi _{\mu _H,\sigma _H}(h) \end{aligned}$$

for \((t,m,w,h)\in {\mathbb {R}}^4\), the reliability parameter R can be expressed as

$$\begin{aligned} R=\int \int \int \int _{{\mathbb {R}}^4} f(t,m,w,h)\text {d}t\text {d}m\text {d}w\text {d}h, \end{aligned}$$

and can be approximated by

$$\begin{aligned} R_Q=\int \int \int \int _{Q} f(t,m,w,h)\text {d}t\text {d}m\text {d}w\text {d}h, \end{aligned}$$

where Q is the hyper-rectangle defined as \(Q=I_t\times I_M\times I_W\times I_H\), with \(I_t=[\mu _t-\gamma _t\sigma _t,\mu _t+\gamma _t\sigma _t]\), \(I_M=[\mu _M-\gamma _M\sigma _M,\mu _M+\gamma _M\sigma _M]\), \(I_W=[\mu _W-\gamma _W\sigma _W,\mu _W+\gamma _W\sigma _W]\), \(I_H=[\mu _H-\gamma _H\sigma _H,\mu _H+\gamma _H\sigma _H]\), and \(\gamma _t\), \(\gamma _M\), \(\gamma _W\), \(\gamma _H\) are positive scale factors to be chosen suitably large. The computation of \(R_Q\) is easily done by using the function cuhre that is included in the package cubature (Narasimhan et al. 2022) of the statistical software environment R (R Core Team 2022).

Alternatively, one can resort either to Monte Carlo simulation or to the approximation-by-discretization approach, by using one of the methods described in the previous sections.

We will regard the value of reliability recovered by numerical evaluation through the cuhre function (with all the \(\gamma \) scaling factors set equal to 9) as the actual (true) value and thus indicate it simply with R; the value obtained through Monte Carlo simulation is denoted by \({\hat{R}}^{MC}\); the values obtained by the approximation-by-discretization approaches are denoted by \({\hat{R}}\), with a superscript identifying the specific discretization: GQ (Gaussian Quadrature), DZ (Drezner and Zerom 2016), CvM (Cramér-von Mises), C (Cramér), AD (Anderson-Darling). In order to measure the degree of accuracy of each technique, one can consider the following summary measures: Mean Deviation (MD), Mean Absolute Deviation (MAD), and Root Mean Squared Deviation (RMSD), defined as follows,

$$\begin{aligned} \text {MD}&=\frac{1}{n}\sum _{i=1}^n ({\hat{R}}_i-R_i)\\ \text {MAD}&=\frac{1}{n}\sum _{i=1}^n |{\hat{R}}_i-R_i|\\ \text {RMSD}&=\sqrt{\frac{1}{n}\sum _{i=1}^n ({\hat{R}}_i-R_i)^2} \end{aligned}$$

where the subscript i refers to the i-th scenario.
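The three measures translate directly into code. The short Python sketch below uses the hypothetical name `accuracy_measures` and takes the estimated and the reference reliabilities over the scenarios as two equally long sequences.

```python
import math


def accuracy_measures(estimates, truths):
    """Return (MD, MAD, RMSD) of the estimates with respect to the reference values."""
    if len(estimates) != len(truths) or not estimates:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [r_hat - r for r_hat, r in zip(estimates, truths)]
    n = len(diffs)
    md = sum(diffs) / n                                  # signed errors may cancel out
    mad = sum(abs(d) for d in diffs) / n
    rmsd = math.sqrt(sum(d * d for d in diffs) / n)
    return md, mad, rmsd


if __name__ == "__main__":
    print(accuracy_measures([0.5, 0.7], [0.4, 0.8]))
```

MD can be close to zero even for an inaccurate method when positive and negative deviations offset each other, which is why it is read together with MAD and RMSD.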

We considered \(N=10\) million pseudo-random simulations for the Monte Carlo approach, and \(k=5\) approximating points for the discretization approaches.
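A Monte Carlo estimate of R for a single scenario can be sketched as follows in Python. The parametric set-up is the one stated above, but the scheme is an assumption made for illustration and not necessarily the one used for Table 7: the stress factors are simulated and the strength X is integrated out analytically through \(1-\varPhi _{\mu _X,\sigma _X}(y)\). The function name, the default number of draws and the seed are hypothetical.

```python
import random
from statistics import NormalDist


def mc_reliability(mu_x, sigma_x=60.0, n_draws=100_000, seed=12345):
    """Estimate R = P(X > Y) by averaging P(X > y) over simulated stresses y."""
    rng = random.Random(seed)
    strength = NormalDist(mu_x, sigma_x)
    total = 0.0
    for _ in range(n_draws):
        m = rng.gauss(1500.0, 150.0)   # applied torque M
        t = rng.gauss(0.2, 0.005)      # wall thickness t
        w = rng.gauss(2.0, 0.02)       # width W
        h = rng.gauss(3.0, 0.03)       # height H
        y = m / (2.0 * t * (w - t) * (h - t))  # shear stress
        total += 1.0 - strength.cdf(y)          # P(X > y) for the normal strength
    return total / n_draws


if __name__ == "__main__":
    for mu_x in (520.0, 740.0, 970.0):
        print(mu_x, round(mc_reliability(mu_x), 4))
```

Averaging the conditional probability, rather than an indicator of the event \(X>Y\), gives an estimator with the same expectation and no larger variance; with a fixed seed the same stress draws are reused across scenarios, so the estimates are monotone in \(\mu _X\).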

The results, reported in Table 7, show that Monte Carlo simulation provides overall the best results in terms of MAD and RMSD; the proposed discretization methods based on the minimization of the Anderson-Darling and Cramér distances perform better than the Gaussian quadrature method in terms of MD, MAD, and RMSD; they perform worse than the method proposed in Drezner and Zerom (2016), which however requires a considerably greater computational effort, owing to its inner numerical minimization routine, and has a narrower applicability, as already remarked in Sect. 2.2. We remark that in this example we focused only on discretization techniques that are somehow comparable: according to their specific criterion, they all compute both the k support points and the corresponding k probabilities simultaneously. Other techniques mentioned in the Introduction, such as the maximum entropy method, first assign the support points a priori and then calculate only the probabilities.

Table 7 Comparative closeness study for a hollow rectangular tube

4 Software implementation

Code in the R programming environment has been developed, which implements the different routines used for finding the optimal discrete approximations. For the Cramér distance, in particular, several parametric distributions are considered. Some functions are also available for plotting discretized distributions, which makes graphical comparison to the original continuous distribution more effective.

For the Cramér-von Mises and Anderson-Darling distances (and potentially, for any other distribution-free distance), the function Discr has to be used, whose arguments are the number of points k and the type of distance (CvM or AD). It returns a list containing the vectors of \(p_i\) and \(q_i\) (along with the vector of \(Q_i\)), from which one can obtain the support values \(x_i\) through the quantile transformation of the assigned distribution function F.

For the Cramér distance, the main function is called DiscrF; its arguments are the number of points k by which we want to approximate the continuous random distribution; the type of continuous distribution (a string identifying it), along with the value of its parameters (a vector, par). At the moment, a few families of continuous distributions can be selected, namely, those discussed in Sect. 2.4, for which the first-order condition on the \(Q_i\) is available in a closed-form: normal (norm), exponential (exp), Cauchy (cauchy), logistic (logis), Lomax (lomax), power function (power). The output of the function is a list, containing the vector of probabilities \(p_i\) and the vector of support values \(x_i\) of the discrete approximation, along with the vectors of \(q_i\) and \(Q_i\).

The companion function moments receives as input a vector of support points and a vector of corresponding probabilities possibly obtained as a result from DiscrF or Discr, and computes the expectation, variance, skewness and kurtosis of the related discrete rv. This is useful if one wants to keep under control the effects of discretization over the first (normalized) moments, since we know that the techniques we introduced do not offer any guarantee, in general, that the moments of the continuous rv are preserved.
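For readers working outside R, the computation performed by such a function can be sketched in Python as below. This is a hypothetical analogue, not the authors' moments function: the name `discrete_moments` is made up, and the kurtosis is assumed to be the non-excess one (fourth standardized moment).

```python
import math


def discrete_moments(x, p):
    """Expectation, variance, skewness and (non-excess) kurtosis of a discrete rv."""
    mean = sum(pi * xi for xi, pi in zip(x, p))
    var = sum(pi * (xi - mean) ** 2 for xi, pi in zip(x, p))
    third = sum(pi * (xi - mean) ** 3 for xi, pi in zip(x, p))
    fourth = sum(pi * (xi - mean) ** 4 for xi, pi in zip(x, p))
    skewness = third / math.sqrt(var) ** 3
    kurtosis = fourth / var ** 2
    return mean, var, skewness, kurtosis


if __name__ == "__main__":
    # symmetric two-point distribution on -1 and 1
    print(discrete_moments([-1.0, 1.0], [0.5, 0.5]))
```

The function assumes a non-degenerate distribution; for a single support point the variance is zero and the standardized moments are undefined.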

The graphical function plotdist receives as its first argument the result of Discr or DiscrF and plots the corresponding discrete approximation (its pmf or its cdf, to be selected through the argument plot, to be set equal to “pmf” or “cdf”). This can be plotted over the unit square (by setting the argument xaxis equal to “q”: on the x axis the \(q_i\), on the y axis the \(p_i\) or the \(Q_i\), see Fig. 1); or on the usual \({\mathbb {R}}\times [0,1]\) space (by setting the argument xaxis equal to “x”: on the x axis the \(x_i\), which should be supplied if the first argument comes from Discr; on the y axis the \(p_i\) or the \(Q_i\)).

In the supplementary material, available at https://tinyurl.com/STPA-D-21-00382, the relevant R code along with some examples is supplied.

5 Conclusion

In this work, we discussed a class of discretization techniques that calculate a k-point discrete approximation of a continuous rv by minimizing a distance between the two cdfs. For one distance in this class, the optimal discrete approximation (i.e., both the points \(x_i\) and their probabilities \(p_i\), \(i=1,\dots ,k\)) turns out to have an analytical expression for any k. For other distances, a closed-form solution is not available, but the \(x_i\) and the \(p_i\) can be iteratively computed by alternately solving two sets of equations until convergence, in a similar fashion to optimal quantization. It may happen that the solution is “distribution-free”, meaning that the probability transforms \(F(x_i)\) and \(p_i\) do not depend on the particular distribution function F selected; or, on the contrary, that the solution directly depends on F: in the latter case, we derived the sets of equations to be satisfied by the solution for a wide array of continuous random distributions.

We underline that this class of discretization techniques represents a valid alternative to the other existing procedures, among them the well-established moment-matching technique, whose applicability is however hindered by the often unattainable hypothesis of finiteness of the first integer moments. This class is also competitive when compared to optimal quantization and modifications thereof, since for the latter the algorithm leading to the best discrete approximation is very similar, being based on the minimization of the mean squared error between the two rvs. What we cannot expect from the suggested class of discrete approximations is the preservation of the first (and second) moment; this can be seen as a shortcoming, but only if one is used to dealing with or summarizing random distributions in terms of moments or if the transformation of rvs one needs to approximate is a smooth function. In the financial or insurance fields, one is more familiar with quantiles (the most popular risk measure for market risk is the Value at Risk (VaR), which is nothing else than a quantile of a loss distribution over a fixed time horizon; see, e.g., McNeil, Frey and Embrechts 2005), and so adopting a discrete approximation that minimizes a distance between cdfs may be intuitively more appropriate and convenient. Moreover, many continuous distributions do not even possess the second or the first moment (just think of the Cauchy distribution), so moment-matching and quantization techniques would fail to provide a discrete approximation, which is instead guaranteed by the proposed class (with rare exceptions arising if the Cramér distance is considered). The practical problem presented in the third section, concerning a complex function of several random variables, illustrates how the proposed procedures can outperform the standard moment-matching approach.

Future research will examine other statistical distances between cdfs, in particular asymmetrical ones, and derive the corresponding optimal k-point discrete approximation from their minimization. We will also examine possible extensions of this class of discretization techniques to the bivariate and, more generally, d-variate case. Although such an extension may naïvely be judged straightforward, high dimensionality leads to non-negligible theoretical and computational issues: apart from the choice of the statistical distance to be minimized (an analogous quadratic distance between joint cdfs or the energy distance, see Rizzo and Székely 2016, can be considered), the most challenging matter is represented by the selection of all feasible d-variate supports for the discrete approximation (they are not necessarily a d-variate Cartesian product of univariate supports, and this complicates the evaluation of the statistical distance) and by the possible statistical dependence between the univariate components, which unavoidably calls for the use of copulas. Similar kinds of problems arise in the computation of principal points for multivariate random distributions (Flury 1990).

Another aspect that can be investigated is related to the construction of discrete approximations with “a priori” assigned support points \(x_i\). This can be useful when one wants to construct a discrete counterpart or analog to a continuous probability distribution, and not just a discrete approximation. Typically, the discrete counterpart to a continuous distribution supported on \({\mathbb {R}}\) (or \({\mathbb {R}}^+\)) is constructed as the discrete distribution supported on \({\mathbb {Z}}\) (or \({\mathbb {N}}\)) preserving the expression of the continuous cdf at the integer support points (Chakraborty 2015). Alternatively, a discrete counterpart can be constructed by considering the same integer support as above and minimizing one of the statistical distances examined in this work with respect to the probabilities \(p_i\), which now constitute a countable set. By considering any parametric random distribution with infinite support, we can thus generate several new count distributions that can be actually regarded as its discrete counterparts and can be employed for modeling count data.