1 Introduction

Verifying the quality of forecasts expressed in a probabilistic form requires specific graphical or numerical tools (Jolliffe and Stephenson 2011), among them some numerical measures of performance such as the Brier score (Brier 1950), the Kullback–Leibler divergence (Weijs et al. 2010) and many others (Winkler et al. 1996; Gneiting and Raftery 2007). When the probabilistic forecast is a cumulative distribution function (CDF) and the observation is a scalar, the continuous ranked probability score (CRPS) is often used as a quantitative measure of performance. Classically (Matheson and Winkler 1976; Hersbach 2000), the instantaneous CRPS is defined as the quadratic measure of discrepancy between the forecast CDF, denoted F, and \(\mathbb {1}(x\ge y)\), the empirical CDF of the scalar observation y

$$\begin{aligned} \hbox {crps}(F,y) = \int _{\mathbb {R}}\left[ F(x) - \mathbb {1}(x\ge y)\right] ^2\hbox {d}x, \end{aligned}$$
(INT)

where \(\mathbb {1}\) is the indicator function.

Analytic formulations of \(\hbox {crps}(F,y)\) can be derived for most classical parametric distributions, some of which are listed in Table 1. In some situations, the forecast CDF may not be fully known, such as for ensemble numerical weather prediction (NWP) or other types of Monte-Carlo simulations, or the forecast CDF may be known, but an analytic formulation of the CRPS may not be derivable. In the latter case, one may be able to sample values from F. In both situations, the forecast CDF is summarized by a set of M values \(x_{i=1,\ldots ,M}\). Following the convention in meteorology, such a set will be called here an “ensemble”, and each value \(x_i\) will be called a “member”. The instantaneous CRPS must then be estimated from this ensemble. This may be problematic when using the CRPS to compare parametric forecasts, whose CRPS may be computed exactly, with forecasts whose CRPS is estimated from the limited information about F contained in the ensemble. The unknown error in the CRPS estimation may lead to the wrong choice of the best forecast. Although meteorological vocabulary is used, this situation can occur in other fields of geosciences too, for instance when conditional simulations are used to sample from a probability distribution and choose between competing techniques or settings (Emery and Lantuéjoul 2006; Pirot et al. 2014; Yin et al. 2016).

Table 1 List of distributions for which a closed-form CRPS exists and which were used in this study

Usually, the instantaneous CRPS is averaged in space and/or time over several pairs of forecast/observation. Candille (2003) and Ferro et al. (2008) showed that when the ensemble is a random sample from F, the usual estimator of the instantaneous CRPS based on Eq. (INT), introduced later, is biased: its expectation over an infinite number of forecast/observation pairs does not give the right theoretical value. This bias stems from the limited information about F contained in an ensemble with finite size M. Several solutions have been proposed to remove this bias. Ferro (2014) introduced the notion of fair score and a formula to correct the bias in the estimation of the averaged CRPS. Müller et al. (2005) proposed two solutions to the same problem of biased estimation of the ranked probability score (RPS), the version of the CRPS for ordinal random variables. Adapted to the CRPS, their first solution would be to use an absolute value instead of a square inside the integral in Eq. (INT). As demonstrated in Appendix A, this score for an ensemble is minimized if all the members \(x_i\) equal the median of F, which is obviously not the purpose of an ensemble. Their second solution is to compute the RPS skill score against some ensemble of size M whose RPS is estimated by bootstrapping past observations. Although interesting, this solution does not allow assessing the absolute performance of the ensemble, but only the performance relative to this bootstrapped ensemble.

This study aims at improving heuristically the estimation of the average CRPS of a forecast CDF under limited information. The information is limited in two ways: (i) the CDF is known only through an ensemble as defined above, and (ii) the average CRPS is computed over a finite number of forecast/observation pairs. The problem is not to estimate the unknown forecast CDF F, but to estimate the CRPS of F under limited information about F. To improve the estimation with this limited information, the usual strategy is to correct the empirical mean score, as in Ferro (2014) or Müller et al. (2005). Here the approach is to improve the estimation of each term of the average, that is, the estimation of the instantaneous CRPS \(\hbox {crps}(F,y)\).

The rest of this paper is organized as follows. Section 2 reviews several estimators of the instantaneous CRPS proposed in the literature and demonstrates relationships among them. In particular, it is shown that the four proposed estimators reduce to two only. In Sect. 3, synthetic data are used to study the variations in accuracy of these two CRPS estimators, with the size M of the ensemble and the way this ensemble is built. These simulations lead to recommendations on the best estimation of the CRPS. Section 4 illustrates issues in CRPS estimation with two real meteorological data sets. Improvements in the inference obtained by following the recommendations from Sect. 3 are shown on these data. Section 5 gives a summary of the recommendations to get an accurate estimation of the instantaneous CRPS, concludes and discusses the results.

2 Review of Available Estimators of the CRPS

The instantaneous CRPS is defined as a quadratic discrepancy measure between the forecast CDF and the empirical CDF of the observation

$$\begin{aligned} \hbox {crps}(F,y) = \int _{\mathbb {R}}\left[ F(x) - \mathbb {1}(x\ge y)\right] ^2\hbox {d}x. \end{aligned}$$
(INT)

Equation (INT) is called the integral form of the CRPS.

Gneiting and Raftery (2007) showed that, for forecast CDFs with a finite first moment, the CRPS can be written as

$$\begin{aligned} \hbox {crps}(F,y) = {\mathbb {E}}_X|X-y| -\frac{1}{2} {\mathbb {E}}_{X,X^\prime } |X-X^\prime |, \end{aligned}$$
(NRG)

where X and \(X^\prime \) are two independent random variables distributed according to F, and \({\mathbb {E}}_A\) is the expectation according to the law of the random variable(s) A. This is called the energy form of the CRPS, since it is just the one-dimensional case of the energy score introduced by Gneiting and Raftery (2007), based on the energy distance of Székely and Rizzo (2013).

Taillardat et al. (2016) introduced a third expression of the CRPS, valid for continuous forecast CDFs

$$\begin{aligned} \hbox {crps}(F,y) = {\mathbb {E}}_X|X-y| + {\mathbb {E}}_XX - 2 {\mathbb {E}}_XXF(X), \end{aligned}$$
(PWM)

which is called the probability weighted moment (PWM) form of the CRPS because its third term is a probability weighted moment (Greenwood et al. 1979; Rasmussen 2001; Furrer and Naveau 2007).

When F is known only through an M-ensemble \(x_{i=1,\ldots ,M}\), the above definitions lead to the following estimators of the instantaneous CRPS

$$\begin{aligned} \widehat{\hbox {crps}}_{\mathrm{INT}}(M, y) = \int _{\mathbb {R}}\left[ \frac{1}{M}\sum _{i=1}^M\mathbb {1}(x\ge x_i) - \mathbb {1}(x\ge y)\right] ^2\hbox {d}x , \end{aligned}$$
(eINT)
$$\begin{aligned} \widehat{\hbox {crps}}_{\mathrm{NRG}}(M,y) = \frac{1}{M} \sum _{i=1}^M|x_i-y| - \frac{1}{2M^2}\sum _{i,j=1}^M |x_i-x_j| , \end{aligned}$$
(eNRG)
$$\begin{aligned} \widehat{\hbox {crps}}_{\mathrm{PWM}}(M, y) = \frac{1}{M}\sum _{i=1}^M|x_i-y| + {\hat{\beta }}_0 - 2{\hat{\beta }}_1 , \end{aligned}$$
(ePWM)

respectively, where \({\mathbb {E}}_XX\) is estimated by \({\hat{\beta }}_0=\frac{1}{M}\sum _{i=1}^Mx_i\), and \({\mathbb {E}}_XXF(X)\) is estimated by \({\hat{\beta }}_1 = \frac{1}{M(M-1)}\sum ^M_{i=1} (i-1) x_{i}\). Without loss of generality, the members \(x_{i}\) are assumed to be sorted in increasing order, and the size M of the ensemble is assumed to be greater than two.

Candille (2003) and Ferro et al. (2008) showed that the expectation of Eq. (eINT) over an infinite number of forecast/observation pairs is biased. Under conditions of stationarity of the observation and the ensemble, and of exchangeability of the members, this expectation is

$$\begin{aligned} {\mathbb {E}}_Y\widehat{\hbox {crps}}_{\mathrm{INT}}(M,Y) = {\mathbb {E}}_Y\hbox {crps}(F,Y) + \frac{1}{M}{\mathbb {E}}_{X_1,X_2}\frac{|X_1-X_2|}{2}, \end{aligned}$$
(1)

where \(X_1\) and \(X_2\) are any two distinct members of one ensemble forecast. This relation holds only when the ensemble is a random sample from F. Ferro (2014) proposed the notion of fair score for an ensemble of random values, which leads to a fourth estimator of the instantaneous CRPS, the fair CRPS defined as

$$\begin{aligned} \widehat{\hbox {crps}}_{\mathrm{Fair}}(M, y) = \frac{1}{M}\sum _{i=1}^M|x_i-y| - {\hat{\lambda }}_2 , \end{aligned}$$
(eFAIR)

where \({\hat{\lambda }}_2 = \frac{1}{2M(M-1)}\sum _{i,j=1}^M|x_i-x_j|\) estimates \({\mathbb {E}}_{X_1,X_2}\frac{|X_1-X_2|}{2}\), and is unbiased when the members are independently sampled from F.

These four estimators reduce to only two since, as shown in Appendix B

$$\begin{aligned} \widehat{\hbox {crps}}_{\mathrm{INT}}(M,y)&= \widehat{\hbox {crps}}_{\mathrm{NRG}}(M,y), \\ \widehat{\hbox {crps}}_{\mathrm{PWM}}(M,y)&= \widehat{\hbox {crps}}_{\mathrm{Fair}}(M,y). \end{aligned}$$

Hence the properties of only two estimators have to be studied. In light of the second equality, the fair CRPS can be interpreted as a PWM-based estimator of the instantaneous CRPS, which explains why it is an unbiased estimator of the average CRPS of a random ensemble, as proven by Ferro (2014). Indeed, for a random sample, the sample mean in the first term and the sample PWMs \({\hat{\beta }}_0\) and \({\hat{\beta }}_1\) in the remaining terms are unbiased, so every term in Eq. (ePWM) is an unbiased estimator of its population counterpart if the members are randomly and independently drawn from F.
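The four estimators can be compared numerically on a small example. The following Python sketch is only an illustration, not the code used in this study: the function names, the five-member ensemble and the observation are arbitrary assumptions. The integral estimator is computed by exact integration of the piecewise-constant integrand of Eq. (eINT), so that its agreement with Eq. (eNRG) is checked rather than assumed.

```python
import bisect


def mean_abs_error(members, y):
    """First term shared by the NRG, PWM and fair estimators."""
    return sum(abs(x - y) for x in members) / len(members)


def crps_int(members, y):
    """Eq. (eINT): exact integral of the squared difference of two step functions."""
    xs = sorted(members)
    m = len(xs)
    # Both step functions are constant between consecutive breakpoints.
    pts = sorted(set(xs + [y]))
    total = 0.0
    for a, b in zip(pts, pts[1:]):
        ecdf = bisect.bisect_right(xs, a) / m  # share of members <= a
        step = 1.0 if a >= y else 0.0          # value of 1(x >= y) on [a, b)
        total += (ecdf - step) ** 2 * (b - a)
    return total


def crps_nrg(members, y):
    """Eq. (eNRG)."""
    m = len(members)
    pair_sum = sum(abs(a - b) for a in members for b in members)
    return mean_abs_error(members, y) - pair_sum / (2 * m * m)


def crps_pwm(members, y):
    """Eq. (ePWM), with members sorted in increasing order."""
    xs = sorted(members)
    m = len(xs)
    beta0 = sum(xs) / m
    # enumerate gives k = i - 1 for the i-th sorted member.
    beta1 = sum(k * x for k, x in enumerate(xs)) / (m * (m - 1))
    return mean_abs_error(xs, y) + beta0 - 2 * beta1


def crps_fair(members, y):
    """Eq. (eFAIR)."""
    m = len(members)
    pair_sum = sum(abs(a - b) for a in members for b in members)
    return mean_abs_error(members, y) - pair_sum / (2 * m * (m - 1))


members = [0.3, -1.2, 0.8, 2.1, -0.4]  # arbitrary illustrative ensemble
y = 0.1                                # arbitrary illustrative observation
print(crps_int(members, y), crps_nrg(members, y))   # first pair of estimators
print(crps_pwm(members, y), crps_fair(members, y))  # second pair of estimators
```

Within each printed pair, the two values should coincide up to floating-point rounding, which is what the two equalities state.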

Moreover, the relationship

$$\begin{aligned} \widehat{\hbox {crps}}_{\mathrm{INT}}(M,y) = \widehat{\hbox {crps}}_{\mathrm{PWM}}(M, y) + \frac{{\hat{\lambda }}_2}{M} \end{aligned}$$
(2)

holds for these two estimators, as shown in Appendix B. Equation (2) holds for a single forecast/observation pair, and requires no assumption on the nature or statistical properties of the ensemble.
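As a brief check of Eq. (2), which does not replace the proof in Appendix B, the double sum in Eq. (eNRG) can be rewritten in terms of \({\hat{\lambda }}_2\) using only the definitions given above

$$\begin{aligned} \frac{1}{2M^2}\sum _{i,j=1}^M |x_i-x_j| = \frac{M-1}{M}\,{\hat{\lambda }}_2 = {\hat{\lambda }}_2 - \frac{{\hat{\lambda }}_2}{M}, \end{aligned}$$

so that

$$\begin{aligned} \widehat{\hbox {crps}}_{\mathrm{NRG}}(M,y) = \frac{1}{M}\sum _{i=1}^M|x_i-y| - {\hat{\lambda }}_2 + \frac{{\hat{\lambda }}_2}{M} = \widehat{\hbox {crps}}_{\mathrm{Fair}}(M,y) + \frac{{\hat{\lambda }}_2}{M}. \end{aligned}$$

Combined with the two equalities between estimators stated above, this gives Eq. (2). Only algebra on a single ensemble is involved, which is why no statistical assumption is needed.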

3 Study with Simulated Data

The accuracy of the two instantaneous CRPS estimators presented above, \(\widehat{\hbox {crps}}_{\mathrm{PWM}}(M, y)\) and \(\widehat{\hbox {crps}}_{\mathrm{INT}}(M, y)\), is studied with synthetic forecast/observation pairs. The forecast CDF F is chosen such that the theoretical CRPS \(\hbox {crps}(F,y)\) can be exactly computed with a closed-form expression (see Table 1 for a list of such distributions). To mimic actual situations when F is not fully known, two types of ensembles are built from this forecast CDF. The two types of ensembles successively used in the remainder of this section are random ensembles and ensembles of quantiles, defined later. The estimators are then computed and compared to the theoretical value.

3.1 CRPS Estimation with a Random Ensemble

3.1.1 Methodology

A random ensemble is a sample of M independent draws from F. In actual applications, a random ensemble may be viewed as M members from an NWP ensemble model, or, more generally, as an M-sample from Monte-Carlo simulations. Protocol 1 describes the simulation plan.

figure a

3.1.2 Results

The results are presented for a standard normal forecast CDF F. For the sake of simplicity, the CDF G of the observation is also standard normal (\(G=F\)).

Since the ensemble is random, the estimated CRPS is also a random variable that depends on the observation y and the members \(x_{i=1,\ldots ,M}\). In order to study the variability of the estimated CRPS with the ensemble only, the observation is first held constant (with a value of −0.0841427, for each n in Protocol 1), while \(N=1000\) ensembles of M members are drawn from F. The impact of M on the accuracy of the estimated CRPS is assessed by running Protocol 1 with different ensemble sizes M.

The point-wise 10, 50 and 95% intervals of the estimation error \(\hbox {crps}_{\mathrm{th}}-\widehat{\hbox {crps}}\) (with \(\hbox {crps}_{\mathrm{th}}=0.2365178\) here) are computed over these 1000 ensembles for each ensemble size M. The intervals contain the corresponding proportion of the 1000 computed CRPS errors for a given ensemble size. As shown in Fig. 1 (left) for \(\widehat{\hbox {crps}}_{\mathrm{INT}}\), the error tends toward 0 when the ensemble size increases. However, large errors (as high as \(\pm \,10\%\) of \(\hbox {crps}_{\mathrm{th}}\)) can still occur even for very large ensembles of several hundred members. As shown in Fig. 1 (right), the estimator \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) exhibits a similar behaviour for large random ensembles, as deduced from Eq. (2) if \(M\rightarrow \infty \). But \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) becomes unbiased for much smaller ensemble sizes than \(\widehat{\hbox {crps}}_{\mathrm{INT}}\). The unbiasedness of \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) proven by Ferro (2014) holds only for ensembles with more than about 20 members. The variability of the estimation, as quantified by the half-width of the 50% central interval, may be large when the random ensemble contains fewer than 50 members (more than 10% of \(\hbox {crps}_{\mathrm{th}}\), in Fig. 2). With increasing ensemble sizes, the variability of this estimation does not scale linearly with the number of members, as shown in Fig. 2. Tripling the ensemble size from \(M=100\) to about \(M=300\) decreases the half-width of the 50% central interval of the relative estimation error by only about 3 percentage points (from 7 to 4%).
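The experiment described above can be sketched in a few lines of Python. This is a hedged illustration only: Protocol 1 is not reproduced here, so the simulation loop, the function names and the seed are assumptions. The closed-form CRPS of a normal forecast is the standard expression \(\sigma \{z[2\Phi (z)-1]+2\varphi (z)-1/\sqrt{\pi }\}\) with \(z=(y-\mu )/\sigma \); for the fixed observation it should reproduce, up to rounding, the value of \(\hbox {crps}_{\mathrm{th}}\) quoted above.

```python
import math
import random
from statistics import NormalDist, quantiles

STD_NORMAL = NormalDist(0.0, 1.0)


def crps_normal(y, mu=0.0, sigma=1.0):
    """Closed-form CRPS of a normal forecast N(mu, sigma^2) for observation y."""
    z = (y - mu) / sigma
    return sigma * (z * (2.0 * STD_NORMAL.cdf(z) - 1.0)
                    + 2.0 * STD_NORMAL.pdf(z)
                    - 1.0 / math.sqrt(math.pi))


def crps_pwm(members, y):
    """PWM estimator, Eq. (ePWM)."""
    xs = sorted(members)
    m = len(xs)
    mean_abs = sum(abs(x - y) for x in xs) / m
    beta0 = sum(xs) / m
    beta1 = sum(k * x for k, x in enumerate(xs)) / (m * (m - 1))  # k = i - 1
    return mean_abs + beta0 - 2.0 * beta1


def relative_error_half_width(m, y, n_ensembles=1000, seed=0):
    """Half-width (in %) of the central 50% interval of the relative error."""
    rng = random.Random(seed)  # fixed seed: an assumption, for repeatability
    truth = crps_normal(y)
    errors = []
    for _ in range(n_ensembles):
        ensemble = [rng.gauss(0.0, 1.0) for _ in range(m)]  # random ensemble from F
        errors.append(100.0 * (truth - crps_pwm(ensemble, y)) / truth)
    q25, _, q75 = quantiles(errors, n=4)  # quartiles of the relative errors
    return (q75 - q25) / 2.0


y_fixed = -0.0841427  # fixed observation used in the text
print(crps_normal(y_fixed))
for m in (30, 100, 300):
    print(m, relative_error_half_width(m, y_fixed))
```

The printed half-widths depend on the seed and on the quantile convention, so they are only expected to be of the same order as the values read from Fig. 2.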

Fig. 1 Intervals of estimation error of \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) (left) or \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) (right) for a random ensemble of varying size. Intervals are computed point-wise, with the 1000 CRPS of independently built random ensembles with the same observation. The observation and members come from a standard normal distribution

Fig. 2 Same as in Fig. 1, but for the relative estimation error of \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\)

Common practice is to average instantaneous CRPSs over several locations and/or times. Here, this is mimicked by taking the average of N instantaneous CRPSs generated according to Protocol 1, while no longer holding the observation constant. The number of forecast/observation pairs N is varied from 1 to 1000. The size M of the ensemble is also varied, with 10, 30, 50, 100 and 300 members. The average theoretical CRPS and average estimation are computed for each combination of N and M. As shown on the left of Fig. 3 for \(\widehat{\hbox {crps}}_{\mathrm{INT}}\), a stable estimation of the average CRPS is reached if the number of averaged estimations is large enough (more than 300 for a random ensemble of 10 members). But a large ensemble is required to get an accurate estimation of the true average CRPS. As shown on the right of Fig. 3, the averaged \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) gives a better estimate than the averaged \(\widehat{\hbox {crps}}_{\mathrm{INT}}\), even for small ensembles and small numbers of averaged estimations.
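The averaging experiment can be sketched in the same way. The code below is an illustrative assumption rather than the authors' implementation: it uses the equalities of Sect. 2 to compute \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) through Eq. (eNRG) and \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) through Eq. (eFAIR). The reference value \(1/\sqrt{\pi }\) is not quoted in the text; it follows from Eq. (NRG) when the observation and the forecast are both standard normal, since \({\mathbb {E}}|X-Y|={\mathbb {E}}|X-X^\prime |=2/\sqrt{\pi }\).

```python
import math
import random


def crps_int_and_pwm(members, y):
    """Return (crps_INT, crps_PWM) for one ensemble/observation pair."""
    xs = sorted(members)
    m = len(xs)
    mean_abs = sum(abs(x - y) for x in xs) / m
    # For sorted members, sum_{i,j} |x_i - x_j| = 2 * sum_k (2k - m + 1) * x_k, k = 0..m-1.
    pair_sum = 2.0 * sum((2 * k - m + 1) * x for k, x in enumerate(xs))
    crps_int = mean_abs - pair_sum / (2 * m * m)        # Eq. (eNRG), equal to Eq. (eINT)
    crps_pwm = mean_abs - pair_sum / (2 * m * (m - 1))  # Eq. (eFAIR), equal to Eq. (ePWM)
    return crps_int, crps_pwm


def averaged_crps(m, n_pairs, seed=0):
    """Average both estimators over n_pairs pairs with G = F = N(0, 1)."""
    rng = random.Random(seed)  # fixed seed: an assumption, for repeatability
    total_int = total_pwm = 0.0
    for _ in range(n_pairs):
        y = rng.gauss(0.0, 1.0)                             # observation drawn from G
        ensemble = [rng.gauss(0.0, 1.0) for _ in range(m)]  # random ensemble from F
        c_int, c_pwm = crps_int_and_pwm(ensemble, y)
        total_int += c_int
        total_pwm += c_pwm
    return total_int / n_pairs, total_pwm / n_pairs


truth = 1.0 / math.sqrt(math.pi)  # average theoretical CRPS when G = F = N(0, 1)
for m in (10, 30, 100):
    avg_int, avg_pwm = averaged_crps(m, n_pairs=1000)
    # Relative errors in percent, with the sign convention of the text (truth - estimate).
    print(m, 100.0 * (truth - avg_int) / truth, 100.0 * (truth - avg_pwm) / truth)
```

By Eq. (2), the averaged \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) exceeds the averaged \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) by the average of \({\hat{\lambda }}_2/M\), so its relative error should be negative and shrink roughly as \(1/M\).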

Fig. 3 Evolution of the relative estimation error of the averaged \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) (left) or \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) (right) with the number of members for a random ensemble. The averaged CRPS is an arithmetic mean of the CRPS of several pairs of ensemble/observation among 1000. The vertical grey dashed lines correspond to an average computed with 30, 90 and 365 ensembles (to mimic a monthly, seasonal or yearly average CRPS)

These behaviours for the instantaneous and the averaged estimates remain true for every distribution listed in Table 1, every parameter value and even if G and F are different (not shown).

The added value of these simulations to the results of Ferro (2014) is to show the behaviour of \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) for small ensemble sizes M and finite numbers of forecast/observation pairs. The poor scaling of this estimator’s variability with the ensemble size has been empirically shown, which had never been done, to the best of our knowledge. Finding a formula for the variability of \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) would be interesting to quantify the estimation uncertainty for practical purposes. Theoretical error bounds have been demonstrated but are not usable in practice since they require knowing the forecast distribution (not shown).

The conclusion of these simulations is that, for a random ensemble, the estimation of the instantaneous CRPS is not very accurate whatever estimator is used, but the averaged CRPS can be estimated with a good accuracy. The unbiasedness of \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) for random ensembles stems from the use of estimators that are unbiased for independent samples from the underlying distribution F. In practice, if one seeks to estimate the potential performance of an ensemble with an infinite number of members, one should use the PWM estimator of the CRPS. The integral estimator of the CRPS assesses the global performance of the actual ensemble, and should be used for actual performance verification.

3.2 CRPS Estimation with an Ensemble of Quantiles

3.2.1 Methodology

An ensemble of M quantiles of orders \(\tau _{i=1,\ldots ,M}\in [0;1]\) is a set of M values \(x_{i=1,\ldots ,M}\) such that \(x_i=F^{-1}(\tau _i) \,\forall i \in \{1,\ldots ,M\}\). In contrast with a random ensemble, the orders \(\tau _i\) associated with the members \(x_i\) are known.

In this case, the data are simulated according to Protocol 2. The two built ensembles of quantiles are defined as:

  • regular ensemble (reg): it is the ensemble of the M quantiles of orders \(\tau _i\), with \(\tau _i \in \{\frac{1}{M}, \frac{2}{M},\ldots ,\frac{M-1}{M},\frac{M-0.1}{M}\}\) of F. The last order is not 1 to prevent infinite values.

  • optimal ensemble (opt): it is the set of M quantiles of orders \(\tau _i \in \{\frac{0.5}{M},\frac{1.5}{M},\ldots ,\frac{M-0.5}{M}\}\) of F. This ensemble is called “optimal” because Bröcker (2012) showed that this set of quantiles minimizes the expectation of the CRPS of an ensemble over an infinite number of forecast/observation pairs, when using Eq. (eINT).
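The two sets of orders defined above can be built and compared as follows. This Python sketch is an illustration under stated assumptions: the function names are hypothetical, the forecast CDF is the standard normal distribution, the observation is the fixed value \(y=-0.0841427\) used in this study, and the closed-form CRPS of the normal distribution serves as the reference.

```python
import math
from statistics import NormalDist

F = NormalDist(0.0, 1.0)  # standard normal forecast CDF


def regular_orders(m):
    """Orders 1/M, 2/M, ..., (M-1)/M, (M-0.1)/M."""
    return [i / m for i in range(1, m)] + [(m - 0.1) / m]


def optimal_orders(m):
    """Orders 0.5/M, 1.5/M, ..., (M-0.5)/M."""
    return [(i - 0.5) / m for i in range(1, m + 1)]


def quantile_ensemble(orders):
    """Quantiles of F for the given orders."""
    return [F.inv_cdf(tau) for tau in orders]


def _terms(members, y):
    xs = sorted(members)
    m = len(xs)
    mean_abs = sum(abs(x - y) for x in xs) / m
    # For sorted members, sum_{i,j} |x_i - x_j| = 2 * sum_k (2k - m + 1) * x_k, k = 0..m-1.
    pair_sum = 2.0 * sum((2 * k - m + 1) * x for k, x in enumerate(xs))
    return m, mean_abs, pair_sum


def crps_int(members, y):
    """Integral estimator, computed through the equal Eq. (eNRG)."""
    m, mean_abs, pair_sum = _terms(members, y)
    return mean_abs - pair_sum / (2 * m * m)


def crps_pwm(members, y):
    """PWM estimator, computed through the equal Eq. (eFAIR)."""
    m, mean_abs, pair_sum = _terms(members, y)
    return mean_abs - pair_sum / (2 * m * (m - 1))


def crps_normal(y):
    """Closed-form CRPS of the standard normal forecast."""
    return y * (2.0 * F.cdf(y) - 1.0) + 2.0 * F.pdf(y) - 1.0 / math.sqrt(math.pi)


y = -0.0841427
truth = crps_normal(y)
for m in (10, 20, 50):
    for name, orders in (("reg", regular_orders(m)), ("opt", optimal_orders(m))):
        ens = quantile_ensemble(orders)
        # Relative errors in percent for the integral and PWM estimators.
        print(m, name,
              round(100.0 * (truth - crps_int(ens, y)) / truth, 2),
              round(100.0 * (truth - crps_pwm(ens, y)) / truth, 2))
```

For this observation, the integral estimator with optimal orders should show the smallest relative error, and the PWM estimator should always lie below the integral estimator, as implied by Eq. (2).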

figure b

3.2.2 Results

Relative estimation errors of \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) and \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) have been computed for a fixed observation (\(N=1\), \(y=-\,0.0841427\)) and regular and optimal ensembles, all built from a standard normal distribution (\(G=F\) for the sake of simplicity). As shown in Fig. 4, the CRPSs estimated with quantile ensembles clearly outperform the \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) estimation with one random ensemble whatever the number of members. Averaging the \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) estimations of 1000 random ensembles gives an estimation accuracy similar to that of the best estimation with quantile ensembles, namely \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) with optimal quantiles. This configuration is not feasible in most applications, since it requires 1000 forecast/observation pairs with the same observation. In any case, computing one set of quantiles may be much simpler and quicker than creating 1000 random ensembles. Among the estimations with ensembles of quantiles, the combination of \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) and optimal quantiles exhibits a dramatic improvement in accuracy over the other combinations, even for ensembles with fewer than 10 quantiles. Whatever distribution F is used, \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) computed with the optimal quantiles gives a much more accurate estimation, for all ensemble sizes, than the other combinations of estimator and type of ensemble of quantiles (not shown).

Fig. 4 Evolution with ensemble size of the relative error of several estimations of the CRPS, for different ensembles and different estimators of the CRPS. All computations are done with the same observation for all forecasts. The ensembles and the observation come from a standard normal distribution

In order to assess the robustness of the remarks in the last paragraph with regard to the observation, data are simulated with Protocol 2 for several ensemble sizes M, with \(N=1000\) ensemble forecast/observation pairs for each ensemble size. Note that, for a fixed M, the ensemble of quantiles is the same for all the forecast/observation pairs. From the point-wise intervals of the relative estimation errors represented in Fig. 5, it appears that computing \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) with the optimal quantiles gives the most accurate estimation of \(\hbox {crps}(F,y)\), whatever number of quantiles is used. With only a few tens of quantiles, this estimation achieves a much higher precision than the others with several hundred quantiles. Figure 5 also shows that, for finite ensembles of quantiles, the PWM estimator is biased, being too low (positive relative errors). Indeed, according to Eq. (2), since \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) is an unbiased estimator of the average CRPS of an ensemble of quantiles as shown here, and since \(\widehat{\lambda }_2\) is positive, \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) must be biased towards low values.

Fig. 5 10, 50, 95 and 100% point-wise intervals of relative error for several combinations of quantile ensemble and CRPS estimator. Intervals are computed by drawing 1000 observations from a standard normal distribution. Ensembles are regular (left column) or optimal (right column) quantiles of a standard normal distribution. The CRPS is estimated with the PWM (top) or integral (bottom) estimator

These conclusions hold for all the distributions tried and for all the sets of parameter values tried for each distribution (not shown). As for the poor performance of \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) with an ensemble of quantiles, let us recall that \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) is a sum of terms that are unbiased estimators of their population counterpart when computed with a random sample, which an ensemble of quantiles is not. The computation of \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) uses the approximation of the forecast distribution as a step-wise CDF, with a fixed stair-step height \(\frac{1}{M}\). The difference in estimation accuracy with the type of quantiles comes from the position of the stair steps. With regular quantiles, the step-wise CDF is always located below the forecast CDF. With optimal quantiles, the associated quantiles are shifted leftward, making the stair steps sometimes above F and sometimes below. This approximates the forecast CDF F better than with regular quantiles, and thus improves the estimation of the CRPS.

3.2.3 Influence of Ties in an Ensemble of Quantiles

An ensemble of quantiles may be produced by statistical methods called quantile regression (White 1992; Koenker 2005; Meinshausen 2006; Takeuchi et al. 2006). Some of these quantile regression methods can produce only a subset \(\tau ^{av}_{j=1,\ldots ,N_{\tau }} \in [0;1]\) of \(N_\tau \) orders. The quantiles associated with these available orders are called “available quantiles” hereafter and correspond to the abscissa of the black dots in Fig. 6. If one requests a quantile with an order \(\tau \) outside of the subset of available orders, the quantile regression will not return the associated quantile of the forecast CDF (abscissa of the blue circles in Fig. 6), but the available quantile corresponding to the highest available order lower than \(\tau \) (abscissa of the red triangles of Fig. 6). The set of different values returned by the quantile regression method when certain orders are requested is called the “unique quantiles” hereafter. It is a subset of the available quantiles. The quantile regression methods with this feature will introduce many ties in the produced ensembles of quantiles, as shown in Fig. 7 on real data. For the Canadian ensemble forecasts, although 1002 regular quantiles are requested from a quantile regression method at one grid point and one lead time, the number of unique quantiles returned by the quantile regression function varies from a few tens of values to a few hundred. On average, only about one hundred unique quantiles are produced in this example. Some implementations of quantile regression methods, such as the function rq in the R package quantreg, have an option to produce the available orders \(\tau _j^{av}\) and their associated quantiles. Other packages, such as quantregForest, have not yet implemented this possibility, and will return only forecast quantiles with (potentially many) ties.

Fig. 6 Graphical illustration of the production of ties by quantile regression methods. The black continuous line is the forecast CDF. The abscissa (resp. ordinates) of the \(N_\tau =4\) black dots are the available quantiles (resp. orders) that the quantile regression method can produce. The empty blue dots are the five requested points. The red triangles are the five points actually obtained, due to the limited number of available quantiles and orders. Within each group of obtained points whose abscissa is the same, only the point with the lowest order is kept (three red diamonds) for removing the ties by interpolation. The interpolation function (dashed red line) is a linear interpolation between the red diamonds, and a constant order of 0 or 1 outside (left and right, respectively)

Fig. 7 Number and percentage of unique quantiles among 1002 regular quantiles requested from a quantile regression method applied to the Canadian ensemble model

Fig. 8 Same as in Fig. 5 but with ties in the ensembles and only \(N_\tau = 30\) available orders (left), and after removing ties by linear interpolation of the unique quantiles in the forecasts (right)

Fig. 9 Same as in Fig. 5, but with ties in the ensembles, only \(N_\tau = 30\) available orders, and linear interpolation of the \(N_\tau \) available quantiles

Fig. 10 Influence of post-processing. The ensemble quantiles are post-processed by linear interpolation between unique quantiles (linjitter) or between the \(N_\tau \) available quantiles (fulljitter). Each panel represents the same intervals as in Fig. 5 for \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) computed from post-processed quantile ensembles with a varying number \(N_\tau \) of available quantiles

In order to assess the impact of ties on the accuracy of the CRPS estimators for an ensemble of quantiles, ensembles of quantiles with ties are simulated with Protocol 3, with \(N=1000\) forecast/observation pairs. The left side of Fig. 8 shows that with only \(N_\tau =30\) available orders, the four estimates become inaccurate. The distribution of the estimated CRPS becomes clearly biased whatever ensemble size is considered. This bias is pessimistic (negative estimation errors) for most ensemble sizes, but may be optimistic (positive estimation errors).

figure c

A way to address this issue of equal quantiles is to remove the ties by interpolation. The first case considered is when the implementation of the quantile regression method does not give access to the available quantiles. Protocol 3 is modified as follows at lines 3 and 4: after computing the quantiles with ties, linear interpolation is done between unique values to recover the number of requested regular or optimal quantiles, as explained in Fig. 6. As shown on the right side of Fig. 8, this interpolation results in a better estimation accuracy, even though the curves are less smooth than when all orders are available (compare with Fig. 5). The best CRPS estimation is now obtained with \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) and regular quantiles, with at least \(M=30\) regular quantiles to get a sufficient accuracy. This behavior barely depends on the chosen distribution and parameter value, but 100 regular quantiles seems to be the minimal number required to get satisfactory accuracy, whatever forecast distribution F is used (not shown). If the available quantiles and orders can be produced by the implementation of the quantile regression method, a similar linear interpolation can be done relative to the associated points, that is, the black dots in Fig. 6. Figure 9 shows that this linear interpolation nearly fully reproduces the good accuracy obtained when all orders are available. The best estimation strategy is again to use \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) with optimal quantiles, albeit with a slightly worse accuracy than the one reached without ties.
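The production of ties and their removal by interpolation between unique quantiles can be sketched as follows. This Python code is only one plausible reading of Fig. 6, not the implementation used in this study: the function names, the set of available orders, the fallback for orders below the lowest available one, and the clamping of the quantiles outside the range of the kept points are all assumptions.

```python
import bisect
from statistics import NormalDist


def quantiles_with_ties(requested_orders, available_orders, inv_cdf):
    """Mimic a quantile regression that can only return the available quantiles."""
    available_quantiles = [inv_cdf(tau) for tau in available_orders]
    out = []
    for tau in requested_orders:
        # Highest available order not exceeding tau; orders below the lowest
        # available one fall back to the lowest available quantile (assumption).
        j = max(bisect.bisect_right(available_orders, tau) - 1, 0)
        out.append(available_quantiles[j])
    return out


def remove_ties(requested_orders, tied_quantiles):
    """Interpolate linearly between unique quantiles paired with their lowest order.

    Both inputs must be sorted in increasing order.
    """
    knots_tau, knots_x = [], []
    for tau, x in zip(requested_orders, tied_quantiles):
        if not knots_x or x != knots_x[-1]:  # first, i.e. lowest, order of each group
            knots_tau.append(tau)
            knots_x.append(x)
    out = []
    for tau in requested_orders:
        if tau <= knots_tau[0]:       # left of the kept points: clamp (assumption)
            out.append(knots_x[0])
        elif tau >= knots_tau[-1]:    # right of the kept points: clamp (assumption)
            out.append(knots_x[-1])
        else:
            k = bisect.bisect_right(knots_tau, tau) - 1
            w = (tau - knots_tau[k]) / (knots_tau[k + 1] - knots_tau[k])
            out.append(knots_x[k] + w * (knots_x[k + 1] - knots_x[k]))
    return out


F = NormalDist(0.0, 1.0)
n_tau, m = 30, 100
available = [(j - 0.5) / n_tau for j in range(1, n_tau + 1)]  # hypothetical available orders
requested = [i / m for i in range(1, m)] + [(m - 0.1) / m]    # regular orders
tied = quantiles_with_ties(requested, available, F.inv_cdf)
untied = remove_ties(requested, tied)
print(len(set(tied)), len(set(untied)))  # number of unique values before and after
```

The interpolated ensemble can then be passed to the integral estimator. When the available orders and quantiles are accessible, the same interpolation can be applied to those points instead of the unique quantiles.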

The influence of the number of available orders \(N_\tau \) and the kind of post-processing on \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) is crucial, as shown in Fig. 10. If the number of available quantiles is too low, no matter the post-processing of the quantile ensemble, the estimated CRPS will not converge to the true value due to insufficient information about F. The number of available quantiles necessary to achieve a good accuracy depends on the complexity of the forecast distribution: a Gaussian mixture with many different modes requires more available quantiles to be accurately described (not shown here).

Based on these simulations, several recommendations can be drawn to estimate the instantaneous CRPS of an ensemble of quantiles. First, if the quantile regression cannot yield enough available quantiles (fewer than about \(N_\tau =30\)), the instantaneous CRPS should not be used at all. Even the average CRPS should be used with care due to a (possibly large) estimation bias. However, if the number of available unique quantiles is sufficient (more than 30), the estimation of the instantaneous CRPS can be much improved by interpolating the quantiles and using \(\widehat{\hbox {crps}}_{\mathrm{INT}}\). The best interpolation depends on the available information: if the whole set of available quantiles in the quantile regression method is not accessible, linear interpolation between the unique quantiles and their associated order toward regular quantiles is preferred. However, if the available quantiles and orders can be known, linear interpolation of those quantiles and orders toward optimal quantiles is the best approach. Table 2 sums up the recommendations to estimate the instantaneous CRPS for a random ensemble or an ensemble of quantiles.

Table 2 Summary of recommendations to estimate the CRPS

4 Real Data Examples

With two real data sets, issues resulting from the uncertainty in the estimation of the instantaneous CRPS are illustrated. The practical benefits of the recommendations listed in Table 2 are highlighted.

4.1 Raw and Calibrated Ensemble Forecast Data Sets

The first forecast data set consists of four NWP ensembles from the TIGGE project (Bougeault et al. 2010). Wind speed forecasts at a height of ten meters have been extracted from four operational ensemble models issued by meteorological forecast services: the US National Centers for Environmental Prediction (NCEP), the Canadian Meteorological Center (CMC), the European Centre for Medium-Range Weather Forecasts (ECMWF) and Météo-France (MF). Those ensembles have respectively 21, 21, 51 and 35 members. The study domain is France with a grid size of 0.5\(^\circ \) (about 50 km), for a total of 267 grid points. Forecast lead times are available every 6 h. The period goes from 2011 to 2014.

The second forecast data set is composed of two versions of each ensemble calibrated with statistical post-processing methods. In order to improve the forecast performance, each ensemble has been post-processed with two statistical methods: nonhomogeneous regression [NR, Gneiting et al. (2005)] and quantile regression forests [QRF, Meinshausen (2006)]. In NR, the forecast probability distribution F is assumed to be some known distribution: here the square root of forecast wind speed follows a truncated normal distribution whose mean and variance depend on the ensemble forecast. This is similar to the work of Hemri et al. (2014), who also give the closed-form expression of the instantaneous CRPS for this case. QRF is nonparametric and yields a set of quantiles \(x_i\) with chosen orders \(\tau _i\). This study uses a simplified version of the model proposed in Taillardat et al. (2016). Since QRF is nonparametric, the CRPS has to be estimated with limited information. Furthermore, QRF cannot yield every order and may lead to many ties among predicted quantiles, as seen in Fig. 7. To the best of our knowledge, no implementation of QRF in R gives access to the available quantiles. Post-processing was done separately for each of the 267 grid points, each ensemble and each lead time. The regression was trained with cross-validation: 3 years were used as training data, the fourth one being used as test data. The four possible combinations of three training years and one test year were tested. The raw ensembles can be seen as random ensembles whereas the ensembles calibrated with QRF are ensembles of quantiles as defined above. The observation comes from a wind speed analysis made at Météo-France, presented in Zamo et al. (2016).

4.2 Issues Estimating the CRPS of Real Data

In Figs. 11 and 12, the CRPS is estimated with the first M members of the raw CMC ensemble at one grid point and for lead time +42 h. First, as shown in Fig. 11, for very small ensemble sizes, differences between \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) and \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) may be huge. As the ensemble size increases, the two estimators take increasingly similar values. Even for the largest number of members, \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) is systematically higher than \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\), in agreement with Eq. (2). These discrepancies produce substantial differences in the averaged CRPS, as shown in Fig. 12, which represents the evolution with M of the yearly averaged \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) and \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\). Whereas the yearly averaged \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) is nearly independent of M, the average \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) requires a minimum ensemble size to yield a stable value. But even then, the two estimators do not yield the same average CRPS: for the year 2011, on average \(\widehat{\hbox {crps}}_{\mathrm{INT}}(M=21) \simeq 0.75\,\hbox {m/s}\) whereas \(\widehat{\hbox {crps}}_{\mathrm{PWM}}(M=21) \simeq 0.7\,\hbox {m/s}\), a difference of 7%. These conclusions from Fig. 12 are in agreement with those from Fig. 3, which shows that the average \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) attains the true value with much smaller ensembles than \(\widehat{\hbox {crps}}_{\mathrm{INT}}\). The left side of Fig. 3 exhibits negative estimation errors, which is consistent with the averaged \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) being higher than the averaged \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) in Fig. 12 and with Eq. (2).
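The systematic gap between the two estimators can be checked numerically with a minimal sketch. The expressions below are the commonly used pairwise-difference form of the integral estimator and the probability weighted moment form, assumed here to match Eqs. (eINT) and (ePWM), which are not restated in this section. The function names `crps_int` and `crps_pwm` are illustrative.

```python
import numpy as np


def crps_int(members, obs):
    """Integral estimator: mean absolute error minus the mean pairwise
    absolute difference weighted by 1 / (2 M^2)."""
    x = np.asarray(members, dtype=float)
    pair_sum = np.abs(x[:, None] - x[None, :]).sum()
    return np.abs(x - obs).mean() - pair_sum / (2.0 * x.size**2)


def crps_pwm(members, obs):
    """PWM estimator: mean absolute error + beta_0 - 2 * beta_1, where
    beta_0 and beta_1 are sample probability weighted moments."""
    x = np.sort(np.asarray(members, dtype=float))
    m = x.size
    beta0 = x.mean()
    # Zero-based ranks (i - 1) weight the sorted members.
    beta1 = np.sum(np.arange(m) * x) / (m * (m - 1))
    return np.abs(x - obs).mean() + beta0 - 2.0 * beta1


# Synthetic 21-member ensemble (not the CMC data) and an arbitrary observation.
rng = np.random.default_rng(0)
ensemble = rng.normal(size=21)
gap = crps_int(ensemble, 0.3) - crps_pwm(ensemble, 0.3)
print(f"INT - PWM = {gap:.4f}")
```

With these forms, the gap equals the mean pairwise absolute difference divided by \(2M^2(M-1)\) times the number of pairs, so it is non-negative and shrinks as M grows. For the two-member ensemble (0, 1) and observation 0.5, the integral estimator gives 0.25 and the PWM estimator gives 0.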

Fig. 11

Scatter plots of instantaneous CRPS computed for raw ensemble forecasts, with the two CRPS estimators. The forecasts are for one grid point of the Canadian ensemble forecast model. The number of members goes from 2 to 21 (the actual size of the ensemble). Each point corresponds to one forecast (one date and valid time)

Fig. 12

Evolution of the yearly averaged CRPS with the number of members for the raw CMC ensemble. Each panel contains the average CRPS computed by averaging the instantaneous CRPS estimator, \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) (left) or \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) (right). Each curve is computed by averaging the estimated instantaneous CRPS over one test year, for forecasts at one grid point and for one lead time

Figure 13 uses the version of the CMC ensemble calibrated with QRF. For each of the two sets of quantiles, \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) and \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) are computed for each forecast date and averaged over each test year. The number M of requested quantiles is varied from 2 to 50, and the quantiles are of either regular or optimal orders. Figure 13 shows the evolution of the four estimated average CRPS with the number of quantiles, for the same grid point and lead time as above. First, the average \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) decreases rapidly toward some value, whatever the type of quantile. Second, unlike in Fig. 12, where it was independent of the number of members, the yearly averaged \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) depends on the number of quantiles: it slowly increases toward some value for a fixed type of quantile. Third, the limit values are on average \(\widehat{\hbox {crps}}_{\mathrm{PWM}}(50) \simeq 0.48\,\hbox {m/s}\) and \(\widehat{\hbox {crps}}_{\mathrm{INT}}(50) \simeq 0.47\,\hbox {m/s}\), a difference of only 2%. Last, the rate of evolution of the average CRPS with the ensemble size strongly depends on the choice of the CRPS estimator and of the type of requested quantiles. For these data, removing ties in the forecast quantiles does not change the conclusions (not shown). In agreement with the recommendations from the simulated data, the fastest converging estimate is the average \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) computed with optimal quantiles. Other ensembles, grid points and lead times give similar results (not shown).
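The effect of the quantile orders on the integral estimator can be illustrated on a toy case where the exact CRPS is known. The sketch below uses a standard normal forecast, not the wind speed data, and relies on several assumptions. The optimal orders are taken as \((i-0.5)/M\). The equally spaced orders \(i/(M+1)\) are only a stand-in for the regular orders, whose exact definition is given earlier in the paper and may differ. All function names are illustrative.

```python
from math import pi, sqrt
from statistics import NormalDist

import numpy as np

STD_NORMAL = NormalDist()  # toy forecast distribution F = N(0, 1)


def crps_int(members, obs):
    """Integral estimator in its pairwise-difference form (assumed form)."""
    x = np.asarray(members, dtype=float)
    pair_sum = np.abs(x[:, None] - x[None, :]).sum()
    return np.abs(x - obs).mean() - pair_sum / (2.0 * x.size**2)


def exact_crps(obs):
    """Closed-form CRPS of a standard normal forecast."""
    cdf, pdf = STD_NORMAL.cdf(obs), STD_NORMAL.pdf(obs)
    return obs * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / sqrt(pi)


def optimal_orders(m):
    """Orders (i - 0.5) / M, assumed to be the 'optimal' orders."""
    return (np.arange(1, m + 1) - 0.5) / m


def equispaced_orders(m):
    """Orders i / (M + 1): an assumed stand-in for the 'regular' orders."""
    return np.arange(1, m + 1) / (m + 1)


def quantiles(orders):
    """Quantiles of the toy forecast at the given orders."""
    return np.array([STD_NORMAL.inv_cdf(float(t)) for t in orders])


def mean_error(orders, n_obs=2000):
    """Mean of (estimated - exact) CRPS over observations spread along F."""
    ens = quantiles(orders)
    obs_grid = quantiles(optimal_orders(n_obs))
    return float(np.mean([crps_int(ens, y) - exact_crps(y) for y in obs_grid]))


for m in (5, 10, 20, 50):
    print(m, mean_error(equispaced_orders(m)), mean_error(optimal_orders(m)))
```

In this toy setting the mean error of the integral estimator is smaller with the \((i-0.5)/M\) orders than with the equally spaced ones for a given M, which is the qualitative behavior reported for the real data.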

Fig. 13

Evolution with the number of members of the estimated CRPS averaged over 1 year for CMC ensemble forecast calibrated with quantile regression forests. Two sets of quantiles are requested: regular (left) and optimal (right). Ties between quantiles are not removed. Equations (eINT) (top) and (ePWM) (bottom) are used to estimate the instantaneous CRPS for each quantile set. Each curve is then computed by averaging the estimated instantaneous CRPS over 1 year, for forecasts at one grid point and for one lead time

4.3 Issues on the Choice Between QRF and NR

For the real data set, the CRPS of QRF has been estimated with \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) and \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) computed with optimal quantiles, and ties have been kept or removed by interpolation. Figure 14 shows the proportion of times QRF gets a lower CRPS than NR, out of the 365 forecasts during test year 2012, for one grid point and one lead time with calibrated CMC data. The proportion of times QRF outperforms NR strongly depends on the number of quantiles, but stabilizes at similar values when \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) or \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) is used. In agreement with the conclusions on simulated data, the proportion stabilizes with fewer quantiles when \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) is used. With too few quantiles (fewer than about 20), the difference of performance between QRF and NR may be deemed significant depending on the estimator. But in this specific case, after the curves have stabilized, the performances of QRF and NR are not statistically different at the 0.01 level for any of the estimations. This shows that the choice of the best post-processed forecast may be misguided by poor performance estimates if the wrong estimator is used and/or not enough quantiles are requested. The number of available quantiles is unknown, but has been estimated to be at least 52 for this test year. Based on the recommendations in Table 2, the best method to estimate the CRPS of QRF would be to use \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) and at least 30 optimal quantiles, which is in agreement with the previous remarks.
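Removing ties by interpolation can be sketched as follows. The exact procedure is not detailed in this section, so this sketch makes an explicit assumption: each run of tied quantiles is replaced by a single point whose order is the mean of the tied orders, and the requested quantiles are then obtained by linear interpolation between these unique points. The function name `remove_ties` is hypothetical.

```python
import numpy as np


def remove_ties(orders, quantiles):
    """Linearly interpolate between unique quantile values.

    Assumption: a run of tied quantiles is represented by one point located
    at the mean of the tied orders. Inputs must be sorted by increasing order.
    """
    orders = np.asarray(orders, dtype=float)
    quantiles = np.asarray(quantiles, dtype=float)
    unique_values, inverse = np.unique(quantiles, return_inverse=True)
    counts = np.bincount(inverse)
    mean_orders = np.bincount(inverse, weights=orders) / counts
    # np.interp clamps outside [mean_orders[0], mean_orders[-1]], so ties
    # located in the extreme tails may remain.
    return np.interp(orders, mean_orders, unique_values)


orders = np.array([0.125, 0.375, 0.625, 0.875])
tied = np.array([1.0, 2.0, 2.0, 3.0])  # made-up quantiles with one tie
print(remove_ties(orders, tied))
```

On this made-up example the two tied values are spread to about 1.67 and 2.33, giving a strictly increasing set of quantiles. Other conventions for placing the unique points would give different interpolated values.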

Fig. 14

Proportion of forecasts for which QRF gets a lower CRPS than NR, for calibrated CMC ensemble, at one grid point, for one lead time and one test year. QRF yields M optimal quantiles and its CRPS is estimated with \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) (continuous line) or \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) (dashed line), without removing ties (black curves) or after removing ties with linear interpolation between unique quantiles (red curves). NR’s CRPS is computed with the closed form expression available in Hemri et al. (2014). The grey zone is the interval within which the proportion is not significantly different from 0.5 at the 0.01 level (quantiles 0.005 and 0.995 of a binomial distribution with 365 trials)
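The bounds of the grey zone can be reproduced with a short sketch using only the Python standard library. It assumes a success probability of 0.5, as implied by the null hypothesis that neither method wins more often, and defines a quantile as the smallest count whose cumulative probability reaches the requested level. The function name `binomial_quantile` is illustrative.

```python
from math import comb


def binomial_quantile(n_trials, p_success, level):
    """Smallest k such that P(X <= k) >= level for X ~ Binomial(n, p)."""
    cumulative = 0.0
    for k in range(n_trials + 1):
        cumulative += (
            comb(n_trials, k) * p_success**k * (1.0 - p_success) ** (n_trials - k)
        )
        if cumulative >= level:
            return k
    return n_trials


N_FORECASTS = 365  # number of forecasts in the test year, as in the caption
low = binomial_quantile(N_FORECASTS, 0.5, 0.005)
high = binomial_quantile(N_FORECASTS, 0.5, 0.995)
print(f"non-significance band: [{low / N_FORECASTS:.3f}, {high / N_FORECASTS:.3f}]")
```

A proportion of QRF wins falling inside this band would not be judged different from 0.5 at the 0.01 level.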

5 Conclusions

A review of four estimators of the instantaneous CRPS when the forecast CDF is known through a set of values has been carried out. Among these four estimators proposed in the literature, only two, called the integral estimator and the probability weighted moment estimator, are not equal. Furthermore, a relationship between these two estimators has been demonstrated; it generalizes, to the instantaneous CRPS of any ensemble, a relationship established by Ferro et al. (2008) for the average CRPS of a random ensemble. With simulated data, the accuracy of the two estimators has been studied when the forecast CDF is known with limited information and the number of forecast/observation pairs is finite. The study leads to recommendations on the best CRPS estimator depending on the type of ensemble, whether random or a set of quantiles. For a random ensemble, the best estimator of the CRPS is the PWM estimator \(\widehat{\hbox {crps}}_{\mathrm{PWM}}\) if one wants to assess the performance of the ensemble of infinite size, whereas the integral estimator \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) must be used to assess the performance of the ensemble with its current size. For an ensemble of quantiles, ties introduced by quantile regression methods strongly affect the estimation accuracy, and removing these ties by an interpolation step is essential for accurate estimation. If the number of available quantiles is too low (say, \(N_{\tau } \le 30\)), all the studied estimators exhibit a strong bias. But if the number of available quantiles is larger, the best estimation is obtained by computing the integral estimator \(\widehat{\hbox {crps}}_{\mathrm{INT}}\) with linearly interpolated quantiles, between the available quantiles if they are known or between the unique quantiles otherwise.

The established relationships between the estimators proposed in the literature have been linked to previous results. These relationships also explain why an estimator is more accurate for one type of ensemble and not for the other. The PWM estimator performs better on random ensembles because it is based on estimators that are unbiased for independent samples from the true underlying distribution. On the other hand, the integral estimator gives a good estimate when computed with optimal quantiles. This is because regular weights are associated with the members in the estimator formula but, when using optimal quantiles, the associated quantiles are shifted to better approximate the underlying forecast CDF.

The important consequences of the choice of CRPS estimation method have also been illustrated on real meteorological data with raw ensembles and calibrated ensembles. As an example, the comparison of several calibrated ensembles may be misled by a poor estimate of the average CRPS of ensembles of quantiles.