1 Introduction

Meteorological disasters caused by extreme weather and extreme climate events have been one of the major problems facing humanity since the twentieth century. For example, the flood disaster in the Yangtze-Huaihe River Basin of China in the summer of 1998 and the freezing rain and snow disaster in southern China in the winter of 2008 (National Climate Center 1998; Li and Gu 2010) caused great losses to national economic and social development. Numerical models have made great progress in short-term and medium-range weather forecasts and in forecasts at time scales longer than a season. However, there is still a gap between the 2-week forecasts and seasonal forecasts. Therefore, the 15–60 day sub-seasonal to seasonal (S2S) prediction is receiving increasing attention worldwide (Vitart et al. 2017; Zhou et al. 2019).

In 2015, the World Weather Research Program and the World Climate Research Program jointly launched an international project on S2S prediction (Vitart et al. 2017). This project aims to enhance the 15–60 day prediction skills, improve the overall understanding of high-impact weather events, such as the tropical low-frequency Madden–Julian Oscillation, monsoon, and extreme precipitation, and promote the relevant research carried out by international operational forecast centers and institutions. At present, the operational forecast centers in 11 countries have participated in the S2S project and released a large amount of historical hindcast and real-time forecast data from their models (http://www.s2sprediction.net/). The China Meteorological Administration (CMA) also participated in the S2S project, submitted the experiment data based on the Beijing Climate Center Climate System Model (BCC_CSM1.2) of the National Climate Center, and took on the role of the Asian database center for the S2S project. As one of the important issues in the S2S project, the sub-seasonal prediction (15–60 day) of precipitation has attracted wide attention (Ebert 2001). However, the understanding of the predictability and prediction methods of precipitation is limited.

Numerical weather models now perform well in daily forecasts of geopotential height, air temperature, and precipitation at lead times of about a week (Saha and Van den Dool 1988; Qin and Van den Dool 1996; Buizza 2008; Schmeits and Kok 2010). By considering the interactions of the atmosphere with the ocean, land surface, and sea ice, climate models can predict the average and variability of meteorological factors at time scales longer than a season (Collins and Coauthors 2006; Wu et al. 2013). However, due to the chaos in the atmosphere (Lorenz 1963, 1982; Chou 1989; Hoffman 2002), the inevitable initial value errors and model errors lead to forecast biases in the weather and climate models. Thus, the ensemble mean and probabilistic forecast based on multiple initial values or multiple models are usually carried out to represent the forecast uncertainty arising from initial value errors and model errors (Gneiting and Raftery 2005). In the last 20 years, the ensemble forecasting methods such as the Monte Carlo forecast method (Leith 1974), Time-Lagged Average Forecast method (LAF, Hoffman and Kalnay 1983), breeding of growing modes method (Toth and Kalnay 1993, 1997), singular vectors method (Molteni et al. 1996), ensemble Kalman filter method (Houtekamer and Mitchell 1998), stochastically perturbed parameterization tendencies method (Buizza et al. 1999a, b), multi-model ensemble prediction method (Fritsch et al. 2000), and machine learning approach (Hwang et al. 2019) have gradually become important tools to improve the skill of weather forecasts, S2S prediction, and even long-term climate change simulation in the national operational centers, which include the National Centers for Environmental Prediction (NCEP), the European Centre for Medium-Range Weather Forecasts (ECMWF), the United Kingdom Met Office (UKMO), the Japan Meteorological Agency (JMA), and the CMA (Sivillo et al. 1997; Moore and Kleeman 1998; Krishnamurti et al. 2000; Yang 2001; Buizza 2019; Zhang et al. 2021).

For the prediction from day 7 to 60, previous studies have discussed the influence of ensemble forecasting methods on the prediction skill of geopotential height, wind, and air temperature. These methods include the ensemble probabilistic prediction (Pan and Van den Dool 1998; Chessa and Lalaurette 2001), the conditional nonlinear optimal perturbation ensemble (defined as a kind of initial perturbation which makes the cost function attain its maximum under an initial constraint condition; Jiang et al. 2009), the weather type ensemble forecast (designed for the purpose of post-processing forecast output from ensemble prediction systems and understanding how forecast models perform under different circulation types; Neal et al. 2016), and the predictability-based extended-range ensemble prediction (proposed for the predictable components and random components obtained with different ensemble prediction strategies; Zheng et al. 2012).

Other studies have analyzed and evaluated the impact of ensemble forecast methods on the sub-seasonal precipitation prediction skill (Hamill et al. 2004; Whitaker et al. 2006; Vitart and Molteni 2009; Jie et al. 2013; Bombardi et al. 2017; Liang and Lin 2018; Li et al. 2019). However, the improvement of weekly to sub-seasonal precipitation prediction is limited compared to other time-scale forecasts (Tan and Chen 2013; Jie et al. 2013), and the statistical postprocessing ensemble of precipitation is far more challenging than that of weather variables like surface temperature or wind speed (Scheuerer 2014). At present, both the ensemble probabilistic forecast method and ensemble mean method are not good enough in the day 7 to sub-seasonal precipitation prediction due to the excessive increase of the ensemble spread after 1 week (Jie et al. 2014). For the ensemble probabilistic forecast, a few of the ensemble probabilistic thresholds become less skillful with lead times (Buizza et al. 1999a, b; Hamill et al. 2008). In the late period of forecasting (high lead times), the precipitation is significantly underestimated (overestimated) by the forecast with a high (low) probabilistic threshold (Jie et al. 2014). For the ensemble mean forecasting, the precipitation bias is more likely to be caused by the false extreme precipitation predicted by a certain ensemble member if the ensemble size is not large enough (Jie et al. 2014). Considering this, Jie et al. (2014) proposed a method of Deterministic Ensemble Forecast using a Probabilistic Threshold (DEFPT), which selects ensemble members through a certain ensemble probabilistic threshold. It can greatly improve the 6–15 day forecast skill in summer precipitation of different intensities in China, although the spatio-temporal variation of the probabilistic threshold is not considered and only the applicability of this method in a time-lagged ensemble system is verified. 
Meanwhile, it can avoid the influence on the ensemble forecast of a false extreme precipitation value forecasted by a certain ensemble member. However, the DEFPT precipitation prediction still shows deviations to some extent when the same probabilistic threshold is applied to the ensemble members in all regions, which may be related to the regionally different systematic forecast errors of the model.

In this work, quantitative objective statistical methods are used to explore the spatio-temporal variation of the available probabilistic forecast information in the sub-seasonal forecast. The credible ensemble members (i.e. members with smaller biases and higher skill) are selected based on this spatio-temporal variation, and the optimal ensemble strategy is provided for different regions to be used in the S2S precipitation ensemble forecast. The applicability of the method in different S2S operational models is verified. This article is organized as follows: model data and ensemble methods are introduced in Sect. 2; the verification and evaluation of the ensemble methods are provided in Sect. 3; and Sect. 4 provides a discussion and conclusion.

2 Model data and ensemble methods

2.1 Model data

In this study, the precipitation data from hindcast experiments of eleven operational prediction models in the S2S project are used (Table 1). All the data cover the period of 1999 to 2010, and are downloaded from http://www.s2sprediction.net/. Although the S2S models have different horizontal resolutions, each operational center uploaded model output to the S2S database archiving centers with a unified horizontal resolution of 1.5° × 1.5°, except for the BoM model with a lower resolution of 2.5° × 2.5°. As the Institute of Atmospheric Sciences and Climate of the Italian National Research Council only provided a single re-forecast sample and there are some errors in the ensemble forecast data submitted by the Hydrometeorological Center of Russia (Jie et al. 2017), the ensemble forecasting methods are evaluated and analyzed based on only nine operational models. In this study, the observed rain-gauge data over China are interpolated to the corresponding horizontal resolution of each S2S model, and daily accumulated precipitation from each S2S model is analyzed. In addition, eight models with longer re-forecast length than the NCEP (1999–2010) are further examined (Table 3).

Table 1 Operational models of the S2S project

2.2 Ensemble methods

In this paper, the Deterministic Ensemble Forecast using an Optimal Probabilistic Threshold (DEFOPT) method is proposed for the S2S (15–60 days) precipitation prediction. The DEFOPT is different from the traditional probabilistic forecast method as it does not predict the probability of precipitation event in each grid, but it uses the available probabilistic forecasting information in the S2S real-time scale to decide how many ensemble members predicting the occurrence of rainfall event should be trusted, and then determines the optimal ensemble forecast in different regions. The details are as follows.

First, in order to avoid excessive overestimation or underestimation in the ensemble probabilistic forecast of the rainfall event at a certain intensity (e.g. precipitation with the threshold of 1 mm) at each grid point, the rainfall forecasting frequency bias for the probabilistic threshold Pc is limited by a quantitative objective evaluation method, the BIA score (see Appendix 1 for details), based on the multi-year hindcast results. This limitation is \(\alpha \le BIA\left({\mathrm{P}}_{c}\right)\le \beta \), where α and β are empirical coefficients selected manually according to the BIAs of the forecasts using different probabilistic thresholds from each model (e.g. Fig. 1 for ≥ 1 mm; Fig. S1 for ≥ 5 mm in the supplementary material). In this study, α and β are first tuned so that the DEFOPT performs well in the climatology, and are then used to examine the hindcasts outside this period.
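The screening in this first step can be illustrated with a minimal Python sketch for a single grid point. It assumes the standard frequency-bias definition (number of forecast events divided by number of observed events); the definition actually used in this study is the one in Appendix 1. The function names, the toy data, and the values α = 0.7 and β = 1.3 are hypothetical and are not taken from Table 2.

```python
def frequency_bias(forecast_yes, observed_yes):
    """Frequency bias: forecast event count / observed event count.

    Assumes the standard definition (hits + false alarms) / (hits + misses).
    Returns NaN when no event was observed.
    """
    n_forecast = sum(forecast_yes)
    n_observed = sum(observed_yes)
    if n_observed == 0:
        return float("nan")
    return n_forecast / n_observed


def admissible_thresholds(member_counts, observed_yes, n_members, alpha, beta):
    """Return member-count thresholds k (1..n_members) with alpha <= BIA <= beta.

    member_counts[t] is the number of ensemble members forecasting rain on
    hindcast day t; a threshold k forecasts "rain" when at least k members do.
    """
    kept = []
    for k in range(1, n_members + 1):
        forecast_yes = [count >= k for count in member_counts]
        bia = frequency_bias(forecast_yes, observed_yes)
        if alpha <= bia <= beta:  # a NaN bias fails both comparisons
            kept.append(k)
    return kept


# Toy hindcast at one grid point: 10 days, 5 ensemble members (hypothetical data).
counts = [0, 1, 2, 3, 4, 5, 5, 2, 1, 0]
observed = [False, False, True, True, True, True, True, False, False, False]
print(admissible_thresholds(counts, observed, n_members=5, alpha=0.7, beta=1.3))
```

In this toy case the lowest threshold (1 of 5 members) forecasts rain too often and the highest thresholds too rarely, so only the intermediate thresholds pass the screening.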

Fig. 1

Temporal variation of the BIA averaged over each grid for the ≥ 1 mm precipitation in summers from 1999 to 2010 based on the different probabilistic threshold forecasts of the S2S multiple models. N is the total number of ensemble members from each model, and the numbers in the legend indicate the numbers of ensemble members predicting the occurrence of rainfall event. The BIA > 1.0 (BIA < 1.0) means overestimation (underestimation) of precipitation frequency

Second, within the reasonable range of the forecasting frequency bias, the optimal probabilistic forecasting threshold Pthreshold is defined when the skill of the daily precipitation prediction is highest during 12 years at each grid point. Here, the skill is examined by using Equitable Threat Score (ETS; Schaefer 1990; see Appendix 1 for details). The calculation formula is as follows.

$$ {\mathrm{P}}_{\mathrm{threshold}} = \begin{cases} {\mathrm{P}}_{\mathrm{min}}, & BIA\left({\mathrm{P}}_{\mathrm{min}}\right) \le \beta \\ {\mathrm{P}}_{c}, & ETS\left({\mathrm{P}}_{c}\right) = \max\left\{ ETS\left({\mathrm{P}}_{\mathrm{min}}\right), \ldots, ETS\left({\mathrm{P}}_{c}\right), \ldots, ETS\left({\mathrm{P}}_{\mathrm{max}}\right) \right\} \\ {\mathrm{P}}_{\mathrm{max}}, & BIA\left({\mathrm{P}}_{\mathrm{max}}\right) \ge \alpha \end{cases} $$
(1)

In the equation, \({\mathrm{P}}_{\mathrm{min}}\) and \({\mathrm{P}}_{\mathrm{max}}\) are the minimum and maximum values of \({\mathrm{P}}_{\mathrm{threshold}}\), respectively (i.e. \({\mathrm{P}}_{\mathrm{min}} \le {\mathrm{P}}_{c} \le {\mathrm{P}}_{\mathrm{max}}\)), when the \(BIA({\mathrm{P}}_{c})\) scores are within the reasonable range. After calculating the \({\mathrm{P}}_{\mathrm{threshold}}\) at each grid point, the spatio-temporal distribution of \({\mathrm{P}}_{\mathrm{threshold}}\) in different regions can be obtained.
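As an illustration of the central branch of Eq. (1), the sketch below picks, from a list of candidate thresholds that is assumed to have already passed the BIA screening, the one with the highest ETS at a single grid point. It uses the standard ETS definition with a random-hits correction; the form used in this study is given in Appendix 1. The function names and toy data are hypothetical, and the boundary branches of Eq. (1) are not reproduced here.

```python
def contingency(forecast_yes, observed_yes):
    """Count hits, misses, false alarms and correct rejections."""
    hits = misses = false_alarms = correct_rejections = 0
    for f, o in zip(forecast_yes, observed_yes):
        if f and o:
            hits += 1
        elif o:
            misses += 1
        elif f:
            false_alarms += 1
        else:
            correct_rejections += 1
    return hits, misses, false_alarms, correct_rejections


def ets(hits, misses, false_alarms, correct_rejections):
    """Equitable Threat Score (standard form with random-hits correction)."""
    total = hits + misses + false_alarms + correct_rejections
    if total == 0:
        return 0.0
    hits_random = (hits + misses) * (hits + false_alarms) / total
    denom = hits + misses + false_alarms - hits_random
    return (hits - hits_random) / denom if denom else 0.0


def optimal_threshold(member_counts, observed_yes, candidates):
    """Return (k, ETS) for the candidate member-count threshold with highest ETS.

    Ties keep the lowest threshold; an empty candidate list gives (None, -inf).
    """
    best_k, best_score = None, float("-inf")
    for k in candidates:
        forecast_yes = [count >= k for count in member_counts]
        score = ets(*contingency(forecast_yes, observed_yes))
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score


# Toy hindcast at one grid point: 10 days, 5 ensemble members (hypothetical data).
counts = [0, 1, 3, 3, 4, 5, 5, 2, 1, 0]
observed = [False, False, True, True, True, True, True, False, False, False]
best_k, best_ets = optimal_threshold(counts, observed, candidates=[2, 3])
print(best_k, best_ets, best_k / 5)  # the last value is the threshold as a fraction of members
```

Repeating this selection at every grid point and lead time would give a threshold map of the kind described in the text.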

Third, according to the spatio-temporal distribution characteristics of the optimal probabilistic threshold, the threshold of the credible ensemble number (Nthreshold) is selected. Nthreshold = Pthreshold × n, where n is the total number of ensemble members. Then, whether the forecasted rainfall event occurs or not is redefined, that is, the forecasted rainfall event occurs when the number of ensemble members that predict the rainfall event (N) is greater than or equal to the Nthreshold. Otherwise, the forecasted rainfall event does not occur. The formulas are as follows.

$${\mathrm{A}}_{\mathrm{ DEFOPT}}={\mathrm{A}}_{\mathrm{threshold}}\times \phi $$
(2)
$$\phi =\left\{\begin{array}{c}1, \, N\ge {\mathrm{N}}_{\mathrm{ threshold}}\\ \\ 0, \, N<{\mathrm{N}}_{\mathrm{ threshold}}\end{array}\right.$$
(3)

In the equation, Athreshold is the amount of rainfall at a certain threshold (for example, ≥ 1 mm). ADEFOPT is the final result of ensemble forecasting at this threshold. ϕ indicates whether the precipitation event occurs (1) or not (0). If N is greater than or equal to Nthreshold, \(\phi \) is 1. Otherwise, \(\phi \) is 0.
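The decision rule of Eqs. (2) and (3) can be sketched for one grid point and one forecast day as follows. The function name and the member values are hypothetical. Comparing the integer member count directly with the real-valued product Pthreshold × n, without rounding, is an assumption of this sketch, since the text does not state how non-integer values of Nthreshold are handled.

```python
def defopt_decision(member_rain_mm, p_threshold, rain_threshold_mm=1.0):
    """Return (phi, n_yes, n_threshold) for one grid point and one forecast day.

    member_rain_mm: daily precipitation forecast by each ensemble member (mm).
    p_threshold: optimal probabilistic threshold as a fraction between 0 and 1.
    phi is 1 when at least n_threshold members forecast the event, otherwise 0.
    """
    n_members = len(member_rain_mm)
    n_threshold = p_threshold * n_members  # N_threshold = P_threshold x n
    n_yes = sum(1 for rain in member_rain_mm if rain >= rain_threshold_mm)
    phi = 1 if n_yes >= n_threshold else 0
    return phi, n_yes, n_threshold


# Five hypothetical members; two of them forecast at least 1 mm.
members = [0.0, 0.2, 1.5, 3.0, 0.8]
print(defopt_decision(members, p_threshold=0.4)[0])  # event forecast where the threshold is 40%
print(defopt_decision(members, p_threshold=0.6)[0])  # no event where the threshold is 60%
```

The same ensemble output therefore yields a different deterministic forecast depending on the regional value of the optimal threshold, which is the difference from using one constant threshold everywhere.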

To evaluate the benefits of the DEFOPT method, the temporal variations of the BIA (averaged over the rainfall events in the re-forecast length at each grid point) for the ≥ 1 mm and ≥ 5 mm precipitation in summers from 1999 to 2010 over China are shown based on the Pc of the nine S2S models (Fig. 1; Fig. S1). For the ≥ 1 mm rainfall, the deviation between the high-threshold (e.g. 11 ensemble members for ECMWF) and low-threshold (e.g. 1 ensemble member) probabilistic forecasts of the S2S models increases rapidly within 10 days and tends to be steady after 10 days. The forecast with the low probabilistic threshold significantly overestimates the observed rainfall frequency (BIA > 1.0), whereas the forecast with the high probabilistic threshold significantly underestimates it (BIA < 1.0). The range of the corresponding BIA is significantly larger than that in the early stage of the forecast. The ranges of the BIAs of the ECMWF, NCEP, CMA, JMA, and ECCC models increase from 0.5–1.5 to 0.2–2.0. The ranges change from 1.0–1.5 to 0.3–2.0 for the UKMO and KMA models, increase from 1.3–1.7 to 0.1–3.0 for the CNRM model, and change from 0.1–1.7 to 0.0–2.0 for the BoM model. It is similar for the ≥ 5 mm precipitation, where the corresponding BIA begins to deviate from one standard deviation after 5 days (Fig. S1). Based on the temporal variation characteristics of the BIA, the limitation of the BIA for the Pc of each S2S model is given to avoid excessive overestimation or underestimation: \(\alpha \le BIA\left({\mathrm{P}}_{c}\right)\le \beta \). The values of the empirical coefficients α and β are shown in Table 2; the variability of α and β with lead time is not considered in this study.

Table 2 Empirical coefficients α and β for precipitation events with different thresholds forecasted by the S2S models

Within the proper range of the BIAs, the spatio-temporal distribution of the Pthreshold with the highest ETS for the ≥ 1 mm precipitation prediction is calculated according to the S2S re-forecast data of daily precipitation in summers during 1999–2010 in China (Fig. 2). Figure 2a shows the average spatial distribution of the Pthreshold from the 11th to the 15th day for the ≥ 1 mm rainfall in summer over China predicted by the ECMWF model with eleven ensemble members. The Pthreshold is within 30%–40% in most areas of northern China, central China, and eastern China, and within 50%–70% in some areas of southern China, southwestern China, and the southern part of the Qinghai-Tibet Plateau. However, the Pthreshold in the arid and semi-arid areas in northwestern China is about 20%. The results of the JMA model with five ensemble members are similar to those of the ECMWF model, but the Pthreshold is slightly higher in northeastern China and the lower reaches of the Yangtze River, where it is about 50%–60% (Fig. 2d). For the CMA, NCEP, and CNRM models, the Pthreshold is above 50% in the areas to the east of 110°E and exceeds 70% in southern China and some parts of northeastern China. It is within 20%–30% in the arid and semi-arid areas in northwestern China and the upper reaches of the Yangtze River (Fig. 2b, c, and e). For the UKMO and KMA models containing three ensemble members, the Pthreshold stays at about 70% in general (Fig. 2h, i). The Pthreshold of the ECCC model is around 60% in most areas of China, but relatively low along the Yangtze River (about 40%, Fig. 2g). The Pthreshold of the BoM model with 33 ensemble members is within 10%–20% in the arid and semi-arid areas and central China, which is significantly lower than that of the other models (Fig. 2f). Figure 3 further shows the average spatial distribution of the Pthreshold from the 26th to the 30th day for the S2S models. 
It shows that the spatial variation of the Pthreshold for each model is similar in the different pentads. In addition, the average spatial distributions of the Pthreshold in other pentads within 30 days based on the S2S models are compared and analyzed (Fig. S2–S5). The results also show that the spatial distributions of Pthreshold in each model are similar on the sub-seasonal scale (6–30 days) with only a 10% difference between each pentad. Here, in order to display the main tendency of Pthreshold and filter the high-frequency information, we do not show the optimal probabilistic threshold day by day.

Fig. 2

Average spatio-temporal distribution of the Pthreshold with the highest ETS of the ≥ 1 mm precipitation from the 11th to the 15th day (the third pentad) calculated by using the historical hindcast data of daily precipitation in summers from 1999 to 2010 in China based on the S2S models. a–i Results of the ECMWF, NCEP, CMA, JMA, CNRM, BoM, UKMO, ECCC, and KMA models, respectively

Fig. 3

Same as Fig. 2, but for the 26th to the 30th day

The pentad spatial distributions of the Pthreshold of the ≥ 5 mm rainfall within 30 days are further analyzed. The 3rd pentad-averaged Pthreshold for the ECMWF model is within 20%–40% in most areas of China, but within 50%–70% in part of the Qinghai-Tibet Plateau (Fig. 4a). For the NCEP, UKMO and KMA models, the Pthreshold is above 50% in southern China and some parts of northeastern China, and close to 80%–90% over the northern Qinghai-Tibet Plateau (Fig. 4b, h and i). The results of the CMA and JMA are similar to those of the above three models, but the Pthreshold is lower than 50% in southern China (Fig. 4c, d). For the CNRM and ECCC models, the Pthreshold is generally 30%–40% in the areas to the east of 110°E and exceeds 50% in the other areas (Fig. 4e, g). The Pthreshold of the BoM model is generally lower than that of the other models over China, especially in central China and the arid and semi-arid areas, where it is 10%–20% (Fig. 4f). For each S2S model, the spatial distributions of the Pthreshold of the ≥ 5 mm rainfall in the 3rd pentad are also similar to those in the other pentads during 6–30 days (Fig. S6–S10).

Fig. 4

Same as Fig. 2, but for the ≥ 5 mm rainfall

Therefore, the spatio-temporal distribution characteristics of the Pthreshold enable us to select the credible ensemble members by using the Pthreshold calculated from numerous hindcast results. Then, the DEFOPT ensemble forecast can be constructed to carry out the deterministic forecasts for the precipitation events with different intensities, for example, 1–5 mm. In contrast to the DEFOPT, the DEFPT method proposed in our previous work (Jie et al. 2014) predicts the occurrence (“yes” or “no”) of a rainfall event with a given intensity only by judging whether or not the forecast probability exceeds a constant threshold without spatio-temporal variability. To demonstrate the added value of the DEFOPT, we therefore compare the DEFOPT (spatio-temporally variable threshold) with the DEFPT (constant threshold), as well as with the deterministic forecast from the control run (CTL) and the classical ensemble mean (ENS mean).

3 Verification and evaluation of the DEFOPT method

Based on the spatio-temporal variation characteristics of the Pthreshold of the ≥ 1 mm precipitation during 1999–2010 in the S2S models, the DEFOPT method is applied to each S2S model to predict the ≥ 1 mm daily rainfall in the summers of 1999–2010 over China. The quantitative objective evaluation results from the ETS and Hanssen-Kuipers scores (abbreviated as HK; Hanssen and Kuipers 1965; see Appendix 1 for details) indicate that the DEFOPT outperforms the CTL and the ENS mean at lead times of 0–30 days in each S2S model, and is also better than the majority of forecasts using the different probabilistic thresholds in forecasting the ≥ 1 mm rainfall in the NCEP, CMA, JMA, BoM, UKMO, ECCC and KMA models. The exceptions are the ECMWF and CNRM models, where the DEFOPT is not better than the forecasts produced with the 5/11 and 6/11 probabilistic thresholds (ECMWF) and the 10/15 and 11/15 thresholds (CNRM) after 10 days (Figs. 5 and 6). Generally, the corresponding ETS and HK scores increase by about 20% with the DEFOPT compared to the CTL and ENS mean methods. Meanwhile, the BIAs reveal that the frequency bias of the DEFOPT is smaller than that of the ENS mean and of most of the forecasts using the probabilistic thresholds during the sub-seasonal range in each S2S model, although the CTL is better than the DEFOPT for many models (Fig. 7). The DEFOPT’s BIA scores are not far from 1.0, with values of approximately 1.3.

Fig. 5

The ETS of the ≥ 1 mm daily precipitation in the summers of 1999–2010 in China forecasted by the S2S models at lead times of 0–30 days. The black solid line, colored markers, blue solid line, and red solid line represent the results of the control run, the forecasts using different probabilistic thresholds, the ensemble mean and the DEFOPT, respectively. N is the total number of ensemble members from each model, and the numbers in the legend indicate the numbers of ensemble members predicting the occurrence of the rainfall event. The higher the ETS, the better the prediction

Fig. 6

Same as Fig. 5, but for the HK

Fig. 7

Same as Fig. 5, but for the BIAs. The long black dotted line is the standard of the BIA that equals 1.0. The BIA > 1.0 (BIA < 1.0) means overestimation (underestimation) of precipitation frequency

For the ≥ 5 mm rainfall, the skill of the DEFOPT for all S2S models is in general substantially higher than that of the CTL, the ENS mean and the forecasts using different probabilistic thresholds, as the corresponding ETS (Fig. 8) and HK (Fig. S11) are the highest, and the BIA is close to or slightly larger than that of the ENS mean (Fig. S12).

Fig. 8

Same as Fig. 5, but for the ≥ 5 mm

The DEFOPT method is further evaluated for the prediction of the frequencies of daily ≥ 1 mm and ≥ 5 mm rainfall events within each pentad and 10-day period in summer (Fig. 9). The Pearson correlation between the numbers of observed and forecasted ≥ 1 mm rainfall days in each pentad from each S2S model in summers during 1999–2010 shows that the ensemble forecasting skill of the DEFOPT (red solid line) is higher than that of the CTL (black solid line), the ENS mean (black dotted line), and the DEFPT using the same probabilistic threshold for the entire region (blue solid line). The corresponding correlation coefficients increase by about 0.1–0.2, 0.05–0.1, and 0.05, respectively. The predictions of the pentad frequency for the ≥ 5 mm rainfall events show that the DEFOPT method (marked red solid line) can improve the forecast skills within 30 days for each S2S model compared to the other ensemble forecasting methods. The improvement is particularly large relative to the CTL (marked black dotted line) and the ENS mean (marked blue solid line). The corresponding correlation coefficient increases by about 0.1–0.2 (0.05–0.1) compared to that of the CTL (ENS mean).
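The frequency verification described above can be sketched as follows: daily yes/no rain series are first converted into counts of rain days per pentad, and the Pearson correlation is then computed between the observed and forecast counts. This is a minimal illustration with hypothetical function names and toy numbers; how the samples are pooled over years, start dates and grid points in this study is not specified here and is left out of the sketch.

```python
import math


def rain_days_per_pentad(daily_yes, pentad_len=5):
    """Count rain days in consecutive, non-overlapping pentads.

    daily_yes is a sequence of booleans; an incomplete trailing pentad is dropped.
    """
    last_start = len(daily_yes) - pentad_len
    return [sum(daily_yes[i:i + pentad_len])
            for i in range(0, last_start + 1, pentad_len)]


def pearson(x, y):
    """Pearson correlation coefficient; NaN if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical rain-day counts for one lead pentad across five hindcast samples.
observed_counts = [1, 3, 2, 5, 4]
forecast_counts = [2, 3, 2, 4, 4]
print(round(pearson(observed_counts, forecast_counts), 3))
```

The same two functions apply to 10-day periods by setting `pentad_len=10`.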

Fig. 9

Temporal variation of the correlation coefficient between the observed and forecasted frequencies of days with ≥ 1 mm (dash colored lines) and ≥ 5 mm (the marked lines) precipitation by using the S2S multiple models in each pentad in summers during 1999–2010. The different color lines represent the results of the CTL, the ENS, the DEFPT (the most skillful forecast by using a probabilistic threshold), and the DEFOPT, respectively

Based on the maximum lead time provided by each S2S model (Table 1), the forecast skills for the frequency of the daily ≥ 1 mm and ≥ 5 mm precipitation in each period of ten days at a longer time range are evaluated (Fig. 10). The results show that the DEFOPT method (red solid line) can significantly improve the sub-seasonal to seasonal forecast skills of each S2S model, by about 0.1–0.2 compared to the CTL (black solid line), and also compared to the ENS mean (black dotted line) and DEFPT (blue solid line) methods. In addition, it is noted that the ensemble mean forecast skill for the CNRM model with fifteen ensemble members is lower than that of the CTL in the ≥ 1 mm rainfall prediction (Fig. 10e). This could be caused by the overestimation of the rainfall intensity by most ensemble members (the BIA score is high, Fig. 7e), which leads to a substantial increase of false forecasts of ≥ 1 mm rainfall events when using the ENS mean. For the ≥ 5 mm rainfall, the performance of the DEFOPT in each model is much better than for the ≥ 1 mm rainfall, with the correlations increasing by about 0.1–0.2 relative to all the other methods (marked lines).

Fig. 10

Same as Fig. 9, but for the correlation coefficient between the observed and forecasted frequency of precipitation events every ten days

In order to further verify the applicability of the DEFOPT method, the frequencies of the daily ≥ 1 mm and ≥ 5 mm precipitation in each period of ten days are evaluated during other re-forecast periods excluding 1999–2010 (Table 3). Except for the NCEP model, the other eight models have at least 8 years of samples for evaluation. For both the ≥ 1 mm and ≥ 5 mm rainfall, the DEFOPT is still better than the other methods in most S2S models (Fig. 11).

Table 3 The analyzed S2S models outside the period 1999–2010
Fig. 11

Same as Fig. 10, but for the other re-forecast periods excluding 1999–2010

We further analyzed the performance of the individual ensemble members chosen by the DEFOPT at the different lead times. Figure 12 shows the proportion of each ensemble member predicting the occurrence of a ≥ 1 mm rainfall event among all ensemble members chosen by the DEFOPT in the S2S models during the summers of 1999–2010 over China. It is clear that the proportions of the individual ensemble members in the ECMWF, NCEP, CMA, JMA, CNRM, KMA, UKMO and BoM models are similar or show only a slight fluctuation across the different lead times at 5-day intervals. This indicates that the ensemble members selected by the DEFOPT are random. However, the proportions of members 2 and 4 in the ECCC model are higher than those of members 1 and 3. It is possible that members 2 and 4 have systematic biases. A similar result can also be found for the ≥ 5 mm rainfall event (Fig. S13).

Fig. 12

The proportion of each ensemble member predicting the occurrence of a ≥ 1 mm rainfall event among the ensemble members chosen by the DEFOPT in the S2S models at the different lead times at 5-day intervals during the summers of 1999–2010 over China

4 Discussion and conclusion

The DEFOPT method is proposed to choose credible ensemble members for the sub-seasonal to seasonal prediction of precipitation in this paper. It uses the spatio-temporal distribution characteristics of the optimal probabilistic threshold which were proven to exist in the climatology as the standard to decide how many ensemble members should be trusted on the S2S scale. The optimal ensemble strategy is made for S2S precipitation prediction by following 3 steps:

  1. Based upon hindcasts covering a long period, exclude the probabilistic thresholds with large frequency biases by applying the limitation on the BIA score at each grid point.

  2. From the probabilistic thresholds remaining after step 1, find the most skillful one (with the highest ETS) at each grid point to generate a climatological spatio-temporal distribution of the optimal probabilistic threshold.

  3. In the real-time prediction, judge whether the number of ensemble members predicting the event is greater than or equal to the number required by the optimal probabilistic threshold, based upon the spatio-temporal distribution characteristics of the optimal probabilistic threshold from the climatology.

Here, all these steps are just part of the post-processing based on 12 years of hindcasts and are not part of the numerical model integration. Using Fortran codes on a regular UNIX workstation, the process of selecting the credible ensemble members (steps 1 and 2) takes about 2 min (wall-clock time) for a model with a horizontal resolution of 1.5° × 1.5° and ~ 10 ensemble members over China for 12 hindcast years, and generating an adjusted real-time forecast (step 3) takes about 30 s. Thus, the DEFOPT will not be computationally expensive in operational applications.

In this work, the quantitative objective evaluation scores ETS, HK and BIA, which are widely used to evaluate model precipitation forecasts (Accadia et al. 2010; Weusthoff et al. 2010), are selected. All these scores are constructed from the hits, false alarms and misses of rain event forecasts and the correct forecasts of no-rain events (correct rejections), as shown in Table 4. The ETS and HK scores focus on the prediction skill for rainfall and no-rainfall events, while the BIA score shows the overestimation or underestimation of the frequency of rainfall events. The evaluation results of the application of the DEFOPT method to the nine S2S operational models show that this methodology can substantially improve the forecast skill of precipitation events with 1 mm and 5 mm thresholds at the S2S scale in summer over China, and its skill is better than that of the CTL, the ENS mean, and the DEFPT as shown by the ETS and HK scores. Meanwhile, the frequency bias of the DEFOPT for the ≥ 1 mm precipitation is smaller than that of the ENS mean, although it is close to or slightly larger than that of the CTL. For the ≥ 5 mm precipitation, the DEFOPT frequency bias is generally greater than that of the ENS mean and the CTL, but is not far from them. The main reason for the improvements is that the DEFOPT can substantially increase the hit rates of ≥ 5 mm rainfall, although it slightly increases the false alarm rates in some S2S models (such as ECMWF, JMA, CNRM and BoM), as compared to the traditional ensemble mean method and the control run; it can also increase the hits or decrease the false alarms in comparison to the DEFPT using a uniform probabilistic threshold in most S2S models (Fig. 13). For low-intensity rainfall (e.g. ≥ 1 mm), the DEFOPT can substantially decrease the false alarms compared to the ENS mean, which shows not only high hits but also too many false alarms in most S2S models, and it is also closer to the top-left corner of the subplot than the DEFPT in each S2S model except the ECMWF (Fig. S14).
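For reference, the three scores can be written in terms of the four entries of the 2 × 2 contingency table. The forms below are the standard textbook definitions, written with symbols chosen here for illustration (H hits, M misses, F false alarms, C correct rejections, and H_r the number of hits expected by chance); the notation and exact forms used in this study are those of Appendix 1 and Table 4.

$$ BIA = \frac{H + F}{H + M}, \qquad ETS = \frac{H - H_{r}}{H + M + F - H_{r}}, \quad H_{r} = \frac{(H + M)(H + F)}{H + M + F + C}, \qquad HK = \frac{H}{H + M} - \frac{F}{F + C} $$

In this form, HK is the hit rate minus the false alarm rate, which are the two axes of a ROC diagram, so a point closer to the top-left corner of such a diagram corresponds to a higher HK.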

Table 4 2 × 2 matrix for a precipitation event with a certain threshold
Fig. 13

The averaged Relative Operating Characteristic (ROC; Jolliffe and Stephenson 2003) curves of the predictions for the fourth pentad-averaged ≥ 5 mm precipitation during the summer of 1999–2010 over China. The black square indicates the CTL, the blue square larger than other blue squares means the DEFPT using a given probabilistic threshold, and the colored circles are the ENS and DEFOPT, respectively. The N is the total number of ensemble members and the numbers in subplot are the thresholds of ensemble member number

Some ensemble reduction techniques (reduction by the ‘‘uncorrelation’’ method, by principal component analysis, etc.) have been proposed in the recent 10 years (Knutti 2010; Knutti et al. 2017; Riccio et al. 2012; Sanderson et al. 2015; Stein et al. 2015; Mendlik and Gobiet 2016; Dalelane et al. 2018) for weather forecasting, seasonal prediction or climate projection to make optimal use of the information inherent in the full ensemble. In contrast to these techniques, the DEFOPT does not essentially reduce the ensemble size or discard any ensemble member, but only makes use of the information of the optimal ensemble members to determine whether a forecasting event occurs or not in a given region.

It is notable that the calculation of the optimal probabilistic threshold in the DEFOPT may be slightly affected by the length of the hindcast period (for example, only 12 years in this study). In addition, it is found that the α and β in Eq. (1) are dependent on the model in the selection of the Pthreshold due to the different systematic forecasting biases in different models, and these coefficients could be allowed to change with lead time, which will be investigated in our future work. The DEFOPT may also have potential application value for the multi-model ensemble. When the ensemble mean from each model is considered as an individual ensemble member, this method could determine how many models can be trusted in different regions and lead times to achieve the best performance of S2S multi-model precipitation prediction. Moreover, only the DEFOPT predictions for the summer precipitation over China have been investigated in this study, and its application to other areas, other seasons or other variables (air temperature, anomaly of height, etc.) can be evaluated in the future.