How to choose credible ensemble members for the sub-seasonal to seasonal prediction of precipitation?

The sub-seasonal to seasonal (S2S) prediction of precipitation is not only a hot topic but also a challenge. The traditional ensemble mean and ensemble probabilistic forecast methods cannot avoid the uncertainty of the initial value in the S2S prediction. Is there a more suitable ensemble postprocessing method for the S2S prediction? In this study, the hindcast data during the 1999–2010 summers from nine operational models in the international S2S prediction project has been evaluated. Based on the quantitative objective precipitation evaluation methods, such as the Equitable Threat Score and frequency bias methods, the climatological spatio-temporal distribution of the optimal probabilistic threshold on the S2S scale is proven to exist, and it can be used as the standard to judge how many ensemble members are credible. Then, different ensemble forecast strategies are adopted in different regions to construct a Deterministic Ensemble Forecast using an Optimal Probabilistic Threshold (DEFOPT) method for precipitation prediction. The hindcast data of eight S2S models outside the period 1999–2010 are used to verify the applicability of the DEFOPT method by using the historical optimal probabilistic threshold during 1999–2010. The results show that the DEFOPT outperforms the deterministic forecast from one initial value, the ensemble mean, and the deterministic ensemble forecast using a probabilistic threshold for the occurrence days of rainfall at the 1 mm and 5 mm thresholds (≥ 1 mm and ≥ 5 mm) over China during each pentad in most S2S models.


Introduction
Meteorological disasters caused by extreme weather and extreme climate events have become one of the major problems faced by human beings since the twentieth century. For example, the flood disaster in the Yangtze-Huaihe River Basin of China in the summer of 1998 and the freezing rain and snow disaster in southern China in the winter of 2008 (National Climate Center 1998; Li and Gu 2010) led to great losses in national economic and social development. Currently, numerical models have made great progress in short-term and medium-range weather forecasts and in forecasts at the seasonal scale and longer. However, there is still a gap between the 2-week forecasts and seasonal forecasts. Therefore, the 15-60 day sub-seasonal to seasonal (S2S) prediction is receiving increasing attention worldwide (Zhou et al. 2019).
In 2015, the World Weather Research Program and the World Climate Research Program jointly launched an international S2S prediction project. This project aims to enhance the 15-60 day prediction skills, improve the overall understanding of high-impact weather events, such as the tropical low-frequency Madden-Julian Oscillation, monsoon, and extreme precipitation, and promote the relevant research carried out by international operational forecast centers and institutions. At present, the operational forecast centers in 11 countries have participated in the S2S project and released a large number of historical hindcast and real-time forecast data of the models (http://www.s2sprediction.net/). The China Meteorological Administration (CMA) also participated in the S2S project, submitted the experiment data based on the Beijing Climate Center Climate System Model (BCC_CSM1.2) of the National Climate Center, and undertook the task of the Asian database center for the S2S project. As one of the important issues in this S2S project, the sub-seasonal prediction (15-60 day) of precipitation has attracted wide attention (Ebert 2001). However, the understanding of the predictability and prediction methods of precipitation is limited.
Nowadays, numerical weather models perform well in daily forecasts of geopotential height, air temperature, and precipitation at lead times of about a week (Saha and Van den Dool 1988; Qin and Van den Dool 1996; Buizza 2008; Schmeits and Kok 2010). By considering the interactions of the atmosphere with the ocean, land surface, and sea ice, climate models can predict the mean and variability of meteorological factors at the seasonal scale and longer (Collins and Coauthors 2006; Wu et al. 2013). However, due to the chaos in the atmosphere (Lorenz 1963, 1982; Chou 1989; Hoffman 2002), the inevitable initial value errors and model errors lead to forecast biases in the weather and climate models. Thus, the ensemble mean and probabilistic forecast based on multiple initial values or multiple models are usually carried out to represent the forecast uncertainty caused by one initial value and model errors (Gneiting and Raftery 2005). In the last 20 years, ensemble forecasting methods such as the Monte Carlo forecast method (Leith 1974), the Time-Lagged Average Forecast method (LAF; Hoffman and Kalnay 1983), the breeding of growing modes method (Toth and Kalnay 1993, 1997), the singular vectors method (Molteni et al. 1996), the ensemble Kalman filter method (Houtekamer and Mitchell 1998), the stochastically perturbed parameterization tendencies method (Buizza et al. 1999a, b), the multi-model ensemble prediction method (Fritsch et al. 2000), and the machine learning approach (Hwang et al. 2019) have gradually become important tools to improve the skill of weather forecasts, S2S prediction, and even long-term climate change simulation in the national operational centers, which include the National Centers for Environmental Prediction (NCEP), the European Centre for Medium-Range Weather Forecasts (ECMWF), the United Kingdom Met Office (UKMO), the Japan Meteorological Agency (JMA), and the China Meteorological Administration (CMA) (Sivillo et al. 1997; Moore and Kleeman 1998; Krishnamurti et al. 2000; Yang 2001; Buizza 2019; Zhang et al. 2021).
For the prediction from day 7 to 60, previous studies have discussed the influence of ensemble forecasting methods on the prediction skill of geopotential height, wind, and air temperature. These methods include the ensemble probabilistic prediction (Pan and Van den Dool 1998; Chessa and Lalaurette 2001), the conditional nonlinear optimal perturbation ensemble (defined as a kind of initial perturbation that makes the cost function attain its maximum under an initial constraint condition; Jiang et al. 2009), the weather type ensemble forecast (designed for post-processing the forecast output from ensemble prediction systems and for understanding how forecast models perform under different circulation types; Neal et al. 2016), and the predictability-based extended-range ensemble prediction (proposed for the predictable components and random components obtained with different ensemble prediction strategies; Zheng et al. 2012).
Other studies have analyzed and evaluated the impact of ensemble forecast methods on the sub-seasonal precipitation prediction skill (Hamill et al. 2004; Whitaker et al. 2006; Vitart and Molteni 2009; Jie et al. 2013; Bombardi et al. 2017; Liang and Lin 2018; Li et al. 2019). However, the improvement of weekly to sub-seasonal precipitation prediction is limited compared to other time-scale forecasts (Tan and Chen 2013; Jie et al. 2013), and the statistical ensemble postprocessing of precipitation is far more challenging than that of weather variables like surface temperature or wind speed (Scheuerer 2014). At present, neither the ensemble probabilistic forecast method nor the ensemble mean method is good enough for precipitation prediction from day 7 to the sub-seasonal scale, due to the excessive increase of the ensemble spread after 1 week (Jie et al. 2014). For the ensemble probabilistic forecast, a few of the ensemble probabilistic thresholds become less skillful with lead time (Buizza et al. 1999a, b; Hamill et al. 2008). In the late period of forecasting (long lead times), the precipitation is significantly underestimated (overestimated) by the forecast with a high (low) probabilistic threshold (Jie et al. 2014). For the ensemble mean forecast, the precipitation bias is more likely to be caused by the false extreme precipitation predicted by a certain ensemble member if the ensemble size is not large enough (Jie et al. 2014). Considering this, Jie et al. (2014) proposed a method of Deterministic Ensemble Forecast using a Probabilistic Threshold (DEFPT), which selects ensemble members through a certain ensemble probabilistic threshold. It can greatly improve the 6-15 day forecast skill for summer precipitation of different intensities in China, although the spatio-temporal variation of the probabilistic threshold is not considered and only the applicability of this method in a time-lagged ensemble system is verified. Meanwhile, it can avoid the influence of the false extreme value of the precipitation forecasted by a certain ensemble member on the ensemble forecast. However, to some extent, there are still deviations in the precipitation prediction when the DEFPT method uses the same probabilistic threshold for the ensemble members in different regions, which may be related to the different regional systematic forecast errors of the model.
In this work, the quantitative objective statistical methods are used to explore the spatio-temporal variation of the available probabilistic forecast information in the sub-seasonal forecast. The credible ensemble members (i.e. smaller biases and more skillful members) are selected, based on the spatio-temporal variation, and the optimal ensemble strategy is provided for different regions to be used in the S2S precipitation ensemble forecast. The applicability of the method in different S2S operational models is verified. This article is organized as follows: Model data and ensemble methods are introduced in Sect. 2; the verification and evaluation of the ensemble methods are provided in Sect. 3; the results are explored in Sect. 4; and Sect. 5 provides a summary and discussion.

Model data
In this study, the precipitation data from hindcast experiments from eleven operational prediction models in the S2S project are used (Table 1). All the data cover the period of 1999 to 2010, and are downloaded from http://www.s2sprediction.net/. Although the S2S models have different horizontal resolutions, each operational center uploaded model output to the S2S database archiving centers with a unified horizontal resolution of 1.5° × 1.5°, except the BoM model with a lower resolution of 2.5° × 2.5°. As the Institute of Atmospheric Sciences and Climate of the Italian National Research Council only provided a single re-forecast sample and there are some errors in the ensemble forecast data submitted by the Hydrometeorological Center of Russia (Jie et al. 2017), the ensemble forecasting methods are evaluated and analyzed based on only nine operational models. In this study, the observed rain-gauge data over China are interpolated to the corresponding horizontal resolution of each S2S model, and the daily accumulated precipitation from each S2S model is analyzed. In addition, eight models with a longer re-forecast length than the NCEP (1999-2010) are further examined (Table 3).

Ensemble methods
In this paper, the Deterministic Ensemble Forecast using an Optimal Probabilistic Threshold (DEFOPT) method is proposed for the S2S (15-60 days) precipitation prediction. The DEFOPT differs from the traditional probabilistic forecast method in that it does not predict the probability of a precipitation event at each grid point. Instead, it uses the available probabilistic forecasting information on the S2S scale to decide how many ensemble members predicting the occurrence of a rainfall event should be trusted, and then determines the optimal ensemble forecast in different regions. The details are as follows.
First, in order to avoid the excessive overestimation or underestimation in the ensemble probabilistic forecast of the rainfall event at a certain intensity (the precipitation with the threshold of 1 mm) at each grid point, the rainfall forecasting frequency bias for the probabilistic threshold P_c is limited by a quantitative objective evaluation method, the BIA score (see Appendix 1 for details), based on the multi-year hindcast results. This limitation is α ≤ BIA(P_c) ≤ β, where α and β are empirical coefficients artificially selected according to the BIAs of the forecasts using different probabilistic thresholds from each model (e.g. Fig. 1 for ≥ 1 mm; Fig. S1 for ≥ 5 mm in the supplementary material). In this study, α and β are first tuned to have good performances of the DEFOPT in the climatology, and then are used to examine the hindcasts outside this period.
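The frequency-bias screening of this first step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes the standard frequency bias, BIA = (hits + false alarms) / (hits + misses), since the paper's own definition sits in Appendix 1; it expresses the probabilistic threshold P_c as a member count k (P_c = k/n); and the function names and toy inputs are hypothetical.

```python
import numpy as np

def frequency_bias(forecast_yes, observed_yes):
    """Assumed standard frequency bias: number of forecast events / number of observed events."""
    n_forecast = np.count_nonzero(forecast_yes)
    n_observed = np.count_nonzero(observed_yes)
    return n_forecast / n_observed if n_observed > 0 else float("nan")

def admissible_member_counts(member_rain, observed_rain, alpha, beta):
    """Return the member counts k (P_c = k/n) whose BIA lies within [alpha, beta].

    member_rain   : bool array (time, member), True where a member predicts the event
    observed_rain : bool array (time,), True where the event was observed
    """
    member_rain = np.asarray(member_rain, dtype=bool)
    n_members = member_rain.shape[1]
    votes = member_rain.sum(axis=1)          # members predicting the event on each day
    keep = []
    for k in range(1, n_members + 1):
        bia = frequency_bias(votes >= k, observed_rain)
        if alpha <= bia <= beta:             # a NaN bias fails this test and is dropped
            keep.append(k)
    return keep

# Toy hindcast at one grid point: 4 days, 3 members (illustrative values only).
toy_members = np.array([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=bool)
toy_observed = np.array([1, 1, 0, 0], dtype=bool)
print(admissible_member_counts(toy_members, toy_observed, alpha=0.9, beta=1.6))  # [1, 2]
```

In the toy case, requiring all three members to agree gives BIA = 0.5, so that threshold is screened out before any skill score is computed.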
Second, within the reasonable range of the forecasting frequency bias, the optimal probabilistic forecasting threshold P_threshold is defined as the threshold for which the skill of the daily precipitation prediction is highest during the 12 years at each grid point. Here, the skill is measured by the Equitable Threat Score (ETS; Schaefer 1990; see Appendix 1 for details). The calculation formula is as follows.

$$P_{\mathrm{threshold}} = \underset{P_{\min} \le P_c \le P_{\max}}{\arg\max}\ \mathrm{ETS}(P_c)$$

In the equation, P_min and P_max are the minimum and maximum candidate values of P_threshold, respectively (i.e. P_min ≤ P_c ≤ P_max), for which the BIA(P_c) scores are within the reasonable range. After calculating the P_threshold at each grid point, the spatio-temporal distribution of P_threshold in different regions can be obtained.
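The selection of the most skillful threshold can be illustrated with a short sketch. It assumes the standard ETS definition, ETS = (hits - hits_random) / (hits + misses + false alarms - hits_random) with hits_random = (hits + misses)(hits + false alarms) / total; the paper's own formula is in Appendix 1. The function names, the member-count representation of P_c, and the toy inputs are hypothetical.

```python
import numpy as np

def ets(forecast_yes, observed_yes):
    """Equitable Threat Score from yes/no forecasts and observations (standard form, assumed)."""
    f = np.asarray(forecast_yes, dtype=bool)
    o = np.asarray(observed_yes, dtype=bool)
    hits = np.count_nonzero(f & o)
    false_alarms = np.count_nonzero(f & ~o)
    misses = np.count_nonzero(~f & o)
    hits_random = (hits + misses) * (hits + false_alarms) / f.size
    denom = hits + misses + false_alarms - hits_random
    return (hits - hits_random) / denom if denom != 0 else 0.0

def optimal_threshold(member_rain, observed_rain, admissible_counts):
    """Pick the admissible member count with the highest ETS and return it as a fraction k/n.

    admissible_counts is the list of counts that already passed the BIA screening;
    an empty list raises ValueError, which a real workflow would need to handle.
    """
    member_rain = np.asarray(member_rain, dtype=bool)
    n_members = member_rain.shape[1]
    votes = member_rain.sum(axis=1)
    best_k = max(admissible_counts, key=lambda k: ets(votes >= k, observed_rain))
    return best_k / n_members

# Toy hindcast at one grid point: 4 days, 3 members (illustrative values only).
toy_members = np.array([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=bool)
toy_observed = np.array([1, 1, 0, 0], dtype=bool)
p_thr = optimal_threshold(toy_members, toy_observed, admissible_counts=[1, 2])
```

Repeating this at every grid point and for each pentad would give a map of thresholds of the kind described in the text; how the daily scores are pooled into pentads is not specified here and is left out of the sketch.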
Third, according to the spatio-temporal distribution characteristics of the optimal probabilistic threshold, the threshold of the credible ensemble number (N_threshold) is selected: N_threshold = P_threshold × n, where n is the total number of ensemble members. Then, whether the forecasted rainfall event occurs or not is redefined, that is, the forecasted rainfall event occurs when the number of ensemble members that predict the rainfall event (N) is greater than or equal to N_threshold. Otherwise, the forecasted rainfall event does not occur. The formulas are as follows.
In the equation, A_threshold is the amount of rainfall at a certain threshold (for example, ≥ 1 mm). A_DEFOPT is the final result of ensemble forecasting at this threshold. ϕ indicates whether the precipitation event occurs (1) or not (0). If N is greater than or equal to N_threshold, ϕ is 1. Otherwise, ϕ is 0.
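The decision rule of this third step can be sketched in a few lines. This is only an illustration of the yes/no rule described in the text (ϕ = 1 when N ≥ N_threshold, with N_threshold = P_threshold × n); the function name `defopt_forecast`, the array layout, and the small floating-point tolerance are assumptions made for the example.

```python
import numpy as np

def defopt_forecast(member_precip, p_threshold, rain_threshold_mm=1.0):
    """Return phi (1 = event forecast, 0 = no event) at each grid point.

    member_precip : float array (member, lat, lon), daily precipitation in mm
    p_threshold   : float array (lat, lon), optimal probabilistic threshold as a fraction
    """
    member_precip = np.asarray(member_precip, dtype=float)
    n_members = member_precip.shape[0]
    n_yes = (member_precip >= rain_threshold_mm).sum(axis=0)    # N: members predicting the event
    n_threshold = np.asarray(p_threshold, dtype=float) * n_members  # N_threshold = P_threshold * n
    # The tolerance only guards against round-off in P_threshold * n (assumption of this sketch).
    return (n_yes >= n_threshold - 1e-9).astype(int)

# Toy case: 3 members on a 1 x 2 grid (illustrative values only).
toy_precip = np.array([[[2.0, 0.0]], [[0.5, 3.0]], [[1.0, 0.2]]])
phi = defopt_forecast(toy_precip, p_threshold=np.array([[2 / 3, 2 / 3]]))
print(phi)  # [[1 0]]
```

Passing a spatially constant `p_threshold` reduces the rule to a single threshold for the whole domain, which is how the DEFPT of Jie et al. (2014) is characterized in the text.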
To evaluate the benefits of the DEFOPT method, the temporal variation of the BIA (averaged over the rainfall events in the re-forecast length at each grid point) for the ≥ 1 mm and ≥ 5 mm precipitation in summers from 1999 to 2010 over China is shown based on the P_c of the nine S2S models (Fig. 1 and Fig. S1). For the ≥ 1 mm rainfall, the deviation between the high-threshold (e.g. 11 ensemble members for ECMWF) and low-threshold (e.g. 1 ensemble member) probabilistic forecasts of the S2S models increases rapidly within 10 days and tends to be steady after 10 days, and the forecast with the low (high) probabilistic threshold significantly overestimates (i.e. BIA > 1.0) [underestimates (i.e. BIA < 1.0)] the observed rainfall. The range of the corresponding BIA is significantly larger than that in the early stage of the forecast. The ranges of the BIAs of the ECMWF, NCEP, CMA, JMA, and ECCC models increase from 0.5-1.5 to 0.2-2.0. The ranges change from 1.0-1.5 to 0.3-2.0 for the UKMO and KMA models, increase from 1.3-1.7 to 0.1-3.0 for the CNRM model, and change from 0.1-1.7 to 0.0-2.0 for the BoM model. It is similar for the 5 mm precipitation, where the corresponding BIA begins to deviate from one standard deviation after 5 days (Fig. S1). Based on the temporal variation characteristics of the BIA, the limitation of the BIA for the P_c of each S2S model is given to avoid excessive overestimation or underestimation: α ≤ BIA(P_c) ≤ β. The values of the empirical coefficients α and β are shown in Table 2, but the variability of α and β with lead time is not considered in this study.
Within the proper range of the BIAs, the spatio-temporal distribution of the P_threshold with the highest ETS for the ≥ 1 mm precipitation prediction is calculated according to the S2S re-forecast data of daily precipitation in summers during 1999-2010 in China (Fig. 2). Figure 2a shows the average spatial distribution of the P_threshold from the 11th to the 15th day for the ≥ 1 mm rainfall in the summers of China predicted by the ECMWF model with eleven ensemble members. The P_threshold is within 30%-40% in most areas of northern China, central China, and eastern China, and within 50%-70% in some areas of southern China, southwestern China, and the southern part of the Qinghai-Tibet Plateau. However, the P_threshold in the arid and semi-arid areas in northwestern China is about 20%. The results of the JMA model with five ensemble members are similar to those of the ECMWF model, but the P_threshold is slightly higher in northeastern China and the lower reaches of the Yangtze River, where it is about 50%-60% (Fig. 2d). For the CMA, NCEP, and CNRM models, the P_threshold is above 50% in the areas to the east of 110°E and exceeds 70% in the southern and some parts of northeastern China. It is within 20%-30% in the arid and semi-arid areas in northwestern China and the upper reaches of the Yangtze River (Fig. 2b, c, and e).

Table 2 (fragment recovered from extraction; model column headers lost)
1 mm: 0.9-1.6, 0.9-1.6, 1.2-1.6, 0.9-1.6
5 mm: 1.5-3.0, 1.6-3.5, 1.3-3.5, 1.7-3.5
For the UKMO and KMA models containing three ensemble members, the P_threshold stays at about 70% in general (Fig. 2h, i). The P_threshold of the ECCC model is around 60% in most areas of China, but relatively low in the reaches of the Yangtze River (about 40%, Fig. 2g). The P_threshold of the BoM model with 33 ensemble members is within 10%-20% in the arid and semi-arid areas and central China, which is significantly lower than in the other models (Fig. 2f). Figure 3 further shows the average spatial distribution of the P_threshold from the 26th to the 30th day for the S2S models. It shows that the spatial variation of the P_threshold for each model is similar in the different pentads. In addition, the average spatial distributions of the P_threshold in other pentads within 30 days based on the S2S models are compared and analyzed (Fig. S2-S5). The results also show that the spatial distributions of P_threshold in each model are similar on the sub-seasonal scale (6-30 days), with only a 10% difference between pentads. Here, in order to display the main tendency of P_threshold and filter the high-frequency information, we do not show the optimal probabilistic threshold day by day.

The pentad spatial distributions of the P_threshold of the ≥ 5 mm rainfall within 30 days are further analyzed. The 3rd pentad-averaged P_threshold for the ECMWF model is within 20%-40% in most areas of China, but within 50%-70% in part of the Qinghai-Tibet Plateau (Fig. 4a). For the NCEP, UKMO and KMA models, the P_threshold is above 50% in the southern and some parts of northeastern China, and close to 80%-90% over the northern Qinghai-Tibet Plateau (Fig. 4b, h and i). The results of the CMA and JMA are similar to those of the above three models, but the P_threshold is lower than 50% in southern China (Fig. 4c, d). For the CNRM and ECCC models, the P_threshold is generally 30%-40% in the areas to the east of 110°E and exceeds 50% in the other areas (Fig. 4e, g). The P_threshold of the BoM model is generally lower than in the other models over China, especially in central China and the arid and semi-arid areas, where it is 10%-20% (Fig. 4f). For each S2S model, the spatial distributions of the P_threshold of the ≥ 5 mm rainfall in the 3rd pentad are also similar to those in the other pentads during 6-30 days (Fig. S6-S10).

Fig. 2 Average spatio-temporal distribution of the P_threshold with the highest ETS of the 1 mm precipitation from the 11th to the 15th day (the third pentad) calculated by using the historical hindcast data of daily precipitation in summers from 1999 to 2010 in China based on the S2S models. a-i Results of the ECMWF, NCEP, CMA, JMA, CNRM, BoM, UKMO, ECCC, and KMA models, respectively
Therefore, the spatio-temporal distribution characteristics of the P_threshold enable us to select the credible ensemble members by using the P_threshold calculated from numerous hindcast results. Then, the DEFOPT ensemble forecast can be constructed to carry out deterministic forecasts for precipitation events with different intensities, for example, 1-5 mm. In contrast to the DEFOPT, the DEFPT method proposed in our previous work (Jie et al. 2014) predicts the "yes" or "no" occurrence of a rainfall event with a given intensity only by judging whether or not the forecast probability exceeds a constant threshold without spatio-temporal variability. To demonstrate the added value of the DEFOPT, we therefore compare the DEFOPT (spatio-temporally variable threshold) with the DEFPT (constant threshold), as well as with the deterministic forecast from the control run (CTL) and the classical ensemble mean (ENS mean).

Verification and evaluation of the DEFOPT method
Based on the spatio-temporal variation characteristics of the P_threshold of the ≥ 1 mm precipitation during 1999-2010 in the S2S models, the DEFOPT method is applied to each S2S model to predict the ≥ 1 mm daily rainfall in the summers of 1999-2010 over China. The quantitative objective precipitation evaluation results from the ETS and Hanssen-Kuipers scores (abbreviated as HK; Hanssen and Kuipers 1965; see Appendix 1 for details) indicate that the DEFOPT outperforms the CTL and the ENS mean at lead times of 0-30 days in each S2S model. It is also better than the majority of forecasts using the different probabilistic thresholds in forecasting the ≥ 1 mm rainfall in the NCEP, CMA, JMA, BoM, UKMO, ECCC and KMA models, although the performance of the DEFOPT is not better than the forecasts produced by using the 5/11 and 6/11 probabilistic thresholds in the ECMWF model and 10/15 and 11/15 in the CNRM model after 10 days (Figs. 5 and 6). Generally, the corresponding ETS and HK scores increase by about 20% when using the DEFOPT compared to the CTL and ENS mean methods. Meanwhile, the BIAs reveal that the frequency bias of the DEFOPT is smaller than that of the ENS mean and most of the forecasts using the probabilistic thresholds during the sub-seasonal range in each S2S model, although the CTL is better than the DEFOPT for many models (Fig. 7). The DEFOPT's BIA scores are not far from 1.0, with values of approximately 1.3. For the ≥ 5 mm rainfall, the skill of the DEFOPT for all S2S models is in general substantially higher than that of the CTL, the ENS mean and the forecasts using different probabilistic thresholds, as the corresponding ETS (Fig. 8) and HK (Fig. S11) are highest, and the BIA is close to or slightly larger than that of the ENS mean (Fig. S12).

Fig. 3 Same as Fig. 2, but for the 26th to the 30th day
The DEFOPT method is further evaluated for the prediction of the frequencies of the daily ≥ 1 mm and ≥ 5 mm rainfall events within each pentad and 10-day period in summer (Fig. 9). The Pearson correlation between the numbers of observed and forecasted ≥ 1 mm rainfall days in each pentad from each S2S model in summers during 1999-2010 shows that the ensemble forecasting skill of the DEFOPT (red solid line) is higher than that of the CTL (black solid line), the ENS mean (black dotted line), and the DEFPT using the same probabilistic threshold for the entire region (blue solid line). The corresponding correlation coefficients increase by about 0.1-0.2, 0.05-0.1, and 0.05, respectively. The predictions of the pentad frequency for the ≥ 5 mm rainfall events show that the DEFOPT method (marked red solid line) can improve the forecast skills within 30 days for each S2S model compared to other ensemble forecasting methods. The improvement is particularly large relative to the CTL (marked black dotted line) and the ENS mean (marked ...

Based on the maximum lead time provided by each S2S model (Table 1), the forecast skills for the frequency of the daily ≥ 1 mm and ≥ 5 mm precipitation in each period of ten days at a longer time range are evaluated (Fig. 10). The results show that the DEFOPT method (red solid line) can significantly improve the sub-seasonal to seasonal forecast skills of each S2S model compared to the CTL (black solid line) by about 0.1-0.2, and compared to the ENS mean (black dotted line) and DEFPT (blue solid line) methods. In addition, the ensemble mean forecast skill for the CNRM model with fifteen ensemble members is lower than that of the CTL in the ≥ 1 mm rainfall prediction (Fig. 10e). This could be caused by the overestimation of the rainfall intensity by most ensemble members (the BIA score is high, Fig. 7e), which leads to a substantial increase in false forecasts of the ≥ 1 mm rainfall events when using the ENS mean. For the ≥ 5 mm rainfall, the performance of the ...

In order to further verify the applicability of the DEFOPT method, the frequencies of the daily ≥ 1 mm and ≥ 5 mm precipitation in each period of ten days are evaluated during other re-forecast periods excluding 1999-2010 (Table 3). Except for the NCEP model, the other eight models have at least 8 years of samples for evaluation. Whether for the ≥ 1 mm or the ≥ 5 mm rainfall, the DEFOPT is still better than the other methods in most S2S models (Fig. 11).

... members 2 and 4 in the ECCC are higher than members 1 and 3. It is possible that members 2 and 4 have systematic biases. A similar result can also be found for the ≥ 5 mm rainfall events (Fig. S13).

Fig. 5 The ETS of the ≥ 1 mm daily precipitation in the summers of 1999-2010 in China forecasted by the S2S models at lead times of 0-30 days. The black solid line, colored markers, blue solid line, and red solid line represent the results of the control run, the forecasts using different probabilistic thresholds, the ensemble mean and the DEFOPT, respectively. N is the total number of ensemble members from each model, and the numbers in the legend indicate the numbers of ensemble members predicting the occurrence of the rainfall event. The higher the ETS, the better the prediction

Fig. 9 Temporal variation of the correlation coefficient between the observed and forecasted frequencies of days with ≥ 1 mm (dashed colored lines) and ≥ 5 mm (the marked lines) precipitation by using the S2S multiple models in each pentad in summers during 1999-2010. The different colored lines represent the results of the CTL, the ENS, the DEFPT (the most skillful forecast by using a probabilistic threshold), and the DEFOPT, respectively

... Fig. 9, but for the correlation coefficient between the observed and forecasted frequency of precipitation events every ten days
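The pentad-frequency verification used in this section can be illustrated with a small sketch: count the rain days in each pentad of every hindcast, then correlate the forecast and observed counts across hindcasts. How the paper pools grid points, years and start dates is not stated in this section, so the sketch assumes a single grid point and correlates across hindcast start dates; the function name and toy arrays are hypothetical.

```python
import numpy as np

def pentad_correlation(forecast_yes, observed_yes, period_len=5):
    """Pearson correlation of rain-day counts per period, computed across hindcast starts.

    forecast_yes, observed_yes : arrays (start, lead_day) of 0/1 event flags
    Returns one correlation per complete period (a pentad when period_len=5).
    """
    f = np.asarray(forecast_yes, dtype=float)
    o = np.asarray(observed_yes, dtype=float)
    n_starts, n_days = f.shape
    n_periods = n_days // period_len
    shape = (n_starts, n_periods, period_len)
    f_count = f[:, : n_periods * period_len].reshape(shape).sum(axis=2)
    o_count = o[:, : n_periods * period_len].reshape(shape).sum(axis=2)
    # np.corrcoef returns the 2 x 2 correlation matrix; [0, 1] is the cross term.
    return np.array(
        [np.corrcoef(f_count[:, p], o_count[:, p])[0, 1] for p in range(n_periods)]
    )

# Toy data: 3 hindcast starts, 10 lead days (two pentads); illustrative values only.
toy_obs = np.array([
    [1, 0, 0, 0, 0, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 1, 1, 1, 1, 0],
])
r = pentad_correlation(toy_obs, toy_obs)   # a perfect forecast gives r = 1 in both pentads
```

Setting `period_len=10` gives the ten-day version of the same statistic. A pentad in which the counts do not vary across starts yields NaN, which a real evaluation would have to mask.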

Discussion and conclusion
The DEFOPT method is proposed in this paper to choose credible ensemble members for the sub-seasonal to seasonal prediction of precipitation. It uses the spatio-temporal distribution characteristics of the optimal probabilistic threshold, which were proven to exist in the climatology, as the standard to decide how many ensemble members should be trusted on the S2S scale. The optimal ensemble strategy for S2S precipitation prediction is built in three steps:
1. Based upon hindcasts covering a long period, exclude the probabilistic thresholds with large frequency biases by using the limitation of the BIA score at each grid point.
2. Find the most skillful probabilistic threshold (the one with the highest ETS) among the remaining probabilistic thresholds (after step 1) at each grid point to generate a climatological spatio-temporal distribution of the optimal probabilistic threshold.
3. Determine the number of skillful ensemble members in the real-time prediction by judging whether the number is greater than or equal to the optimal probabilistic threshold or not, based upon the spatio-temporal distribution characteristics of the optimal probabilistic threshold from the climatology.
Here, all these steps are just part of the post-processing based on 12 years of hindcasts and are not part of the numerical model integration. Using Fortran codes on a regular UNIX workstation, the process of selecting the credible ensemble members (steps 1 and 2) takes about 2 min (wall-clock time) for a model with a horizontal resolution of 1.5° × 1.5° and ~ 10 ensemble members over China during 12 hindcast years, and generating an adjusted real-time forecast (step 3) takes about 30 s. Thus, the DEFOPT will not be computationally expensive in operational applications.
In this work, the quantitative objective evaluation scores ETS, HK and BIA, which are widely used to evaluate model precipitation forecasts (Accadia et al. 2010; Weusthoff et al. 2010), are selected. All these scores are constructed from the hits, false alarms and misses of the rain event forecasts, and the correct rejections (accurate forecasts of no-rain events), as shown in Table 4. The ETS and HK scores focus on the prediction skill of rainfall and no-rainfall events, while the BIA score shows the overestimation or underestimation of the frequency of rainfall events. The evaluation results of the application of the DEFOPT method to the nine S2S operational models show that this methodology can substantially improve the forecast skill of precipitation events with 1 mm and 5 mm thresholds at the S2S scale in summer over China, and its skill is better than that of the CTL, the ENS mean, and the DEFPT as shown by the ETS and HK evaluation methods. Meanwhile, the frequency bias of the DEFOPT for the ≥ 1 mm precipitation is smaller than that of the ENS mean, although it is close to or slightly larger than that of the CTL. For the ≥ 5 mm precipitation, the DEFOPT frequency bias is generally greater than that of the ENS mean and the CTL, but is not far from them. The main reason for the improvements is that the DEFOPT can substantially increase the hit rates of ≥ 5 mm rainfall, although it slightly increases the false alarm rates in some S2S models (such as ECMWF, JMA, CNRM and BoM) as compared to the traditional ensemble mean method and the control run; it can also increase the hits or decrease the false alarms in comparison to the DEFPT using a uniform probabilistic threshold in most S2S models (Fig. 13). For the low-intensity rainfall (e.g. ≥ 1 mm), the DEFOPT can substantially decrease the false alarms compared to the ENS, which shows not only high hits but also too many false alarms in most S2S models, and it is also closer to the top-left corner of the subplot than the DEFPT in each S2S model except the ECMWF (Fig. S14).
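The hit and false alarm comparison discussed above can be made concrete with a short sketch built on the four contingency-table counts. It assumes the usual definitions: hit rate = hits / (hits + misses), false alarm rate taken here as the probability of false detection, false alarms / (false alarms + correct rejections), and HK as their difference; the paper's own definitions are in Appendix 1 and Table 4. The function names and toy values are hypothetical.

```python
import numpy as np

def contingency(forecast_yes, observed_yes):
    """Return (hits, false_alarms, misses, correct_rejections) for yes/no series."""
    f = np.asarray(forecast_yes, dtype=bool)
    o = np.asarray(observed_yes, dtype=bool)
    hits = np.count_nonzero(f & o)
    false_alarms = np.count_nonzero(f & ~o)
    misses = np.count_nonzero(~f & o)
    correct_rejections = np.count_nonzero(~f & ~o)
    return hits, false_alarms, misses, correct_rejections

def hit_and_false_alarm_rate(forecast_yes, observed_yes):
    """Hit rate and false alarm rate (probability of false detection, assumed definition)."""
    h, fa, m, cr = contingency(forecast_yes, observed_yes)
    hit_rate = h / (h + m) if (h + m) > 0 else float("nan")
    false_alarm_rate = fa / (fa + cr) if (fa + cr) > 0 else float("nan")
    return hit_rate, false_alarm_rate

def hanssen_kuipers(forecast_yes, observed_yes):
    """HK score as hit rate minus false alarm rate (standard form, assumed)."""
    hit_rate, false_alarm_rate = hit_and_false_alarm_rate(forecast_yes, observed_yes)
    return hit_rate - false_alarm_rate

# Toy series of 4 days (illustrative values only).
toy_forecast = [1, 1, 1, 0]
toy_observed = [1, 1, 0, 0]
print(hit_and_false_alarm_rate(toy_forecast, toy_observed))  # (1.0, 0.5)
print(hanssen_kuipers(toy_forecast, toy_observed))           # 0.5
```

On a diagram with the false alarm rate on the horizontal axis and the hit rate on the vertical axis, a forecast closer to the top-left corner has a higher HK, which is one way to read the comparison between the DEFOPT and the DEFPT.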
In contrast to some ensemble reduction techniques of the recent 10 years (reduction by the "uncorrelation" method, by principal component analysis, etc.; Knutti 2010; Knutti et al. 2017; Riccio et al. 2012; Sanderson et al. 2015; Stein et al. 2015; Mendlik and Gobiet 2016; Dalelane et al. 2018), which are proposed for weather forecasting, seasonal prediction or climate projection to make optimal use of the information inherent in the full ensemble, the DEFOPT does not essentially reduce the ensemble size or discard any ensemble member, but only makes use of the information of the optimal ensemble members to determine whether a forecast event occurs or not in a given region.
It is notable that the calculation of the optimal probabilistic threshold in DEFOPT may be slightly affected by the length of the hindcast period (for example, only 12 years in this study). In addition, it is found that the α and β in Formula 3 are dependent on the model in the selection of the P_threshold due to the different systematic forecasting biases in different models, and these coefficients can be considered to change with lead time, which will be investigated in our future work. There may be some potential application values of the DEFOPT for the multi-model ensemble. When the ensemble mean from each model is

Table 3 (fragment recovered from extraction; the header row, the first model name and several cell values were lost, shown as ...)
... | 1995... | 1995... | 2011... | 9
NCEP | 1999-2010 | - | -
CMA | 1994-2014 | 1994... | 2011-2014 | 9
JMA | 1981-2010 | 1981... | 18
CNRM | 1993-2014 | 1993... | 2011-2014 | 10
BoM | 1981... | 1981... | 2011... | 21
UKMO | 1993... | 1993... | 2011... | 11
ECCC | 1995-2014 | 1995... | 2011-2014 | 8
KMA | 1991-2010 | 1991...

Author contributions
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by WJ, TW, FV, XL, YL, JY and HZ. The first draft of the manuscript was written by WJ and TW, and FV commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Data availability
The datasets generated during the current study are available in the S2S project database center (http://www.s2sprediction.net/).

Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.