1 Introduction

Seasonal Climate Forecasts (SCFs) provide a medium- to long-range outlook of changes in the Earth system over periods of a few weeks to several months, exploiting predictable changes in some of the slowly varying components of the system, such as ocean temperatures (Johnson et al. 2019). Skilful seasonal precipitation forecasts provide invaluable support to a wide range of sectors, including agriculture, construction, mining, hydrology, and water resources management (Falamarzi et al. 2023; Jin et al. 2022; Merryfield et al. 2020; the Centre for International Economics 2014; Tian et al. 2021). For example, site-specific information on monthly rainfall for the growing season, typically up to three months ahead, can help farmers make informed decisions about which crop types or varieties to plant (Weisheimer and Palmer 2014). For Australia alone, the Centre for International Economics (2014) estimated that the potential annual value added from SCFs is around A$1.6 billion (or 7.31%) for the agricultural sector. Data-driven forecast models have limited forecast skill for precipitation beyond one month (Vivas et al. 2023), so this study focuses on SCFs from process-based Global Circulation Models (GCMs). A single deterministic SCF cannot reflect the inherently chaotic nature of the climate system; climate service centres around the world therefore routinely provide ensemble SCFs with multiple forecast trajectories, generated from GCMs with perturbed initial conditions and/or model specifications, or from multiple GCMs (Hudson et al. 2017; Johnson et al. 2019; Kirtman et al. 2014; MacLachlan et al. 2015; Merryfield et al. 2020). Ensemble forecasts, which are also used in many other earth sciences such as hydrology (Troin et al. 2021; Tyralis and Papacharalampous 2021; Yilmaz et al. 2023), serve to enhance forecast skill.

Due to imperfect model structure and parameterisation, such as atmosphere–land and atmosphere–ocean coupling processes, GCMs usually do not simulate site-specific conditions well, and the predictability from atmospheric initial conditions is lost after a few weeks (Merryfield et al. 2020). Cost-effective statistical post-processing has therefore become standard practice to remove bias and improve forecast skill, providing fit-for-purpose information for localised decision-making (Monhart et al. 2018). Various post-processing models have been developed for long-term climate projections, SCFs, and in particular Numerical Weather Prediction (NWP) (Maraun and Widmann 2018; Vannitsem et al. 2021).

In contrast to continuous variables like temperature, post-processing precipitation poses greater challenges due to its sporadic nature, characterised by numerous zero values and occasional extreme values. There are two main categories of models for post-processing precipitation forecasts. The first category is nonparametric and does not assume a parametric distribution for precipitation. For example, a relatively simple but popular distribution-free method is Quantile Mapping (QM), which calibrates a forecast by matching the empirical (or parametric) distributions of forecasts and observations in a reference period (Cannon 2018; Michelangeli et al. 2009; Piani et al. 2010). To exploit the relationship between synoptic meteorology and local weather, analogue methods sample daily precipitation from historical days with similar atmospheric patterns (Shao and Li 2013; Vannitsem et al. 2021). Quantile-based methods focus on providing discretised quantile estimates instead of a whole distribution (Kokic et al. 2013). In a recent model, Extended Copula Post-Processing (ECPP), the daily precipitation total is treated as a left-censored variable, and its dependence structure with GCM precipitation and other possible atmospheric predictors is modelled through copulas (Li and Jin 2020). ECPP forecasts are used for predicting early-season yields in Australia (Jin et al. 2022). Nonparametric methods perform well when plenty of training data are available. Two representative nonparametric methods, QM and ECPP, are included in the performance comparison later in this paper: QM has been used for operational downscaling in Australia (Griffiths et al. 2023), and ECPP has demonstrated superior performance in site-specific post-processing in recent studies (Jin et al. 2023b; Li and Jin 2020; Li et al. 2020). The second category is parametric and assumes a parametric distribution for precipitation. Various distributions have been applied successfully, especially for NWP, including the censored Gaussian distribution (Schepen et al. 2018; Schlosser et al. 2019), a mixture of a point mass at zero and Gaussian distributions (Yumnam et al. 2022), a hurdle distribution with a probability mass at zero and a gamma distribution for positive precipitation amounts, as well as its mixtures (Fraley et al. 2010; Sloughter et al. 2007), the censored and shifted gamma (CSG) distribution (Baran and Nemoda 2016; Scheuerer and Hamill 2015), the censored logistic distribution (Wilks 2009), and censored generalised extreme value distributions (Scheuerer 2014). The parameters of these distributions, such as location, scale and/or shape, are linked with predictors from GCMs through various functional dependencies (Schepen et al. 2018; Schlosser et al. 2019; Vannitsem et al. 2021). Parametric approaches are stable with small samples, such as the limited retrospective forecast years available for SCFs, but if the assumed distribution is inappropriate, this category of post-processing models may degrade forecast skill. In this study, we use a mixture of parametric distributions, following Fraley et al. (2010) and Sloughter et al. (2007), to be more flexible for monthly precipitation forecasts at the seasonal scale.

Artificial intelligence, especially deep learning, has recently shown promise in downscaling long-range climate forecasts (Jin et al. 2023b; Vitart et al. 2022). These models often struggle to surpass the SCF benchmark, climatology (Vitart et al. 2022), or to outperform site-specific downscaling methods like ECPP for short forecast lead times (Jin et al. 2023b). Furthermore, they typically demand significant training resources, including data and computational time, even for a small number of locations (Jin et al. 2023a). This study does not aim to compare our model with them directly, as such a comparison lies outside the scope of our work. We also restrict the predictors in this study to precipitation forecasts from a single GCM, as most operational SCF systems are based on one GCM (e.g., Hudson et al. 2017; Johnson et al. 2019; MacLachlan et al. 2015), with a few exceptions (Kirtman et al. 2014). Additionally, other GCM predictors, such as sea surface temperature, may contribute little to precipitation forecasts in the context of SCFs (e.g., Li et al. 2020).

When post-processing an ensemble of precipitation forecasts, ensemble means or medians are commonly used (e.g., Li et al. 2020; Schepen et al. 2018; Scheuerer et al. 2020; Wang et al. 2019; Wilks 2009), as they extract reliable information from all the members. Sometimes the number of zero-precipitation members or the ensemble dispersion can improve prediction skill (Scheuerer 2014; Scheuerer et al. 2020). When the ensemble members from a GCM are distinguishable, such as when they are initialised from different sources in NWP (Baran and Nemoda 2016), all the ensemble members can be used in post-processing, as in Ensemble Model Output Statistics (EMOS) (Baran and Nemoda 2016). In addition, if only some ensemble members are distinguishable, the non-distinguishable members can be treated equally, as in Bayesian Model Averaging (BMA) for NWP, where they share the same parametric conditional distribution (Fraley et al. 2010). Most operational SCF systems, including the three tested in this study, generate ensemble members through perturbations of the initial conditions of a single GCM (Hudson et al. 2017; MacLachlan et al. 2015). These members are not climatologically distinguishable. Thus, most precipitation post-processing techniques for SCFs with a single GCM rely solely on ensemble medians (or means) (Li and Jin 2020; Li et al. 2020; Schepen et al. 2018), excluding valuable information from individual ensemble members that could help address precipitation forecast challenges.

In this study, our objective is to harness ensemble precipitation forecasts more effectively to enhance precipitation forecast skill. Instead of relying solely on ensemble medians, we use the entire ensemble to create a forecast distribution and generate quantiles from this distribution to serve as pseudo-ensemble members. Across forecasts made at different initialisation times for a location, these quantile-based pseudo-ensemble members are more distinguishable, each representing a unique perspective within the ensembles. Like ensemble medians for typical forecasts, quantiles such as 0.025 (or 0.975) offer reasonable estimates of the low (or high) end of the ensemble forecasts. We observe reasonably linear relationships between these quantiles and precipitation observations. Consequently, parametric distributions can be learned from these pseudo-ensemble members as well. We use a hurdle distribution with a point mass at zero for dry months and a gamma distribution for power-transformed positive precipitation amounts (Sloughter et al. 2007). These conditional distributions then compete, forming a composite predictive Probability Density Function (PDF) through BMA. Their weights in the mixture distribution are determined by posterior probabilities, which are proportional to the pseudo members' historical forecast performance in the training period. This mixture distribution is subsequently employed for forecasting based on new pseudo-ensemble members; we refer to this model as Quantile Ensemble BMA (QEBMA).

To ensure a fair comparison among post-processing methods and avoid the need to specify quantile estimation methods and the number of quantiles, this study demonstrates the performance of QEBMA using the same ensemble size as the raw forecasts; the pseudo-ensemble members are then obtained simply by sorting the raw ensemble members. As described in Sect. 2, QEBMA is verified and compared with five post-processing models on the retrospective forecast data sets of three GCMs at 32 locations from two case study regions in northeastern Australia. The locations span four different climate zones, and forecast lead times of 0 to 2 months are considered. The selected GCMs are GloSea5 from the United Kingdom, ECMWF from Europe and ACCESS-S1 from Australia, which differ in the number of ensemble members as well as spatial resolution. After the model development in Sect. 3, leave-one-month-out cross-validation results are given in Sect. 4. These results demonstrate that QEBMA can improve forecast skill in terms of relative bias, mean absolute error, reliability and overall ensemble forecast skill, measured by the Continuous Ranked Probability Score (CRPS), in comparison with raw forecasts and several existing post-processing techniques, including QM and ECPP using ensemble medians or pseudo-ensemble members. The improvements are often statistically significant for the three GCMs. Among these post-processing models, only QEBMA consistently outperforms the seasonal precipitation forecast benchmark, climatology. These comparison results demonstrate the potential of QEBMA, especially after the further development discussed in Sect. 5.

2 Case study regions and data

2.1 Case study regions

We use two regions, mainly in Queensland, Australia, for this study. As shown in Fig. 1, each region has 16 grid locations with integer values of latitude and longitude and covers about 160,000 square kilometres. The bottom-right region covers South-Eastern Queensland (see the top-right inset in Fig. 1) and about one-fourth of South Queensland; it includes both temperate and subtropical climate zone areas. The top-left region, central-west Queensland, mostly overlaps with Outback Queensland and includes both grassland and desert climate zone areas. We selected these two regions due to their economic significance in agriculture and mining, along with the presence of quality weather station observations necessary for accurate performance validation. Additionally, these locations encompass four distinct climate zones, with a broad spectrum of dry month proportions ranging from 1.3% to 48.7%, as depicted in Fig. 1.

Fig. 1

The 32 grid points, indicated by "+", for the two case study regions, mainly in Queensland, Australia. A grid point is referred to by its coordinates, e.g., Lat-24Lon145 for the location at latitude −24 and longitude 145 degrees, in later discussion and visualisation. The percentage of dry months for each location is shown above each "+". Dry months are defined as months with an average rainfall of less than 0.1 mm/day

We use SILO gridded data to validate the performance of our precipitation forecasts. The SILO data are provided on a regular grid covering a geographical area and are available free of charge from the Queensland Government, licensed under Creative Commons Attribution 4.0, at https://legacy.longpaddock.qld.gov.au/silo. SILO gridded daily data are constructed from in situ rainfall station records and are infilled to create a temporally complete record for all grid points at 0.05° resolution (Jeffrey et al. 2001). The monthly rainfall observations are aggregated from these daily data for each grid location. As mentioned above, the two study regions were chosen because they have quality rainfall stations, so the monthly gridded data are also of reasonable quality.

2.2 Monthly retrospective forecast data

We use retrospective forecast data of seasonal precipitation from three GCMs: GloSea5, ECMWF, and a calibrated version of ACCESS-S1 (ACCESSc for short hereafter). Each is based on a single GCM. The metadata of these three GCMs are listed in Table 1 and briefly described below.

Table 1 Three GCMs with different features used in the test and comparison

2.2.1 GloSea5 from UK Met Office

The UK retrospective forecast data are derived from the Met Office's Global Seasonal Forecast system version 5 (GloSea5) (MacLachlan et al. 2015). GloSea5 is an ensemble forecast system centred on the high-resolution UK Met Office climate prediction model, the HadGEM3 family atmosphere–ocean coupled climate model. Its notable improvements over version 4 include enhanced year-to-year prediction accuracy for major climate variability patterns. These enhancements are attributed to increased horizontal resolutions in both the atmosphere (N216, approximately 0.7°) and the ocean (0.25°), as well as the implementation of a 3D-Var assimilation system for ocean and sea-ice conditions. GloSea5 became operational in July 2013. GloSea5's monthly retrospective forecasts at 1-degree atmospheric resolution were downloaded from the Copernicus Climate Change Service (C3S) Climate Data Store (CDS), which provides public access to both re-forecast and forecast data.

The retrospective forecast period is from Feb 1993 to Dec 2016, as listed in Table 1. The GloSea5 data consist of a seven-member ensemble for each of four start dates every month (the 1st, 9th, 17th, and 25th), forming a 28-member ensemble for monthly forecasts. This ensemble approach can enhance the skill of GloSea5's retrospective raw forecasts and is still suitable for demonstrating the performance improvements achieved through post-processing methods.

2.2.2 ECMWF from Europe

The European Centre for Medium-Range Weather Forecasts (ECMWF) data are sourced from SEAS5, its fifth-generation seasonal forecast system, which became operational in Nov 2017 (Johnson et al. 2019). This state-of-the-art SCF system features several notable upgrades over its predecessor, SEAS4, including an improved ocean model (NEMO v3.4.1), a higher-resolution atmosphere model (cycle 43r1), and a new interactive sea-ice model (Johnson et al. 2019). The retrospective forecasts from ECMWF commence on the first day of each month, covering the years 1981 to 2015 and comprising 25 ensemble members (in contrast to the 51 members used in operational forecasts). Retrospective monthly precipitation forecasts were also downloaded from the C3S Climate Data Store (CDS) at a spatial resolution of 1 degree. ECMWF forecast products are typically adjusted to account for mean biases within the forecast system (MacLachlan et al. 2015).

The minimum value of the precipitation forecasts from ECMWF is −0.00253 mm/day. The ecCodes GRIB packing discretisation procedure introduces packing errors that increase with the range of values, and these packing errors may lead to negative values. To overcome this, the strategy used by ECMWF is to set all values in an accumulated field computed by subtraction that are less than a positive threshold to zero. A threshold such as 0.04 is recommended, as it allows for the multiplication of forecast values by up to ~25 in rare long-dry-spell cases. In our study, to reduce the possible influence of this strategy, we set the threshold to 0.003 mm/day.
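As a minimal illustration of this cleaning step, a Python sketch is given below (the study's implementation is in R; the function and array names are illustrative, and only the 0.003 mm/day threshold comes from the text):

```python
import numpy as np

def clean_ecmwf_precip(precip_mm_day, threshold=0.003):
    """Set values below the positive threshold (including the small negative
    values caused by GRIB packing errors) to exactly zero."""
    precip = np.asarray(precip_mm_day, dtype=float)
    return np.where(precip < threshold, 0.0, precip)

print(clean_ecmwf_precip([-0.00253, 0.001, 0.8]))  # -> [0.  0.  0.8]
```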

2.2.3 ACCESS-S1 with calibration

Monthly precipitation retrospective forecasts are obtained from the seasonal prediction version of the Australian Community Climate and Earth-System Simulator (ACCESS), version 1 (ACCESS-S1) (Hudson et al. 2017). ACCESS-S1 is a coupled general circulation model developed and tested by the Australian Bureau of Meteorology (BoM) in close collaboration with the UK Met Office, based on GloSea5. ACCESS-S1 produces daily forecasts of various atmospheric quantities, including precipitation, on grid points at a resolution of 0.6 degrees. To increase the variety of the testing environment, we use a calibrated version of ACCESS-S1 (termed ACCESSc hereafter) in this study. The calibration is carried out at the daily level with QM against the gridded Australian Water Availability Project (AWAP) climate datasets. The gridded AWAP data have a spatial resolution of 0.05 degrees and are obtained by interpolating rainfall station observations (Jones et al. 2009). The calibrated daily data are then averaged to the monthly level (Griffiths et al. 2023). The ACCESSc data have 11 ensemble members with lead times of 0–6 months, and the retrospective forecasts are available from 1990 to 2012. There are 48 initialisation dates for each year, of which we only use 12, i.e., the 1st day of each calendar month.

The retrospective forecast data windows in Table 1 provide representative re-forecast data for demonstrating the performance of the proposed model in this paper. The latest UK Met Office seasonal climate model, GloSea6, has the same re-forecast window, 1993 to 2016, as its predecessor GloSea5. While some seasonal climate forecast models, such as ACCESS-S2 (Griffiths et al. 2023), have re-forecast data for a few more recent years, up to 2018, those data are not yet publicly accessible. Similarly, the re-forecast data for ECMWF SEAS6, the successor of the current operational SEAS5, are anticipated to become available only in late 2024, according to its development plan.

3 Method

After a brief description of Bayesian Model Averaging (BMA), we develop a new post-processing model, Quantile Ensemble Bayesian Model Averaging (QEBMA), in this section.

3.1 Bayesian model averaging (BMA)

BMA was originally introduced as a mechanism to integrate predictions from multiple models while accounting for model uncertainty, resulting in posterior distributions for both model parameters and the models themselves (Fragoso et al. 2018). To produce weather predictions with less bias and higher skill, BMA has been extended to ensemble forecasts from multiple GCMs (Fraley et al. 2010; Sloughter et al. 2007).

In BMA for ensemble forecasting, each ensemble forecast member \(f_{k}\) \((k = 1, \cdots ,K)\), often from a GCM, is associated with a conditional PDF \(h_{k} \left( {y \mid f_{k} ,\theta_{k} } \right)\). It describes the distribution of precipitation \(y\) conditional on \(f_{k}\) being the best forecast among the ensemble members, with parameters \(\theta_{k}\). If there are multiple GCMs and each GCM has a set of forecasts, the ensemble median or mean from each GCM is normally used as an ensemble member \(f_{k}\). The BMA predictive PDF for the \(K\) ensemble members is expressed as

$$p\left( {y\;\left| {f_{1} ,\; \ldots ,\;f_{K} } \right.} \right) = \sum\limits_{k = 1}^{K} \; w_{k} \;h_{k} \left( {y\;\;\left| f \right._{k} ,\theta_{k} } \right)$$
(1)

where \(w_{k}\) is the posterior probability of forecast \(k\) being the most appropriate, based on forecast \(k\)'s relative forecast performance over a training data set. As probabilities, the \(w_{k}\)'s are nonnegative and sum to 1, i.e., \(\sum\limits_{k = 1}^{K} \; w_{k} = 1\). Before detailing a conditional distribution for monthly precipitation in Sect. 3.3, we first introduce QEBMA.

3.2 QEBMA based on pseudo ensemble forecast members

As discussed above, BMA assumes that the ensemble members are distinguishable from each other. Most climate centres in the world, including the three listed in Table 1, maintain and run only one GCM for seasonal forecasts. Their seasonal ensemble forecasts are generated mainly from perturbations to the initialisation and/or model parameterisation (Johnson et al. 2019) and cannot be regarded as distinguishable (as also illustrated by our results in Sect. 4).

Motivated by the use of ensemble medians, we extend to other quantiles derived from all forecast ensemble members to form a new pseudo-ensemble forecast. These pseudo-ensemble members are more distinguishable because they maintain a partial order across forecasts made on different initialisation dates (Johnson et al. 2019). For example, the pseudo member corresponding to the 0.90 quantile is never greater than that corresponding to the 0.95 quantile in any ensemble forecast. These quantiles can also reflect ensemble dispersion, which is often informative. In addition, the pseudo members have different relationships with observations. Relating such quantile members to positive precipitation observations, we can expect observations to vary more per unit change in a quantile corresponding to a smaller probability than in one corresponding to a larger probability (see, e.g., Fig. 2). In other words, a quantile member corresponding to a smaller probability would have a larger slope if observations were regressed against it. The dry months, indicated by red diamonds in Fig. 2, have the narrowest value range for the 0.034 quantile (the first pseudo-ensemble member for GloSea5): its median and maximum values are \(8.43 \times 10^{ - 3}\) and \(9.22 \times 10^{ - 2}\), respectively. The corresponding median (or maximum) values for the 0.517 and 0.966 quantiles are 0.169 (or 1.15) and 1.95 (or 7.01), respectively.

Fig. 2

Scatter plots of observations against four different pseudo-ensemble members, corresponding to four quantiles from the ensemble forecast distributions of GloSea5 for location Lat-24Lon145, near Blackall, Queensland. Black dots and red diamonds denote positive and zero observations, respectively. For easy comparison, the value ranges of the x-axis and y-axis are fixed across all four subfigures. Linear fits are shown for points with observations ≥ 0.1 mm/day. For ensemble size K = N = 28, these four quantiles correspond to pseudo-ensemble members 1, 15, 21, and 28, respectively, when sorted in ascending order

From the \(K\) members of an ensemble forecast from a GCM, we can generate a chosen number \(N\) of pseudo members for a given list of probabilities. \(N\) may or may not equal \(K\); the latter is useful when the operational forecasts have a different ensemble size from the re-forecasts, as for ECMWF (Johnson et al. 2019). We use \(N\) equally spaced probabilities to form \(N\) quantiles, denoted \(q_{1} ,q_{2} , \cdots ,q_{N}\). Taking these pseudo forecasts for BMA, we still assume that each pseudo forecast member \(q_{i}\) \((i = 1, \cdots ,N)\) from a seasonal climate model is associated with a conditional PDF \(h_{i} \left( {y \mid q_{i} ,\theta_{i} } \right)\), which models precipitation \(y\) conditional on \(q_{i}\) being the best forecast among the pseudo-ensemble members, with parameters \(\theta_{i}\). For simplicity and easy comparison, we use the same form of conditional PDF \(h_{i} \left( \cdot \right)\) for both BMA and QEBMA. The predictive PDF of QEBMA for the \(N\) pseudo members is expressed as

$$p\left( {y\;\left| {q_{1} ,\; \ldots ,\;q_{N} } \right.} \right) = \sum\limits_{i = 1}^{N} \; w_{i} h_{i} \left( {y\;\;\left| {q_{i} } \right.,\theta_{i} } \right)$$
(2)

where the posterior probabilities \(w_{i} \triangleq p\left( {h_{i} \left| {q_{1} ,\; \ldots ,\;q_{N} } \right.} \right)\;\) are nonnegative and add up to 1, i.e., \(\sum\limits_{i = 1}^{N} \; w_{i} = 1\). The contribution from each pseudo member \(q_{i}\), \(w_{i} h_{i} \left( {y\;\;\left| {q_{i} } \right.,\theta_{i} } \right)\), forms a component prediction. These symbols are listed and explained briefly in Table S1.
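To make the construction of the pseudo-ensemble members concrete, the following Python sketch forms the \(N\) quantiles from the raw members (the study's implementation is in R; the probability choice \(p_i = i/(N+1)\) is an assumption consistent with the 0.034, 0.517 and 0.966 quantiles quoted for \(K = N = 28\) in Fig. 2):

```python
import numpy as np

def pseudo_members(raw_ensemble, n_pseudo=None):
    """Turn raw ensemble forecasts (T months x K members) into quantile-based
    pseudo-ensemble members (T months x N members)."""
    raw_ensemble = np.asarray(raw_ensemble, dtype=float)
    T, K = raw_ensemble.shape
    N = K if n_pseudo is None else n_pseudo
    if N == K:
        # Same ensemble size: pseudo members are simply the sorted raw members.
        return np.sort(raw_ensemble, axis=1)
    probs = np.arange(1, N + 1) / (N + 1)          # N equally spaced probabilities
    return np.quantile(raw_ensemble, probs, axis=1).T

# Example: a 28-member GloSea5-like ensemble for 3 forecast months.
rng = np.random.default_rng(0)
q = pseudo_members(rng.gamma(2.0, 1.5, size=(3, 28)))
print(q.shape)                                     # (3, 28), with q_1 <= ... <= q_28
```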

3.3 Component probability distribution function for monthly precipitation

To eliminate the influence of the varying number of days in each month, we model the monthly average precipitation (in mm/day), i.e., the monthly total divided by the number of days in the month. For brevity, we still refer to it as monthly precipitation below.

For our SCF post-processing application, the dry-month proportion (months with average precipitation < 0.1 mm/day) ranges from almost 0 to as high as 48% (see Fig. 1), and can be even higher in some seasons or other areas of Australia. Left-censored PDF candidates such as the censored Gaussian and censored gamma would be less flexible over such a large range of dry-month proportions. As illustrated in Fig. 2, zero precipitation observations correspond to quite different values of the pseudo members, so left-censored PDFs might have difficulties in accumulating useful information from these quantile ensemble members for forecasting zero precipitation. We therefore model zero precipitation separately from positive precipitation amounts, as Sloughter et al. (2007) did for NWP, using the pseudo-ensemble members. Besides the issues related to dry months, the histograms of ensemble forecast medians for wet months (average precipitation ≥ 0.1 mm/day) are quite skewed (see examples in Figure S1). Thus, a flexible and skewed distribution such as the gamma distribution (see examples in Fig. 3 and Figure S3) is used in our component PDF.

Fig. 3

Predictive probability density functions (PDFs) and confidence intervals (CIs) from QEBMA at grid point Lat-24Lon145 for Aug 2010, with a 0-month forecast lead time, for a GloSea5, b ECMWF, c ACCESSc, and d the legend for these subfigures. Dashed orange vertical lines mark the observed value of 0.92 mm/day for August 2010. Bold black curves depict the predictive PDFs generated by QEBMA. Solid grey vertical lines denote forecast medians, while dotted grey vertical lines represent 80% forecast CIs. Beneath each predictive PDF curve, dotted colour curves show the component predictions from the pseudo-ensemble members. Notably, several component predictions for each GCM have nearly zero weights, aligning closely with the x-axis

To address heteroscedasticity in the positive precipitation values, we fit the gamma distribution to power-transformed precipitation instead of fitting it directly to the observations (Hamill et al. 2004; Sloughter et al. 2007). In an initial investigation, in which we examined power values ranging from 0.2 to 1 at distinct locations corresponding to the four climatic zones, a power of \(\frac{1}{3}\) often resulted in a better gamma fit (see comparison examples in Figure S2 and Figure S3). In addition, the cube-root transformation makes precipitation observations more homoscedastic, as illustrated in Fig. 2. With this transformation, monthly precipitation observations show an approximately linear relationship with the pseudo-ensemble forecasts, and these linear relationships have different intercepts and slopes for different pseudo members (see examples in Fig. 2). Thus, we work with the cube root of the monthly precipitation observations, \(z = y^{\frac{1}{3}}\). We have the following conditional PDF of monthly precipitation for Eq. (2), given that the pseudo forecast member \(q_{i}\) is the best forecast:

$$h_{i} \left( {z{\mid }q_{i} ,\theta_{i} } \right) = P\left( {z = 0{\mid }q_{i} } \right)I[z = 0] + P\left( {z > 0{\mid }q_{i} } \right)g_{i} \left( {z{\mid }q_{i} } \right)I[z > 0]$$
(3)

where the indicator function \(I\left[ \cdot \right]\) equals 1 if the condition in brackets is true and 0 otherwise. Equation (3) is a hurdle distribution with a point mass at zero for months without precipitation and a gamma distribution for positive amounts, similar to the daily precipitation model in Sloughter et al. (2007). Its two parts, the dry-month probability \(P\left( {z = 0{\mid }q_{i} } \right)\) and the distribution of positive precipitation \(g_{i} \left( {z{\mid }q_{i} } \right)\), are given below. Similar to Hamill et al. (2004) and Sloughter et al. (2007) for daily precipitation, the conditional dry-month probability for the binary event \(z = 0\) given \(q_{i}\) is modelled by logistic regression,

$${\text{logit}}\left( {P\left( {z = 0{\mid }q_{i} } \right)} \right) = \log \frac{{P\left( {z = 0{\mid }q_{i} } \right)}}{{P\left( {z > 0{\mid }q_{i} } \right)}} = a_{0i} + a_{1i} q_{i}^{\frac{1}{3}} + a_{2i} I\left[ {q_{i} = 0} \right]$$
(4)

The second predictor, also derived from the pseudo forecast member \(q_{i}\), indicates whether \(q_{i}\) is zero, which helps smooth the regression as \(q_{i}\) moves from non-zero to zero. Because a large pseudo-ensemble member is expected to correspond to a lower probability of zero precipitation, the parameter \(a_{1i}\) is expected to be non-positive. When dry months in the training data are either very rare or dominant, e.g., their proportion \(p_{0} < 0.025\) (or \(> 0.975\)), we impose \(P\left( {z = 0{\mid }q_{i} } \right) \equiv p_{0}\) so that the forecast dry-month probability matches that of the training period. This approach is akin to the QM method and mitigates the logistic regression fitting issues arising from unbalanced training data.
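A minimal Python sketch of this dry-month model for one pseudo member follows (the study's implementation is in R; statsmodels is used here purely for illustration, and the 0.025/0.975 override follows the rule above):

```python
import numpy as np
import statsmodels.api as sm
from scipy.special import logit, expit

def fit_dry_month_model(z, q_i, p_lo=0.025, p_hi=0.975):
    """Fit Eq. (4) for one pseudo member: logistic regression of the dry-month
    indicator I[z = 0] on q_i**(1/3) and I[q_i = 0], with the constant-p0
    override when dry months are very rare or dominant in training."""
    z, q_i = np.asarray(z, float), np.asarray(q_i, float)
    dry = (z == 0).astype(float)
    p0 = dry.mean()
    if p0 < p_lo or p0 > p_hi:                  # forecast the training frequency
        return dict(a0=logit(np.clip(p0, 1e-6, 1 - 1e-6)), a1=0.0, a2=0.0)
    has_zero = bool(np.any(q_i == 0))
    cols = [q_i ** (1 / 3)] + ([(q_i == 0).astype(float)] if has_zero else [])
    fit = sm.Logit(dry, sm.add_constant(np.column_stack(cols))).fit(disp=0)
    a = fit.params
    return dict(a0=a[0], a1=a[1], a2=a[2] if has_zero else 0.0)

def dry_month_prob(m, q_new):
    """P(z = 0 | q_i) of Eq. (4) for new pseudo-member values."""
    q_new = np.asarray(q_new, float)
    return expit(m["a0"] + m["a1"] * q_new ** (1 / 3) + m["a2"] * (q_new == 0))
```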

The positive monthly precipitation amount is modelled by a gamma distribution with shape parameter \(\alpha_{i}\) and scale parameter \(\beta_{i}\),

$$g_{i} \left( {z{\mid }q_{i} } \right) = \frac{1}{{\beta_{i}^{{\alpha_{i} }} \Gamma \left( {\alpha_{i} } \right)}}z^{{\alpha_{i} - 1}} \exp \left( { - \frac{z}{{\beta_{i} }}} \right).$$
(5)

Since approximately linear relationships are observed between the cube-root observations and the cube-root pseudo forecast members (e.g., Fig. 2), we link the distribution's mean \(\mu_{i} = \alpha_{i} \beta_{i}\) and variance \(\sigma_{i}^{2} = \alpha_{i} \beta_{i}^{2}\), rather than the shape and scale parameters directly, with \(q_{i}^{1/3}\) and \(q_{i}\), respectively. We assume an approximately linear relationship between the expectation of the cube root of monthly precipitation and the cube root of the forecast \(q_{i}\),

$$\mu_{i} = b_{0i} + b_{1i} q_{i}^{1/3}$$
(6)

and between the variance of the precipitation distribution and the forecast \(q_{i}\),

$$\sigma_{i}^{2} = c_{0} + c_{1} q_{i}$$
(7)

We allow the intercepts \(b_{0i}\) and slopes \(b_{1i}\) in Eq. (6) to vary among the pseudo-ensemble members \(q_{i}\) because smaller ensemble members typically have smaller intercepts and larger slopes (e.g., see Fig. 2). The regression coefficients in Eq. (7) are kept the same across all pseudo-ensemble members, assuming similar variance relationships among them, as is often done in the literature (e.g., Chakraborty et al. 2015; Sloughter et al. 2007). Keeping these coefficients constant reduces the number of model parameters and prevents overfitting, which matters given the typically small monthly-scale training datasets resulting from the relatively short retrospective forecast periods of seasonal forecast systems.
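For illustration, a short Python sketch of the gamma component defined by Eqs. (5)–(7) is given below (the study's implementation is in R; the wet-month-only regression for \(b_{0i}\) and \(b_{1i}\) follows the fits shown in Fig. 2, and all coefficient names are placeholders for fitted values):

```python
import numpy as np
from scipy import stats

def fit_mean_link(z, q_i):
    """Eq. (6): regress cube-root observations on cube-root forecasts,
    using wet months only (z is already the cube root of precipitation)."""
    wet = z > 0
    b1i, b0i = np.polyfit(q_i[wet] ** (1 / 3), z[wet], 1)
    return b0i, b1i

def gamma_component_pdf(z_pos, q_i, b0i, b1i, c0, c1):
    """Eq. (5) evaluated at positive cube-root precipitation z_pos, with the
    mean and variance linked to q_i via Eqs. (6)-(7)."""
    q_i = np.asarray(q_i, float)
    mu = np.maximum(b0i + b1i * q_i ** (1 / 3), 1e-8)   # mu_i = alpha_i * beta_i
    var = np.maximum(c0 + c1 * q_i, 1e-8)               # sigma_i^2 = alpha_i * beta_i^2
    return stats.gamma.pdf(z_pos, a=mu ** 2 / var, scale=var / mu)
```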

3.4 Parameter estimation and forecasts

We use training data \(\left\{ {z_{t} ,q_{1t} , \cdots ,q_{Nt} } \right\}_{t \in T}\), where \(T\) is the retrospective forecast period of a GCM excluding the month used for model testing, to determine the parameters \(\left\{ {a_{0i} ,a_{1i} ,a_{2i} ,b_{0i} ,b_{1i} ,c_{0} ,c_{1} ,w_{i} } \right\}_{i = 1, \cdots ,N}\) in the mixture model

$$p\left( {z\;\left| {q_{1} ,\; \ldots ,\;q_{N} } \right.} \right) = \sum\limits_{i = 1}^{N} \; w_{i} \;h_{i} \left( {z\;\;\left| {q_{i} } \right.,\theta_{i} } \right)$$
(8)

via maximum likelihood estimation. It involves three procedures. First, the coefficients for the dry-month probability, \(a_{0i}\), \(a_{1i}\) and \(a_{2i}\), are determined from the pseudo forecast member \(i\) and the training observations. When the dry-month frequency \(p_{0}\) in the training data is outside the interval [0.025, 0.975], to avoid unbalanced training data, we set \(a_{0i} = {\text{logit}}\left( {p_{0} } \right)\) and \(a_{1i} = a_{2i} = 0\) so that \(P\left( {z = 0{\mid }q_{i} } \right) = p_{0}\), i.e., the forecast dry-month probability is identical to that in the training period. Otherwise, similar to BMA for daily precipitation prediction (Sloughter et al. 2007), these coefficients are determined separately for each pseudo-ensemble member via logistic regression with precipitation/no precipitation as the dependent variable and the cube root of the forecast, \(q_{i}^{1/3}\), together with \(I\left[ {q_{i} = 0} \right]\), as the two predictors in Eq. (4). Secondly, the parameters \(b_{0i}\) and \(b_{1i}\) are member-specific and are determined through a regression of the cube root of the observations against the cube root of the pseudo-ensemble forecasts. Thirdly, for the remaining parameters \(w_{i}\), \(c_{0}\) and \(c_{1}\), because the weights are constrained by \(\sum\limits_{i = 1}^{N} \; w_{i} = 1\) in the mixture model, we use an iterative Expectation–Maximisation (EM) procedure. The EM algorithm starts with initial guesses \(w_{i}^{(0)} = \frac{1}{N}\), \(c_{0}^{(0)} = {\text{var}} \left( {z_{t \in T} } \right)\), and \(c_{1}^{(0)} = 0\). In the expectation step of the \((j + 1)\)th iteration, we estimate the posterior probability that pseudo member \(i\) is the best forecast for month \(t\), given the parameter estimates of the \(j\)th iteration; we denote it by \(v_{it}^{(j + 1)}\). It is straightforward that \(\widehat{v}_{it}^{(j + 1)} = \frac{{w_{i}^{(j)} \;h_{i} \left( {z_{t} \left| {q_{it} } \right.,\theta_{i}^{(j)} } \right)}}{{\sum\limits_{n = 1}^{N} \; w_{n}^{(j)} \;h_{n} \left( {z_{t} \left| {q_{nt} } \right.,\theta_{n}^{(j)} } \right)}}\).

The maximisation step estimates \(w_{i}^{(j + 1)}\), \(c_{0}^{(j + 1)}\), and \(c_{1}^{(j + 1)}\) using the current estimates \(v_{it}^{(j + 1)}\), i.e., the alignment of \(q_{it}\) with pseudo member \(i\). The weight \(w_{i}^{(j + 1)}\) is the average across the training period, \(w_{i}^{(j + 1)} = \frac{{\sum\limits_{t \in T}^{{}} {\widehat{v}_{it}^{(j + 1)} } }}{\left| T \right|}\). There are no analytic solutions for the maximum likelihood estimates of \(c_{0}\) and \(c_{1}\), so they are estimated numerically by optimising the likelihood \(L\left( {w_{1} ,\; \ldots ,\;w_{N} ,c_{0} ,c_{1} } \right) = \prod\limits_{t \in T}^{{}} {p\left( {z_{t} \;\left| {q_{1t} ,\; \ldots ,\;q_{Nt} } \right.} \right)}\), using the current best estimates \(w_{i}^{(j + 1)}\), \(c_{0}^{(j)}\), and \(c_{1}^{(j)}\). The EM algorithm guarantees that the likelihood does not decrease after each iteration (Peel and McLachlan 2000) and consequently converges. In practice, the iteration terminates when the change in the likelihood or the parameters is smaller than a given tolerance, for example, a relative change smaller than \(10^{ - 5}\). It is worth noting that the EM algorithm is not guaranteed to converge to a global maximum, and our parameter estimation is sensitive to the starting values, which is a subject for future research.
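The EM procedure can be sketched in Python as follows (the study's implementation is in R; the component density assumes the dry-month probabilities and Eq. (6) coefficients were already estimated in the first two procedures, and details such as the Nelder–Mead optimiser are illustrative choices):

```python
import numpy as np
from scipy import stats, optimize

def component_density(z, q, p0, b0, b1, c0, c1):
    """h_i(z | q_i) of Eq. (3): point mass p0 at z = 0, gamma density for z > 0
    (z is cube-root precipitation; p0, b0, b1 are pre-fitted per member)."""
    mu = np.maximum(b0 + b1 * q ** (1 / 3), 1e-8)
    var = np.maximum(c0 + c1 * q, 1e-8)
    gam = stats.gamma.pdf(np.maximum(z, 1e-12), a=mu ** 2 / var, scale=var / mu)
    return np.where(z == 0, p0, (1.0 - p0) * gam)

def em_weights(z, Q, P0, B0, B1, tol=1e-5, max_iter=200):
    """EM for the weights w_i and shared variance coefficients c0, c1.
    z: (T,) cube-root observations; Q: (T, N) pseudo members;
    P0, B0, B1: per-member dry probabilities and Eq. (6) coefficients, shape (N,)."""
    T, N = Q.shape
    w = np.full(N, 1.0 / N)                       # w_i^(0) = 1/N
    c0, c1 = float(np.var(z)), 0.0                # c0^(0) = var(z), c1^(0) = 0
    prev_ll = -np.inf
    for _ in range(max_iter):
        dens = component_density(z[:, None], Q, P0, B0, B1, c0, c1)   # (T, N)
        mix = np.maximum(dens @ w, 1e-300)
        ll = np.log(mix).sum()
        if np.isfinite(prev_ll) and abs(ll - prev_ll) < tol * abs(prev_ll):
            break
        prev_ll = ll
        v = w * dens / mix[:, None]                                   # E-step
        w = v.mean(axis=0)                                            # M-step: weights
        def neg_ll(c):                                                # M-step: c0, c1
            d = component_density(z[:, None], Q, P0, B0, B1, c[0], c[1])
            return -np.log(np.maximum(d @ w, 1e-300)).sum()
        c0, c1 = optimize.minimize(neg_ll, x0=[c0, c1], method="Nelder-Mead").x
    return w, c0, c1
```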

After all the parameters are estimated, we have a predictive distribution, Eq. (8), for a test month \(m\) conditional on its pseudo forecast members. Three forecast distribution examples from QEBMA for the test month of Aug 2010 are illustrated in Fig. 3. We use these predictive PDFs to generate samples of \(z\), the cube root of monthly precipitation. From these samples, we can easily produce probability statements in terms of the original precipitation amount, such as medians and confidence intervals, as illustrated in Fig. 3.
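For illustration, the forecast summaries shown in Fig. 3 can be obtained by Monte Carlo sampling from Eq. (8); a Python sketch follows (the study's implementation is in R, and all parameter arrays below are placeholders for fitted values):

```python
import numpy as np
from scipy import stats

def sample_qebma(q_new, w, P0, B0, B1, c0, c1, n_samples=5000, seed=None):
    """Draw precipitation samples (mm/day) from the QEBMA predictive PDF of
    Eq. (8) for one forecast month with pseudo members q_new (length N)."""
    rng = np.random.default_rng(seed)
    P0, B0, B1 = (np.asarray(a, float) for a in (P0, B0, B1))
    comp = rng.choice(len(q_new), size=n_samples, p=w)   # pick components by weight w_i
    q = np.asarray(q_new, float)[comp]
    mu = np.maximum(B0[comp] + B1[comp] * q ** (1 / 3), 1e-8)
    var = np.maximum(c0 + c1 * q, 1e-8)
    z = stats.gamma.rvs(a=mu ** 2 / var, scale=var / mu, random_state=rng)
    z = np.where(rng.random(n_samples) < P0[comp], 0.0, z)   # hurdle: dry months
    return z ** 3                                            # back-transform the cube root

# Summaries as plotted in Fig. 3 (values depend on the fitted parameters):
# y = sample_qebma(q_new, w, P0, B0, B1, c0, c1)
# print(np.median(y), np.quantile(y, [0.1, 0.9]))            # median and 80% interval
```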

4 Verification and comparison results

4.1 Comparison models and implementation

We use leave-one-month-out cross-validation for performance evaluation and comparison. When training the models for a location and lead time, we exclude the test month and use the retrospective forecasts and observations from all other months. The procedure is repeated for every month.

The proposed post-processing model QEBMA is compared with several other models. The first comparison illustrates the importance of the quantile ensemble members by contrasting QEBMA with BMA applied to the original ensemble forecast members \(f_{k}\) from the GCMs. This BMA is based on Eq. (1) and replaces the quantile ensemble members \(q_{i}\) with \(f_{k}\) in the component PDFs of Eq. (3). In other words, in BMA the original ensemble forecast members are regarded as "distinguishable", which is often not true in SCFs. To facilitate the comparison between QEBMA and BMA and the other counterparts, we use the same number of ensemble members, i.e., \(N = K\). As a result, both BMA and QEBMA have the same number, \(6K + 1\), of model parameters. Hence, the main difference between QEBMA and BMA is that QEBMA operates on the pseudo-ensemble members at any forecast time point.

QM, which is used as the operational post-processing method for seasonal forecasts in Australia (Griffiths et al. 2023), maps a raw forecast from a GCM to the corresponding quantile of historical observations. It adjusts the forecast mean as well as the ensemble spread (Wood et al. 2002). For a univariate climate variable like precipitation \(Y\), we obtain a raw forecast \(f_{k}\) from a GCM. We denote the Cumulative Distribution Functions (CDFs) of all the raw forecast ensemble members and of the observations in the reference period by \(F_{f}\) and \(F_{o}\), respectively. The QM post-processed forecast \(y_{k}^{(QM)}\) can be formulated as \(y_{k}^{(QM)} = F_{o}^{ - 1} \left( {F_{f} \left( {f_{k} } \right)} \right)\), where \(F_{o}^{ - 1}\) is the inverse function of \(F_{o}\). We use the empirical distributions of the raw forecasts and observations over the training period as the estimates of \(F_{f}\) and \(F_{o}\). Only training data from zero or one month away from the target month are used; for example, for August 2008, the training pairs include data from July to September, excluding the year 2008. QM is carried out at the level of individual ensemble members and keeps the same ensemble size as the raw forecasts. To check whether the relatively more distinguishable pseudo-ensemble forecasts improve post-processing, we further restrict the empirical distribution estimation to each quantile member, instead of all the members; we refer to this version as QMq in the comparison.
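A minimal Python sketch of this empirical mapping is given below (the operational counterpart uses the R package qmap; tie-handling and interpolation choices here are illustrative):

```python
import numpy as np

def quantile_map(f_new, f_train, y_train):
    """Empirical quantile mapping y = F_o^{-1}(F_f(f)): map each raw member to
    the observation quantile matching its non-exceedance probability under the
    training forecast distribution."""
    f_sorted = np.sort(np.ravel(f_train))      # empirical F_f (all members pooled)
    y_sorted = np.sort(np.ravel(y_train))      # empirical F_o (observations)
    p = np.searchsorted(f_sorted, np.asarray(f_new, float), side="right") / (f_sorted.size + 1)
    return np.quantile(y_sorted, np.clip(p, 0.0, 1.0))   # F_o^{-1}(p)

# Applied member by member, so the post-processed ensemble keeps the raw size;
# QMq restricts f_train to one sorted (pseudo) member instead of pooling all members.
```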

For comparison, we also use a recent post-processing model, ECPP (Li and Jin 2020), on monthly precipitation, as it has performed the best among site-specific nonparametric post-processing methods (Jin et al. 2023b; Li et al. 2020). ECPP uses copulas, a powerful statistical tool for modelling the dependence structure among random variables, to model precipitation observations \(Y\) and ensemble forecast medians \(f_{M}\). Copulas conveniently separate the dependence structure of random variables from their marginal distributions without data transformation (Nelsen 2006). To make use of the efficient computation of classical parametric copulas, we treat monthly precipitation as left-censored at 0. Under the left-censoring assumption, the underlying observation variable \(X_{o}\) and forecast variable \(X_{f}\) are continuous and can take values less than 0, such that \(Y = \max (X_{o} ,0)\) and \(f_{M} = \max (X_{f} ,0)\). When \(X_{o}\) or \(X_{f}\) is less than 0, we only observe the value 0 and do not know its exact value. The joint distribution of \(X_{o}\) and \(X_{f}\) is formulated through a bivariate copula function \(C\) with parameter \(\theta\) as \(F(X_{o} ,X_{f} ) = C\left\{ {F_{o} \left( {X_{o} } \right),F_{f} \left( {X_{f} } \right);\theta } \right\}\). The conditional distribution \(F(X_{o} |X_{f} )\), used to generate post-processed forecasts conditional on the raw forecast \(X_{f}\), is \(F(X_{o} |X_{f} ;\theta ) = \frac{{\partial C\left( {F_{o} \left( {X_{o} } \right),F_{f} \left( {X_{f} } \right);\theta } \right)}}{{\partial F_{f} (X_{f} )}}\). We use empirical distributions \(F_{o}\) and \(F_{f}\) based on the historical observations and retrospective forecast data, as empirical distributions conveniently handle possible multimodal distributions and zero precipitation observations. For more technical details, please refer to Li and Jin (2020). We estimate the copula parameter \(\theta\) via maximum likelihood estimation for each combination of location, month, and forecast lead time. Like QM, we include the three nearest months to increase the training data size. Once the estimation is complete, we generate an ensemble forecast from the conditional forecast distribution, \(\widehat{Y}_{s} = F_{o}^{ - 1} \left( {F^{ - 1} \left( {u|v;\theta } \right)} \right)\), where \(u\) is a random number from the uniform distribution \(U[0,1]\), \(F_{o}^{ - 1}\) is the quantile function of the precipitation observations, and \(F^{ - 1} ( \cdot | \cdot )\) is the inverse of the conditional forecast distribution. \(v\) equals \(F_{f} \left( {f_{M} } \right)\) when \(f_{M} > 0\), and is a random number from the uniform distribution \(U\left[ {0,F_{f} \left( 0 \right)} \right]\) when \(f_{M} = 0\). For a GCM with \(K\) ensemble members, we repeat the simulation procedure \(10 \times K\) times to generate a post-processed ensemble forecast, as suggested in Li and Jin (2020).
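To illustrate the conditional simulation step, the Python sketch below substitutes a Gaussian copula with correlation parameter rho for the maximum-likelihood-fitted parametric copulas used by ECPP (the actual implementation uses the R package VineCopula; the empirical marginals follow the description above):

```python
import numpy as np
from scipy import stats

def ecpp_sample(f_med, f_train, y_train, rho, n_sim=10, seed=None):
    """Simulate post-processed precipitation conditional on the ensemble median
    f_med, assuming a Gaussian copula between the (left-censored) observation
    and forecast variables; empirical marginals F_f, F_o come from training data."""
    rng = np.random.default_rng(seed)
    f_sorted, y_sorted = np.sort(np.ravel(f_train)), np.sort(np.ravel(y_train))
    if f_med > 0:
        v = np.searchsorted(f_sorted, f_med, side="right") / (f_sorted.size + 1)
    else:
        # Left-censored forecast: v ~ U[0, F_f(0)], as described above.
        v = rng.uniform(0.0, max((f_sorted <= 0.0).mean(), 1e-6))
    zv = stats.norm.ppf(np.clip(v, 1e-6, 1 - 1e-6))
    # Conditional Gaussian copula draw of u = F_o(X_o) given V = v.
    u = stats.norm.cdf(rho * zv + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(n_sim))
    return np.quantile(y_sorted, u)            # F_o^{-1}(u); censored values map to 0
```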

To check how the pseudo-ensemble forecasts help ECPP, we also apply ECPP to each pseudo member \(q_{i}\) \((i = 1, \cdots ,K)\) separately, instead of the ensemble median \(f_{M}\), generating 10 simulations independently for each member. These \(10 \times K\) forecasts together form a post-processed ensemble forecast. We name this model ECPPq.

To provide a benchmark, researchers often compare forecast techniques with a naïve climatology forecast (Jin et al. 2022; Li and Jin 2020; Li et al. 2020; Schepen et al. 2018), which uses historical observations (except data from the test time window) to form an ensemble forecast. The reference period used in this study is from 1980 to 2018, a total of 39 years. For example, to generate the climatological reference forecast for Jan 2000, we use the historical January observations other than those from 2000 (i.e., from 1980–1999 and 2001–2018) for the leave-one-month-out cross-validation. That is, the reference climatology forecast has 38 ensemble members.
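The reference forecast amounts to a few lines of Python (a sketch; the dictionary input is hypothetical):

```python
import numpy as np

def climatology_forecast(obs_by_year, test_year):
    """Leave-one-year-out climatology ensemble for one calendar month:
    all historical observations of that month except the test year."""
    return np.array([obs for yr, obs in obs_by_year.items() if yr != test_year])

# For Jan 2000 with January observations for 1980-2018, this yields the
# 38-member reference ensemble described above.
```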

We implemented QEBMA and its counterpart models in R, using or modifying packages such as scoringutils, qmap, VineCopula, and ensembleBMA (Fraley et al. 2018; R Core Team 2022; Schepsmeier et al. 2015). We used a relatively short common period, from Feb 1993 to Dec 2010, to enable comparison among the three GCMs.

4.2 Forecast verification metrics and skill scores

To assess the forecast models, cross-validation is conducted for both deterministic and probabilistic forecasts, with ensemble medians treated as deterministic forecasts. To facilitate the comparison of the post-processing models (e.g., QM, ECPP and BMA) across five metrics and three GCMs, we calculate a skill score for each metric relative to the reference climatology forecast, with values ranging from \(-\infty\) to 1. A positive skill score, often expressed as a percentage, indicates a skilful post-processing model, and higher scores correspond to better forecasts (Li et al. 2020). A skill score of 1 (100%) represents a perfect forecast, where all forecasts match their target observations, while a score of 0 indicates performance equivalent to climatology for that metric. Differences in skill scores are regarded as indicators of improvement.

4.2.1 Relative bias

Bias, i.e., the difference between a deterministic forecast and an observation, is often used as a metric. Post-processing models such as QM, ECPP and BMA normally correct systematic biases; for example, the biases of the medians of the post-processing models are close to zero for different lead times, locations, and GCMs, as illustrated in Figure S4 and Figure S5. The mean biases of QEBMA are normally less than 0.6 mm/day. Given the skewness of monthly precipitation observations, as illustrated in Figure S1, a bias of 0.6 mm/day on average is less concerning to end users for larger observations (e.g., 10 mm/day) than for smaller ones (e.g., 1 mm/day). Therefore, we focus on comparing relative biases, defined as the differences between the ensemble medians and observations normalised by the observations (Khajehei and Moradkhani 2017; Li et al. 2020). With such a relative bias we can check whether a forecast tends to over- or under-estimate (indicated by a positive or negative relative bias) and by how much the forecast median deviates from the observations. To avoid possible division by zero, we add a constant \(c_{e} = 0.8\) mm/day to the observation. Thus, the relative bias of \(y_{f,t}\) with respect to \(y_{o,t}\) at time \(t\) is \(E_{t} = \frac{{y_{f,t} - y_{o,t} }}{{y_{o,t} + c_{e} }}\). To facilitate comparisons among post-processing models across GCMs, we further calculate the relative bias skill score relative to the reference forecast, climatology. For each of the 12 calendar months, we calculate the average of the absolute relative biases, \(\overline{{\left| {E_{t} } \right|}}^{(M)}\), for post-processing model \(M\). The relative bias skill score is \(\left( {1 - \frac{{\overline{{\left| {E_{t} } \right|}}^{(M)} }}{{\overline{{\left| {E_{t} } \right|}}^{(ref)} }}} \right) \times 100\%\). A zero skill score indicates the same relative bias as the cross-validation version of the climatology forecast. The average relative bias skill score across the 12 months is the final relative bias skill score for a given location and forecast lead time.
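A Python sketch of this skill score is given below (deterministic forecasts are the ensemble medians, as stated above; array names are illustrative):

```python
import numpy as np

def relative_bias_skill(y_fcst_med, y_ref_med, y_obs, month_of, c_e=0.8):
    """Relative-bias skill score (%) of a forecast against the climatology
    reference, averaged over the 12 calendar months as described above.
    All inputs are aligned 1-D numpy arrays over the test months."""
    e_f = np.abs((y_fcst_med - y_obs) / (y_obs + c_e))
    e_r = np.abs((y_ref_med - y_obs) / (y_obs + c_e))
    scores = [(1.0 - e_f[month_of == m].mean() / e_r[month_of == m].mean()) * 100.0
              for m in np.unique(month_of)]
    return float(np.mean(scores))
```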

4.2.2 Mean absolute error (MAE)

MAE measures deterministic forecast accuracy, \(MAE_{t} = \left| {y_{f,t} - y_{o,t} } \right|\). As for the relative bias skill score above, the MAE is first averaged across years for each month. Against the average MAE of the reference forecast for each month, the MAE skill score is \(1 - \frac{{\overline{{MAE_{t} }}^{(M)} }}{{\overline{{MAE_{t} }}^{(ref)} }}\). The 12-month average is the final MAE skill score for a given location and lead time.

4.2.3 Forecast coverage and reliability

For the calibration of a probabilistic forecast from model \(M\), it is useful to check its coverage \({\text{cov}}_{t}^{(M)}\) of the \((1 - \alpha_{c} ) \times 100\%\) central prediction interval for a given \(\alpha_{c} \in (0,1)\), i.e., the proportion of validating observations located between the \(\frac{\alpha_{c}}{2}\) and \(\left( 1 - \frac{\alpha_{c}}{2} \right)\) quantiles of the predictive distribution or ensemble forecast. Considering that the minimum ensemble size of the three GCMs is 11, we set \(\alpha_{c} = 0.2\) to allow direct comparisons with the raw ensembles. The three examples in Fig. 3 illustrate that the observation is located within the 80% confidence intervals of the probabilistic forecasts generated by QEBMA for all three GCMs. As an average coverage closer to \(1 - \alpha_{c}\) is better, to simplify comparison, the coverage skill score relative to the reference forecast is defined as \(\frac{{\overline{{\left| {{\text{cov}}_{t}^{(ref)} - (1 - \alpha_{c} )} \right|}} - \overline{{\left| {{\text{cov}}_{t}^{(M)} - (1 - \alpha_{c} )} \right|}} }}{{\overline{{\left| {{\text{cov}}_{t}^{(ref)} - (1 - \alpha_{c} )} \right|}} + 0.2}}\), where 0.2 is added to the denominator to avoid division by zero.

To eliminate the influence of the choice of \(\alpha_{c}\), another metric, reliability, is also used. Reliability characterises the difference between the observed and forecast frequencies of an event over a forecast period. We measure forecast reliability by the α-index (Renard et al. 2010), defined as

$$\alpha = 1 - \frac{2}{n}\sum\limits_{t = 1}^{n} {\left| {p_{t}^{*} - \frac{t}{n + 1}} \right|}$$
(9)

where \(p_{t}^{*}\) denotes the sorted values of the forecast probability integral transform of the rainfall observations, \(p_{t} = F_{ens,t} \left( {y_{o,t} } \right)\), with \(F_{ens,t}\) the distribution of the ensemble forecast at time \(t\). If all \(n\) forecasts are reliable, \(p_{t}^{*}\) is expected to be uniformly distributed, as illustrated by the histograms in Figure S6. The α-index ranges from 0 (worst reliability) to 1 (best). For location Lat-24Lon145, QEBMA has the highest reliability value of 0.956 (Figure S6). Comparing the α-index of a model \(M\) with that of the reference model, the reliability skill score is calculated as \(\frac{{\alpha^{(M)} - \alpha^{(ref)} }}{{\alpha^{(ref)} }}\).
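A Python sketch of the α-index computation from ensemble forecasts follows (the empirical non-exceedance fraction is used as the PIT value here; handling of ties and censoring at zero is simplified):

```python
import numpy as np

def alpha_index(ens_forecasts, y_obs):
    """Reliability alpha-index of Eq. (9) for n forecasts with m ensemble members.
    ens_forecasts: (n, m) array; y_obs: (n,) observations."""
    ens = np.asarray(ens_forecasts, float)
    y_obs = np.asarray(y_obs, float)
    n = ens.shape[0]
    pit = (ens <= y_obs[:, None]).mean(axis=1)        # p_t = F_ens,t(y_o,t)
    ranks = np.arange(1, n + 1) / (n + 1)             # t / (n + 1)
    return 1.0 - (2.0 / n) * np.sum(np.abs(np.sort(pit) - ranks))
```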

4.2.4 Continuous ranked probability score

The Continuous Ranked Probability Score (CRPS), a summary measure of forecast bias, reliability, sharpness and efficiency, is used to evaluate the overall forecast skill (Hersbach 2000). It is a popular proper score widely used in evaluating probabilistic forecasts (Fraley et al. 2010; Jin et al. 2022; Li et al. 2020; Schepen et al. 2018; Sloughter et al. 2007). For a probabilistic forecast distribution \(F_{ens,t} (y)\) and the observation \(y_{o,t}\) at time \(t\), it is a quadratic measure of the difference between \(F_{ens,t} (y)\) and the empirical distribution function of the observation, \(CRPS_{t} = \int {\left( {F_{ens,t} (y) - I\left[ {y \le y_{o,t} } \right]} \right)^{2} dy}\). We standardise the average CRPS of model \(M\) with respect to the reference forecast and report it as the CRPS skill score

$$CRPS{\text{ Skill Score}} = \left( {1 - \frac{{\overline{{CRPS^{(M)} }} }}{{\overline{{CRPS^{(ref)} }} }}} \right) \times 100\%$$
(10)

where \(CRPS^{(ref)}\) is the CRPS calculated from the climatology forecast. When the CRPS skill score is greater than 0, the forecast has positive skill. The maximum CRPS skill score is 1 (100%), corresponding to a perfect forecast in which all ensemble members are identical to their target observations.
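For an ensemble forecast, the integral above can be computed with the standard ensemble estimator; a Python sketch, reusing the Fig. 3a numbers as a check, is:

```python
import numpy as np

def crps_ensemble(members, y_obs):
    """CRPS of one ensemble forecast for one observation, via the standard
    ensemble form  mean|x_i - y| - 0.5 * mean|x_i - x_j|  of the integral above."""
    x = np.asarray(members, float)
    return np.abs(x - y_obs).mean() - 0.5 * np.abs(x[:, None] - x[None, :]).mean()

def crps_skill_score(crps_model, crps_ref):
    """Eq. (10): average CRPS of model M standardised by the reference, in %."""
    return (1.0 - np.mean(crps_model) / np.mean(crps_ref)) * 100.0

print(round(crps_skill_score(0.279, 0.458), 1))   # Fig. 3a example: 39.1 (i.e. 0.391)
```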

Three post-processed examples from QEBMA are illustrated in Fig. 3 for grid point Lat-24Lon145 for Aug 2010. As depicted by the dotted colour curves beneath the bold black curve in Fig. 3a, the majority of component predictions contributed by pseudo-ensemble members have minimal weights, with their curves closely hugging the x-axis. Only six of the 28 pseudo-ensemble members have weights > 1% for GloSea5. They are, in decreasing order of their weights, \(q_{21}\) (with weight \(w_{21} = 28.9\%\)), \(q_{7}\) (24.6%), \(q_{13}\) (24.6%), \(q_{3}\) (12.1%), \(q_{11}\) (8.52%), and \(q_{26}\) (1.19%). As the observation is close to the PDF median and within the forecast confidence interval, the forecast has a CRPS of 0.279. Compared with the CRPS of 0.458 for the reference climatology forecast, its CRPS skill score is 0.391.

For ECMWF, QEBMA has six pseudo members with weights > 1% (Fig. 3b): \(q_{10}\) (\(w_{10} = 33.9\%\)), \(q_{22}\) (29.2%), \(q_{6}\) (16.2%), \(q_{25}\) (8.28%), \(q_{3}\) (7.8%) and \(q_{9}\) (4.15%). Its CRPS is 0.292, and its CRPS skill score is 0.361.

For ACCESSc (Fig. 3c), QEBMA has three pseudo members with weights higher than 1%: \(q_{2}\) (\(w_{2} = 49.5\%\)), \(q_{10}\) (41.5%), and \(q_{1}\) (9.0%). Its CRPS is 0.497, higher than that of climatology, resulting in a negative CRPS skill score of −0.087.

For each location and lead time combination, we calculate the five skill scores for each post-processing model and the raw GCM forecasts; these are reported for the three GCMs in the following subsections.

4.3 Results on GloSea5

Table 2 summarises the skill scores of the five metrics over three lead times and 32 locations from the 215 months for GloSea5. QEBMA performs the best in terms of relative bias, MAE, and CRPS, for which the improvement is statistically significant at the 0.05 level (as indicated by '*' in Table 2) over the raw GloSea5 forecasts and the other five post-processing models. For the coverage of observations within the 80% central prediction intervals, QEBMA is better than all the other forecasts except BMA. For reliability, QEBMA is better than the raw GloSea5 forecasts, QMq, ECPPq, and BMA, comparable with QM, and slightly worse than ECPP. Compared with the raw GloSea5 forecasts, QEBMA performs statistically significantly better in terms of all five skill scores. All models other than QEBMA have one or more negative scores on the five metrics, meaning that only QEBMA is better overall than the reference climatology on all five metrics.

Table 2 Average skill scores (%, higher is better) over three lead times and 32 grid points on GloSea5

The boxplots of skill scores over the 32 locations for 0- to 2-month forecast lead times are given in Fig. 4. QEBMA generally has higher skill scores than the raw forecasts and the other post-processing models, especially ECPPq and BMA, across the five metrics and three lead times. Skill scores generally decrease with forecast lead time, except for coverage and reliability, as the forecast difficulty increases. We further examine the 0-month lead time forecasts, typically made on the first day of a given month. Compared with QM, QEBMA has comparable MAE scores but generally higher relative bias scores, indicating that QM may be worse at forecasting months with large precipitation amounts. Such differences are also reflected in the relatively lower CRPS skill scores of QM. Compared with ECPP, QEBMA has comparable skill scores on MAE and reliability, and higher scores on relative bias, coverage and CRPS.

Fig. 4

Boxplots of skill scores (%, higher is better) of raw GloSea5 and five post-processing models over 32 locations for three different lead times (0, 1, and 2 months). QMq is not included for a better visual comparison of CRPS skill scores

We further examine the ensemble forecast skill scores for 0-month lead time forecasts across all 32 locations. As illustrated in Fig. 5a, QEBMA has stable skill scores over the 32 locations in the four different climate zones. For 31 of the 32 locations, QEBMA improves on the raw GloSea5 forecasts, with a mean CRPS skill score improvement of 3.54% (Fig. 5b). QEBMA has higher scores than QM at 31 of the 32 locations, with an average CRPS skill score improvement of 3.25%. It outperforms ECPP at 28 of the 32 locations, with an average improvement of 3.35%, outperforms ECPPq at all 32 locations with an average improvement of 7.90%, and outperforms BMA at 29 of the 32 locations with an average improvement of 3.27%.

Fig. 5

Spatial view of CRPS skill score (in %) of QEBMA and its differences from raw GloSea5, and four post-processing models QM, ECPP, ECPPq, and BMA with a lead time of 0 months. Each subplot has its own colour bar for easy comparison. + (or −) indicates the positive (or negative) values

For the forecasts made for Dec, Jan and Feb with a 0-month lead time, QEBMA has positive skill in general, with CRPS skill scores of up to 19.0%. It generally has higher skill than the raw GloSea5 forecasts and the other post-processing models, as illustrated in Fig. 6. For 0-month lead time forecasts made for Jun, Jul, and Aug, QEBMA has higher average skill scores at most locations (Fig. 7). Compared with the raw GloSea5 forecasts, the improvement of QEBMA in the Outback case study region is not as large as that in the South-East Queensland case study region.

Fig. 6 Spatial view and comparison of CRPS skill score (in %) of QEBMA over 32 locations with a lead time of 0 months for GloSea5 made for all the summers (Dec, Jan and Feb). Each subplot has its own colour bar for easy comparison. + (or −) indicates the positive (or negative) values

Fig. 7 Spatial view and comparison of CRPS skill scores (in %) of QEBMA over 32 locations with a lead time of 0 months for GloSea5 made for all the winters (Jun, Jul and Aug). Each subplot has its own colour bar for easy comparison. + (or −) indicates the positive (or negative) values

4.4 Results on ECMWF

The skill scores of the five metrics for ECMWF are summarised in Table 3. On relative bias, QEBMA has an average score of 1.75%, the only positive score and the best among the seven forecast models. On the second deterministic metric, MAE, QEBMA has the highest score of 7.13%, which is statistically significantly better than those of the other six models. On coverage, QEBMA is better than all the other models except BMA. QEBMA has the highest average reliability skill score of 1.73%, which is statistically significantly better than all the other models except ECPP. QEBMA has an average CRPS skill score of 11.64%, which is statistically significantly better than the other six models. Compared with the raw ECMWF forecasts, only one post-processing model, QEBMA, has higher average skill scores on all five metrics. Furthermore, only QEBMA outperforms climatology on all five metrics on average, as it is the only model with five positive skill scores.

Table 3 Average skill scores (%, higher is better) over three lead times and 32 grid points on ECMWF

As illustrated in Fig. 8, the raw ECMWF forecasts deteriorate with lead time on all five metrics, as do the forecast skill scores of all the post-processing models. BMA has inferior performance on CRPS and MAE. ECPPq does not improve CRPS and MAE in general, and fails to improve reliability even for the 0-month lead time forecasts. QM and ECPP do not improve the ECMWF forecast performance at any of the three lead times on relative bias, MAE or CRPS; both raise the coverage and reliability skill scores to around 0, i.e., comparable with climatology. For all three forecast lead times, QEBMA has generally higher average scores than the raw ECMWF forecasts on the five metrics. Given the clear improvements for the 0-month lead time forecasts in Fig. 8, we next discuss its performance on the 1-month lead time forecasts, focusing on two key metrics: MAE for deterministic forecasts and CRPS for ensemble forecasts.

Fig. 8 Boxplots of skill scores (%, higher is better) of raw ECMWF and five post-processed models over 32 locations for three different lead times (0, 1, and 2 months). QMq is not included for a better visual comparison of CRPS skill scores

As illustrated in Fig. 9 for 1-month lead time forecasts, the MAE skill scores of QEBMA range from −2.55% to 7.39% with a mean of 3.79%. QEBMA has positive scores at 29 out of 32 locations, compared with 32 out of 32 for the 0-month lead time forecasts. Compared with the raw ECMWF forecasts, QEBMA has higher scores at 21 locations, with a slightly higher (by 0.81%) mean score. Compared with QM, QEBMA has higher MAE skill scores at 22 locations with a 0.77% higher mean. Compared with ECPP, QEBMA has higher scores at 23 out of 32 locations with a 2.50% higher mean, and compared with ECPPq, at 23 out of 32 locations with a 1.47% higher mean. QEBMA's average MAE skill score differences from ECPP and QM decrease with forecast lead time: 4.33%, 2.50%, and 1.40% against ECPP, and 2.89%, 0.77%, and −0.51% against QM.

Fig. 9 Spatial view of MAE skill scores (%, higher is better) of QEBMA and its differences from raw and four post-processed ECMWF with 1-month forecast lead time. Each subplot has its own colour bar for easy comparison. + (or −) indicates the positive (or negative) values

From Fig. 10a, QEBMA shows stable overall ensemble forecast skill, with a mean CRPS skill score of 8.68%, over the 32 locations for the 1-month lead time ECMWF forecasts. QEBMA has higher CRPS skill scores than the other models at no fewer than 30 out of 32 locations, except against ECPP, where QEBMA has higher scores at 25 locations. In addition, QEBMA's average CRPS skill score differences from its two closest post-processing competitors, ECPP and QM, decrease with forecast lead time: 4.50%, 2.45%, and 1.86% against ECPP, and 5.06%, 3.51%, and 3.40% against QM.

Fig. 10 Spatial view of CRPS skill scores of QEBMA and its differences from raw and four post-processed ECMWF with a lead time of 1 month

4.5 Results on ACCESSc

The skill scores of the five metrics for ACCESSc and the six post-processing models are summarised in Table 4. QEBMA has statistically significantly higher scores than the raw ACCESSc forecasts, QM, and QMq on all five metrics. Compared with ECPP, QEBMA has much higher scores on relative bias and coverage, and comparable scores on MAE, reliability and CRPS. Compared with ECPPq, QEBMA has statistically significantly higher scores on relative bias, reliability and CRPS, and comparable scores on MAE and coverage. QEBMA outperforms BMA on all metrics except coverage.

Table 4 Average skill scores (%, higher is better) of ACCESSc and six post-processing models over three lead times and 32 grid points

Figure 11 illustrates the skill scores of the five metrics for different forecast lead times for ACCESSc. QEBMA clearly outperforms all its counterparts on the 0-month lead time forecasts, but its advantage decreases with forecast lead time. Compared with its closest competitor, ECPP, its average MAE skill score differences decrease from 2.26% to −0.50% and −0.46% for 0- to 2-month lead times, and its average CRPS skill score differences from ECPP decrease from 2.22% to −0.92% and −0.25%.

Fig. 11 Boxplots of skill scores (%, higher is better) of ACCESSc and five post-processed models over 32 locations for three different lead times (0, 1, and 2 months). QMq is not included for a better visual comparison of CRPS skill scores

4.6 Discussions

Compared with the raw forecasts of the three GCMs, it is evident from Tables 2, 3 and 4 that the post-processing models exhibit varying degrees of skill improvement. Specifically, for ECMWF, both QM and ECPP deteriorate forecast performance in terms of relative bias, MAE and CRPS averaged across the 32 locations. Notably, among the six post-processing models, only QEBMA consistently outperforms the raw forecasts on these metrics for all three GCMs.

As skill scores are calculated against the same reference model, we can average the improvements of these post-processing models over the raw forecasts of the three GCMs. Table 5 shows that, on average, QEBMA exhibits the highest skill improvement on four of the five metrics, trailing slightly behind BMA only on coverage. Notably, only QEBMA and ECPP among the six post-processing models demonstrate positive skill scores on all five metrics, suggesting that the forecasts generated by QEBMA and ECPP generally outperform the raw GCM forecasts. Additionally, QEBMA outperforms ECPP on all five metrics.

Table 5 Average skill scores (%) improved from the raw forecasts of three GCMs
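
Because every skill score in Tables 2, 3 and 4 uses the same climatology reference, the Table 5-style numbers can be formed by differencing skill scores and averaging across GCMs. The sketch below only illustrates that bookkeeping; all values in it are placeholders, not results from the paper.

```python
# Minimal sketch: average improvement of a post-processing model over raw GCM
# forecasts for one metric, using placeholder skill scores (%).
import numpy as np

skill = {
    "raw":   {"GloSea5": -1.0, "ECMWF": 0.5, "ACCESSc": -2.0},
    "QEBMA": {"GloSea5":  3.5, "ECMWF": 4.0, "ACCESSc":  1.0},
}
improvement = np.mean([skill["QEBMA"][g] - skill["raw"][g] for g in skill["raw"]])
print(f"Average improvement over raw forecasts: {improvement:.2f}%")
```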

QEBMA's skill improvement depends on all the ensemble members of the raw GCM forecasts. For poor raw forecasts, such as the 2-month lead time forecasts from ACCESSc, QEBMA's skill can be inferior to the reference model climatology, e.g., MAE in Fig. 11. The performance of QEBMA for longer forecast lead times, such as 3 or more months, will be examined in future work, especially once the forecast skill of GCMs is further enhanced.

In this section, QEBMA uses the same ensemble size as the raw forecasts from a GCM. Thus, the pseudo-ensemble members are obtained by simply sorting the raw ensemble members, and the only difference between QEBMA and BMA is that QEBMA conditions on the raw ensemble members after sorting (see the sketch below). From Tables 2, 3 and 4, QEBMA outperforms BMA on four out of five metrics for each of the three GCMs. Similarly, as listed in the last two columns of Table 5, QEBMA has a higher average improvement over the raw forecasts of the three GCMs than BMA on all five metrics except coverage. On both MAE and CRPS, QEBMA is superior to the raw forecasts of the three GCMs, whereas BMA is inferior. These comparisons illustrate the extra information QEBMA extracts from the quantiles of the entire set of raw ensemble members.
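
A minimal sketch of forming pseudo-ensemble members as described above: when the pseudo-ensemble size equals the raw ensemble size, sorting the raw members suffices; for a different size, equally spaced quantiles of the raw ensemble could be used instead. The specific quantile convention in the else-branch is an assumption for illustration, not the paper's exact choice.

```python
# Minimal sketch: pseudo-ensemble members from a raw GCM ensemble.
import numpy as np

def pseudo_members(raw_members, size=None):
    """Return sorted raw members, or `size` equally spaced quantiles of them."""
    raw_members = np.sort(np.asarray(raw_members, dtype=float))
    if size is None or size == raw_members.size:
        return raw_members                       # same size: just sort
    probs = (np.arange(1, size + 1) - 0.5) / size
    return np.quantile(raw_members, probs)       # assumed quantile convention

raw = [3.2, 0.0, 18.5, 7.1, 42.0, 11.3]          # toy raw ensemble (mm/month)
print(pseudo_members(raw))                       # sorted raw members
print(pseudo_members(raw, size=4))               # smaller pseudo-ensemble
```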

QMq and ECPPq do not perform better than their counterparts QM and ECPP, especially on the two key probabilistic forecast metrics, reliability and CRPS, because they treat the pseudo-ensemble forecast members equally. When these pseudo-ensemble members compete with each other, as in QEBMA, more than two-thirds of them receive quite small weights based on their historical performance (see examples in Fig. 3) and contribute little to the post-processed forecasts. This indicates that the pseudo members should not be treated equally.

The improvement in forecast performance with QEBMA stems from multiple factors. Like conventional post-processing techniques, QEBMA utilises historical observational data and GCM re-forecasts to refine GCM forecasts. Additionally, QEBMA makes better use of ensemble information by transforming individual members into pseudo-ensemble members, and a competitive weighting mechanism differentiates the influence of individual pseudo members, including the ensemble median. This brings the forecast ensembles into better agreement with the relationship between re-forecasts and historical observations, consequently enhancing the accuracy and skill of monthly precipitation forecasts.
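
To make the mixture-and-weighting idea concrete, the sketch below evaluates a QEBMA-style predictive CDF as a weighted mixture of hurdle components, one per pseudo member, each with a point mass at zero and a gamma distribution for positive amounts (as described in the conclusions). It is an illustration only, not the authors' parameterisation: how the zero probability and gamma parameters are linked to each (cube-root-transformed) pseudo member, and the weights themselves, are assumed values here.

```python
# Illustrative sketch: predictive CDF of a weighted hurdle-gamma mixture.
import numpy as np
from scipy import stats

def component_cdf(y, p_zero, shape, scale):
    """CDF of one hurdle component: point mass p_zero at 0, gamma above 0."""
    if y < 0:
        return 0.0
    return p_zero + (1.0 - p_zero) * stats.gamma.cdf(y, a=shape, scale=scale)

def mixture_cdf(y, weights, params):
    """Weighted mixture over pseudo-member components; weights sum to one."""
    return sum(w * component_cdf(y, *p) for w, p in zip(weights, params))

# Toy usage: three pseudo members with assumed (p_zero, shape, scale) and weights
params = [(0.30, 1.5, 10.0), (0.10, 2.0, 20.0), (0.05, 2.5, 30.0)]
weights = [0.5, 0.3, 0.2]
print(mixture_cdf(25.0, weights, params))   # P(monthly precipitation <= 25 mm)
```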

The proposed QEBMA method has certain limitations. Its performance is closely tied to the raw forecast performance of a GCM, in line with most forecast post-processing techniques. Its parameter estimation procedure aims for a local optimum and is relatively slow, particularly when dealing with a large number of pseudo-ensemble members. QEBMA assigns the pseudo-ensemble the same size as the GCM ensemble, a choice that may be suboptimal: a significant proportion of pseudo-ensemble members contribute minimally, given that their weights are close to zero, which may diminish the effectiveness of the method.

This paper endeavours to improve the accuracy and skill of ensemble seasonal precipitation forecasts at a monthly scale by introducing a parametric post-processing technique, QEBMA, that leverages individual ensemble members. The evaluation spans 32 locations and three distinct GCMs. In the realm of seasonal forecasts, prevalent non-parametric (Griffiths et al. 2023; Monhart et al. 2018; Shao and Li 2013) or deep-learning-based models (Jin et al. 2023a, 2023b; Vitart et al. 2022) typically treat all ensemble members equally. Conversely, most parametric post-processing techniques for seasonal precipitation forecasts rely predominantly on ensemble medians (or means) (Li and Jin 2020; Li et al. 2020; Schepen et al. 2018; Wang et al. 2019), thereby potentially neglecting valuable forecast information in individual ensemble members. These studies span diverse temporal scales, encompassing daily accumulated precipitation (Griffiths et al. 2023; Jin et al. 2023a, 2023b; Li and Jin 2020; Li et al. 2021; Monhart et al. 2018; Schepen et al. 2018; Shao and Li 2013), weekly precipitation (Monhart et al. 2018), and fortnightly precipitation (Vitart et al. 2022). Although originally designed for these temporal scales, they are potentially applicable to monthly or seasonal precipitation forecasting, e.g., (Li et al. 2020; Wang et al. 2019). In addition to GCM precipitation forecasts, certain post-processing studies incorporate supplementary predictors (Jin et al. 2023b; Li et al. 2020; Scheuerer et al. 2020; Vitart et al. 2022), without a consistent conclusion. Most other studies, to our knowledge, present forecast assessment results for a single GCM only; in contrast, this study provides performance assessments for three distinct GCMs.

5 Conclusions

To make better use of all the ensemble members in probabilistic monthly precipitation forecasts, we have proposed the Quantile Ensemble Bayesian Model Averaging (QEBMA) model for post-processing Seasonal Climate Forecasts (SCFs). It takes an ensemble forecast as a whole and uses its different quantiles to form pseudo-ensemble forecast members that establish consistent connections across different forecast initialisation times. To capture the approximately linear relationship we observed between a pseudo-ensemble member and observations after cube root transformation, a hurdle distribution, with a point mass at zero for dry months and a gamma distribution for positive precipitation amounts, is used conditional on each pseudo member. These distributions are then mixed to form a flexible predictive probability distribution, with weights proportional to the historical forecast performance of the pseudo members. The evaluation over 32 locations in Australia and three seasonal forecast systems demonstrates that QEBMA often statistically significantly outperforms the raw forecasts, several existing post-processing models, and the seasonal forecast benchmark climatology in terms of five forecast metrics: relative bias, mean absolute error, coverage, reliability and continuous ranked probability score. As only quantiles of ensemble forecasts are used, QEBMA is also suitable for SCFs whose ensemble size in the retrospective forecast period differs from that of the operational setting, such as ECMWF (Johnson et al. 2019).

For a fair comparison, we have used the same number of quantiles as in the original ensemble forecasts. Considering that many pseudo-ensemble members contribute little to the final forecast distribution, it would be interesting to examine whether a different, likely smaller, number of quantiles would perform better. We have focused on post-processing at a single location, without considering the spatio-temporal correlation in the forecasts; incorporating spatial and temporal structure into post-processed seasonal precipitation forecasts is a possible direction for future work. The idea of using pseudo-ensemble members could also improve model averaging across multiple ensemble forecasts, such as combining GloSea5, ECMWF, and ACCESS-S1, and we plan to test how it can be extended to handle multiple GCMs. For the cross-model comparisons and analyses in this paper, a relatively short common period of retrospective forecast data from the three GCMs was used for performance verification and comparison. Subsequent research will comprehensively assess the proposed QEBMA method and its variants, exploring longer retrospective forecast windows, extended lead times, operational seasonal forecasts, seasonal scales, and related aspects.