Area-covering postprocessing of ensemble precipitation forecasts using topographical and seasonal conditions

Probabilistic weather forecasts from ensemble systems require statistical postprocessing to yield calibrated and sharp predictive distributions. This paper presents an area-covering postprocessing method for ensemble precipitation predictions. We rely on the ensemble model output statistics (EMOS) approach, which generates probabilistic forecasts with a parametric distribution whose parameters depend on (statistics of) the ensemble prediction. A case study with daily precipitation predictions across Switzerland highlights that postprocessing at observation locations indeed improves high-resolution ensemble forecasts, with 4.5%\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$4.5\%$$\end{document} CRPS reduction on average in the case of a lead time of 1 day. Our main aim is to achieve such an improvement without binding the model to stations, by leveraging topographical covariates. Specifically, regression coefficients are estimated by weighting the training data in relation to the topographical similarity between their station of origin and the prediction location. In our case study, this approach is found to reproduce the performance of the local model without using local historical data for calibration. We further identify that one key difficulty is that postprocessing often degrades the performance of the ensemble forecast during summer and early autumn. To mitigate, we additionally estimate on the training set whether postprocessing at a specific location is expected to improve the prediction. If not, the direct model output is used. This extension reduces the CRPS of the topographical model by up to another 1.7%\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$1.7 \%$$\end{document} on average at the price of a slight degradation in calibration. In this case, the highest improvement is achieved for a lead time of 4 days.


Introduction
Today, medium-range weather forecasts are generated by Numerical Weather Prediction (NWP) systems which use mathematical (or physics-based, numerical) models of the atmosphere to predict the weather.NWP forecasts are affected by considerable systematic errors due to the imperfect representation of physical processes, limited spatio-temporal resolution, and uncertainties in the initial state of the climate system.This initial condition uncertainty and the fact that the atmosphere is a chaotic system, where small initial errors can grow into large prediction errors, make weather forecasting challenging (Wilks and Vannitsem 2018).Therefore, attention has turned to probabilistic weather forecasting to quantify weather-dependent predictability from day to day.Such probabilistic forecasts are generated using different forecasting scenarios (referred to as ensemble members) based on slightly perturbed initial conditions and perturbed physcial parameterizations in the NWP system.Unfortunately, such ensemble forecasts are not able to capture the full forecasting uncertainty, and hence ensemble forecasts are often biased and over-confident (Wilks 2018).Statistical postprocessing can be used to calibrate ensemble forecasts.The objective of statistical postprocessing is to find structure in past forecast-observation pairs to correct systematic errors in future predictions.Various approaches to postprocess ensemble predictions have been developed over the last years, an overview is given for example in Wilks (2018).Two main approaches are Bayesian model averaging (BMA; Raftery et al. 2005) and ensemble model output statistics (EMOS; Gneiting et al. 2005).The BMA approach generates a predictive probability density function (PDF) using a weighted average of PDFs centred on the single ensemble member forecasts.The EMOS approach provides a predictive PDF using a parametric distribution whose parameters depend on the ensemble forecast.
The aim of this paper is to provide calibrated and sharp precipitation predictions for the entire area of Switzerland by postprocessing ensemble forecasts.The postprocessing model should account for seasonal specificities and while it is developed at observation locations, it should also be applicable at unobserved locations and thereby allow to produce area-covering forecasts.As a starting point, we apply the censored Logistic EMOS model of Messner et al. (2014) to a dataset of daily precipitation predictions for Switzerland.To estimate the parameters of the predictive distribution, the model is fitted on a training dataset.There are two basic approaches to select the training data: A local model fits the coefficients only with past data from the same location while a global model uses all the available forecast observation pairs.We show that the local model outperforms the global model in our case study with daily precipitation forecasts.Thorarinsdottir and Gneiting (2010) came up with the same result in a study about wind speed over the North American Pacific Northwest.The difficulty is that the past observations used in the postprocessing are only available at the weather stations.The local model can therefore be applied solely at the stations.To provide spatially comprehensive but locally specific forecasts, the global model is enhanced with topographical information.To enable local effects without binding the model to the stations, the data used to estimate the parameters of the predictive distribution is selected with respect to the topographical similarity between the location it originated and the prediction location.Such approaches (referred to as semi-local) use a specific weighting of the training data for each point of prediction, thus one has to fit the model repeatedly.A similar postprocessing method was introduced by Hamill, Hagedorn and Whitaker (2008) for ensemble temperature forecasts and Lerch and Baran (2018) for ensemble wind speed forecasts.Lerch and Baran (2018) measure the similarity of two locations with distances based on the geographical location, the station climatology and the station ensemble characteristics.Their motivation is to solve numerical stability issues of the local model, which requires long training periods since only the data of one station is used for training.Beside the different motivation, we use other similarity measures and focus on topographical variables that are relevant for postprocessing in an area with complex topography such as Switzerland.
The second target of this paper is to develop a postprocessing method which addresses seasonal specificities that affect postprocessing performance.Our case study highlights that a key difficulty of postprocessing ensemble precipitation forecasts lies in summer and early autumn, when in many places postprocessing leads to a degradation of the forecast quality.To deal with this difficulty, seasonality has been included into the model through covariables, weights and via a novel approach referred to as Pretest.The later first evaluates whether a postprocessing at a given station in a given month is expected to improve forecast performance.A comparison of the performances shows, that for our case study the Pretest approach performs best (see supplementary material for details).
The remainder of this paper is organized as follows: Section 2 introduces the data, the notation and the verification techniques.The elaboration of the postprocessing model is presented in Section 3. Finally, the external model validation is presented in Section 4 and discussed in Section 5.The first dataset (subsequently referred to as Dataset 1) consists of ensemble forecasts and verifying observations for the daily precipitation amount between January 2016 and July 2018 whereby a day starts and ends at 00 : 00 UTC.The data is available for 140 automatic weather stations in Switzerland recording sub-daily precipitation.The second dataset provides the observed daily precipitation amounts of 327 additional stations (on top of the 140 ones of Dataset 1) that record only daily sums.For historical reasons, these daily sums are aggregated from 06 : 00 UTC to 06 : 00 UTC of the following day.For the purpose of uniformity, these daily limits are adopted for all stations in Dataset 2. The larger number of stations from Dataset 2 is only available from June 2016 to July 2019.All stations from Dataset 1 are also part of Dataset 2. The stations of both datasets are depicted in Figure 1.
Since COSMO-E makes forecasts for five days into the future, the different intervals between the forecast initialization time and the time for which the forecast is valid have to be taken into account.This is referred to as forecast lead time.The possible lead times for a prediction initialized at 00 : 00 UTC are 1, 2, 3, 4, 5 days and 1.5, 2.5, 3.5 and 4.5 days for one initialized at 12 : 00 UTC respectively.Figure 2 illustrates the possible lead times for both datasets.For Dataset 2 the forecast lead times increase from 24 hours to 30 hours for the first day due to the different time bounds of aggregation.Also, only four complete forecasts can be derived from each COSMO-E forecast with Dataset 2. We use Dataset 1 reduced to forecast-observation pairs with lead time equals 3 for the elaboration of the methodology.This selection procedure depends strongly on the dataset bearing the danger of overfitting.To assess this risk, the models of choice will be evaluated with Dataset 2. This is done for all lead times between 1 and 4.
In addition to the forecast-observation pairs and the station location (latitude, longitude, and altitude), topographical indices derived from a digital elevation model with 25m horizontal resolution are used.The topographical indices are available on 7 grids with decreasing horizontal resolution from 25m to 31km describing the topography from the immediate neighbourhood (at 25m resolution) to large-scale conditions (at 31km).The topographical indices include the height above sea level (DEM), a variable denoting if the site is rather in a sink or on a hill (TPI), variables describing aspect and slope of the site and variables describing the derivative of the site in different directions.

Notation
In this paper, the observed precipitation amount (at any specified location and time) is denoted as y.y is seen as a realization of the non-negative valued random variable Y .The K ensemble members are summarized as x = (x 1 , ..., x K ).Predictive distributions for Y are denoted by F and take the form of probability density functions (PDFs) or cumulative distribution functions (CDFs).In literature, a capital F is used to refer indistinguishably to the probability measure or its associated CDF, a loose convention that we follow for simplicity within this paper.A forecast-observation pair written as (x i , y i ) refers to a raw ensemble forecast and the corresponding observation.A pair written as (F i , y i ) generally indicates here, on the other hand, that the forecast is a postprocessed ensemble prediction.

Verification
To assess a postprocessing model, the conformity of its forecasts and the associated observations is rated.In the case of probabilistic forecasts, a predictive distribution has to be compared with single observation points.We follow Gneiting, Balabdoui and Raftery (2007) by aiming for a predictive distribution maximizing the sharpness subject to calibration.
Calibration Thorarinsdottir and Schuhen (2018) define a forecasting distribution F as calibrated "if the observation cannot be distinguished from a random draw from the predictive distribution".This can be rated with the Probability Integral Transform F(Y ) (PIT; Dawid 1984) of the random observation Y which should be uniformly distributed over [0, 1] for a calibrated forecast.In practice, the n available forecast-observation pairs (F i , y i ) with i = 1, 2, ..., n out of the test dataset are examined by having a look at the histogram of the PIT values A flat histogram with equally populated bins indicates a calibrated forecasting method.To investigate the calibration of the raw ensemble, the discrete equivalent of the PIT histogram called verification rank histogram is used (Hamill and Colucci 1997).It is generated by ranking the values of every ensemble-observation pair.A histogram of the ranks of the observations shows how they are distributed within the ensemble.Again, a flat histogram indicates a calibrated forecasting method.Hamill (2001) pointed out that the flatness of the PIT histogram is necessary but not sufficient for a forecast to be ideal.Gneiting, Balabdoui and Raftery (2007) took these results as a motivation to aim for a predictor which maximizes the sharpness while being calibrated.Sharpness is a characteristic of the forecast only and does not compare it with the actual observations.Therefore, we evaluate the sharpness only in combination with calibration and not individually.
Accuracy The accuracy of a forecast is assessed with summary measures addressing both calibration and sharpness simultaneously.These functions called Scoring Rules map each forecast-observation pair (F, y) to a numerical penalty, where a smaller penalty indicates a better forecast and vice versa (Thorarinsdottir and Schuhen 2018).Let be a Scoring Rule, where Y is a set with possible values of the quantity to be predicted and F a convex class of probability distributions on Y .The Scoring Rule is said to be proper relative to the class F if where F, G ∈ F are probability distributions and G in particular is the true distribution of the random observation Y .The subscript Y ∼ G at the expected value denotes that the expected value is computed under the assumption that Y has distribution G (Gneiting, Balabdoui and Raftery 2007).
We will focus on one popular proper Scoring Rule: The Continuous Ranked Probability Score (CRPS; Matheson and Winkler 1976).The CRPS is a generalization of the absolute error for probabilistic forecasts.It can be applied to predictive distributions with finite mean and is defined as follows (Thorarinsdottir and Schuhen 2018): where I A (x) denotes the indicator function for a set A ⊆ R which takes the value 1 if x ∈ A and 0 otherwise.To compare the performance of the postprocessing models with the one of the raw ensemble, we need a version of the CRPS for a predictive distribution F given by a finite ensemble x 1 , ..., x K .We use the following definition by Grimit et al. (2006): In practice, competing forecasting models are compared by calculating the mean CRPS values of their predictions over a test dataset.The preferred method is the one with the smallest mean score.We use a Skill Score as in Wilks (2011) to measure the improvement (or deterioration) in accuracy achieved through the postprocessing of the raw ensemble: where the skill Skill(F, x, y) characterizes the improvement in forecast quality by postprocessing (CRPS(F, y)) relative to the forecast quality of the raw ensemble forecast (CRPS(x, y)).

Postprocessing model
The aim of this chapter is to find a suitable postprocessing model by comparing the forecast quality of different approaches on the basis of Dataset 1.

Censored Logistic regression
In this section, we present a conventional ensemble postprocessing approach as a starting point for later extensions.We compared several EMOS models and Bayesian Model Averaging in a case study with Dataset 1 whereby a censored inhomogeneous Logistic regression (cNLR; Messner et al. 2014) model turned out to be the most promising approach.
The performance of the models has been assessed by crossvalidation over the months of Dataset 1.Of the 31 months available, the months are removed one at a time from the training data set and the model is trained with the remaining 30 months.Predictive performance is then assessed based on comparison between the observations at each left-out month from the models trained on the remaining months.The basic models are tested in two versions: For the global version, the model is trained with the data of all stations allowing the later application of the model to all stations simultaneously.The local version requires that models are trained individually for each station with the past data pairs of this station.More details on the alternative approaches to the cNLR model and the results from the comparison of approaches can be found in the supplementary material (Section A).
The cNLR approach is a distributional regression model: We assume that the random variable Y describing the observed precipitation amount follows a probability distribution whose moments depend on the ensemble forecast.To choose a suitable distribution for Y , we take into account that the amount of precipitation is a non-negative quantity that takes any positive real value (if it rains) or the value zero (if it does not rain).These properties are represented in a zero censored distribution.We assume that there is a latent random variable Y * satisfying the following condition (Messner, Mayr and Zeileis 2016): In this way, the probability of the unobservable random variable Y * being smaller or equal than zero is equal to the probability of Y being exactly zero.
For the choice of the distribution of Y * , we have compared different parametric distributions: a Logistic, Gaussian, Student, Generalized Extreme Value and a Shifted Gamma distribution, whereby the Logistic turned out to be the most promising.For the Logistic distribution with location m and scale s, Y * ∼ L (m, s) has probability density function The expected value and the variance of Y * are given by: The location m and the scale s of the distribution have to be estimated with the ensemble members.We note that in the COSMO-E ensemble, the first member x 1 is rendered with the best estimate of the initial conditions whereas the other members are initialized with randomly perturbed initial conditions.The first member is thus not exchangeable with the other members and it is therefore reasonable to consider this member separately.Members x 2 , x 3 , ..., x 21 are exchangeable, meaning that they have no distinct statistical characteristics.
For this reason, they can be summarized by the ensemble mean without losing information (Wilks 2018).Taking this into account, we model the location m and the scale s of the censored Logistic distribution as follows: where x is the ensemble mean and SD(x) is the ensemble standard deviation.The factor that converts the standard deviation of the logistic distribution into the scale is included in γ 1 .The five regression coefficients are summarized as ψ = (β 0 , β 1 , β 2 , γ 0 , γ 1 ).
The predictive distributions F i are censored Logistic with location and scale depending on the ensemble forecasts x i and the coefficient vector ψ.We use the optimization technique called Scoring Rule estimation (Gneiting et al. 2005) CRPS(F ψ i , y i ) refers to the CRPS value of data pair (F ψ i , y i ) where the predictive distribution F ψ i depends on the coefficient vector ψ.We use (F i , y i ) short for (F ψ i , y i ).The weight w i (s) of training data pair (F i , y i ) depends on the similarity between the location it originated and the location s where we want to predict.We set w i (s) = 1 if training data pair i originated in one of the L closest stations to the prediction site s.For the training pairs from the remaining stations, we set w i (s) = 0.This ensures that the training dataset is restricted to the data from the L stations which are most similar to the prediction site s.Consequently, the coefficient vector ψ * (s) 1 Literature knows another kind of weighted CRPS value: Threshold and quantile weighted versions of the CRPS are used when wishing to emphasize certain parts of the range of the predicted variable.The threshold weighted version of the CRPS is given by where u is a non-negative weight function on the real line (Gneiting and Ranjan 2011, consult their paper for the analogous definition of the quantile weighted version of the CRPS).
which minimizes c(ψ; s) depends on the location s.
Following Lerch and Baran (2018), the similarity between locations is quantified with a distance function, which, in our case, is intended to address the topography of the respective locations.From the topographical dataset we have about 30 variables in 8 resolutions at our disposal.To determine which ones to use, we examine the topographical structure of the raw ensemble's prediction error.We compare the observed daily precipitation amounts with the ensemble means and take the station-wise averages of these differences.These preliminary analyses were made with the first year ( 2016) of Dataset 1.The mean prediction errors per station are depicted in Figure 3a.The topographical structure of the ensemble prediction error seems to be linked to the mountainous relief of Switzerland.
In a first approach, we define the similarity of two locations via the similarity in their distances to the Alps.It turns out that such an approach depends on numerous parameters (consult the supplementary material for more details).The proposed alternative is to focus on the variable DEM describing the height above sea level and use the values provided by a grid with low resolution (here 31km horizontal grid spacing).This ensures, that the small-scale fluctuations in the height are ignored such that the large-scale relief is represented.
Figure 3b shows the same station-wise means as Figure 3a, but this time the values of each station are plotted versus their height above the sea level (DEM) in resolution 31km.
The best fit with a polynomial is achieved by modelling the ensemble mean bias as a linear function of the DEM variable, the solid line is depicting this linear function.
The linear dependency visible in Figure 3b motivates the following choice of a distance function to measure the similarity between two locations.Let us define a function DEM which maps a location s to its height above the sea level in the resolution 31km: where D ⊆ R 2 is a set with all the coordinate pairs lying in Switzerland.The similarity of locations s 1 and s 2 is then measured by the following distance: be the ordered distances d DEM (s, s j ) between the m stations from the training data and the prediction location s.Let s i be the location of the station where forecast-observation pair (F i , y i ) originated.Then, the weights with respect to the distance d DEM are defined as: These weights ensure that the data pairs, which originated at one of the L most similar stations, get weight 1 and the remaining get weight 0.
Besides this approach, several other topographical extensions of the cNLR model have been tested (with Dataset 1): The topographical variables have been summarized by Principal Components which have been used as predictor variables in the location estimation of the cNLR model.The glinternet algorithm of Lim and Hastie (2013) has been used to uncover important additive factors and interactions in the set of topographical variables.These variables have been used as predictor variables in the location estimation as well.A more basic weighted approach has been based on Euclidean distances in the ambient (two and three dimensional) space.
All extensions of the cNLR model have been compared with the local and the global fit of this very model.The target of this work is to develop an area-covering postprocessing method.Therefore, the extended and the global models are trained with data where the predicted month and the predicted station are left out.This simulates the situation where postprocessing must be done outside a weather station, i.e. without past local measurements.The training period is set to the last year (12 months) before the prediction month.Consequently, forecasting performance can only be assessed with (test) data from 2017 and 2018.The case study with Dataset 1 showed that all other topographical approaches are less successful than the DEM approach, more details and the results can be found in the supplementary material (Section B).

Seasonal extension
In addition to our efforts to quantify similarities between locations, we also aim to investigate ways of further improving postprocessing outside of measurement networks by accounting for seasonal specificities.To examine the seasonal behavior of the local and the global cNLR model, we focus on their monthly mean CRPS values and compare them with the ones of the raw ensemble.Next, we are interested in whether there are regional differences in the model performance within a month.The global cLNR model is used as we will extend this model afterwards.We plot the maps with the station-wise means of the skill exemplary for the month with the best skill (February 2018) and the one with the worst (July 2017).The maps depicted in Figure 5 show that the skill of the global cNLR model varies between different weather stations.Again, the structure seems to be related to the mountainous relief of Switzerland.We note that for both months the skill in the Alpine region is distinctly higher than in the flat regions.
We use this knowledge to develop an approach which tries firstly to clarify whether the postprocessing in a given month at a given prediction location is worthwhile.The idea is to "pretest" the model with data of similar stations and from similar months by comparing its performance with that of the raw ensemble.
with cardinality H. Let further (x i , y i ) be a forecastobservation pair where the forecast is the raw ensemble.
then the pretesting algorithm decides that the raw ensemble is not postprocessed in the given month at the given location.
On the contrary if 1 then the pretesting algorithm decides that the raw ensemble is postprocessed in the given month at the given location, the fit is done a second time with the whole year of training data.
The Pretest approach has been compared with several other seasonal approaches: In a basic approach, we reduce the training period to months from the same season as the prediction month.Another approach uses the sine-transformed prediction month as an additional predictor variable to model the yearly periodicity.The third approach reduces the training data to pairs which have a similar prediction situation (quantified with the ensemble mean and the ensemble standard deviation).The methodology for the comparison has been the same as for the topographical extensions introduced in Section 3.2.The Pretest approach turns out to be the most promising method for our case study with Dataset 1, more details and the comparison results can be found in the supplementary material (Section B).

Model adjustment
For the subsequent evaluation of postprocessing models with Dataset 2, we select a few postprocessing approaches to document the impact of increasing complexity on forecast quality.We will use the raw ensemble and the local such as the global version of the cNLR model as baselines.Further on, we will evaluate the cNLR model extended by the DEM similarity (cNLR DEM).Finally, we will test this same model extended a second time with the pretest approach (cNLR DEM+PT).For the last two models, we have to fix the amount L of similar stations we use for the topographical extension.For the last model we need to fix additionally the pretesting split.
To determine the amount of similar stations in use, the numbers which are multiples of ten between 10 and 80 have been tested (compare Figure 6).We use the data from August  This chapter presents the evaluation of the different postprocessing models.As already announced, the independent Dataset 2 is used to take into account the risk of overfitting during the elaboration of the methodology.

Methodology
We are interested in the area-covering performance of the models.Therefore, we are particularly interested in the performance at locations which cannot be used in model training (as no past data is available).This is the reason why we assess the models only with the 327 additional stations of the second dataset.None of these stations have been used during the model elaboration in Section 3. When determining L in chapter 3.4 we used a training dataset with 139 stations (139 instead of 140 as we trained without the past data from the prediction station).For this reason we carry on with using only the 140 stations of the first dataset to train the models.This rather conservative approach could be opposed by a Cross Validation over all 467 stations, for which, however, another choice of L would probably be ideal.
The local version of the cNLR model is not able to perform area-covering postprocessing and needs the additional stations in the training from Dataset 2. Despite this, it is fitted and assessed as a benchmark here.We train the models for each of the 327 stations and each month between June 2017 and May 2019.This ensures that we have one year of training data for each month and that we have seasonally balanced results.An individual fitting per station is necessary as the selection of the most similar stations used in the DEM ap-proaches depends on the station topography.The model must also be adapted monthly, as the pretesting procedure (and the training period) depend on the prediction month.
During the postprocessing, we used consistently the square root of the precipitation amount.The CRPS value, which is in the same unit as the observation, refers to this transformation as well.To get an idea of the actual order of magnitude, the values are converted into the original size, in which the precipitation amount is measured in mm 2 .As a first step, 21 forecasts are drawn from the fitted censored Logistic model.
Afterwards, these values and the corresponding observations are squared and the mean CRPS is calculated as for the raw ensemble.

Results
First of all, let us give an overview of the different models.Figure 7 summarizes the average performance of the models over all months.To assess the seasonal performance of the different approaches, the monthly means of the Skill Score are plotted in Figure 8.We use the raw ensemble forecast as reference and depict the results for lead time 1 and lead time 4.For lead time 1, we note that the DEM + Pretest model We have also examined the spatial skill of the models.Therefore, we have compared the station-wise mean CRPS of the models with the one of the raw ensemble.The skill, which is very similar at neighbouring stations, increases in the Alps and is marginal or non-existent in the Swiss plateau.The resulting spatial distribution of the skill looks similar as the mean bias depicted on Figure 3a, the maps are therefore not shown here.Raw ensemble forecasts are often underdispersed and have a wet bias (Gneiting et al. 2005).This holds for the ensemble precipitation forecasts used in this study as well.Figure 9 (right) shows the verification rank histograms for the raw ensemble.Again, we use the data from June 2017 to May 2019 of Dataset 2 for this assessment and depict the results for lead time 1.We note that the histogram for the raw ensemble has higher bins at the left and right marginal ranks and higher bins for the ranks which lie in the first half of 1, 2, ...21.Therefore, it indicates that the ensemble forecasts are underdispersed and tend to have a wet bias.This raises the question whether the results of the DEM + Pretest model are still calibrated.
To be able to evaluate the calibration of the full predictive distribution of the postprocessing models, we do not use the reverse transformation of the square root for this assessment.
where V ∼ U ([0, 1]) and y ↑ Y means that y approaches Y from below.
Figure 9 shows the PIT histograms for the DEM and the DEM+Pretest model.As expected, the PIT histogram of the Pretest model lies between the one of the raw ensemble and the one of the DEM model without Pretest.The first and the last bins, which are higher than the other bins, indicate that the Pretest model is underdispersed.However, it seems much less gravely than for the raw ensemble.Since the remaining tested seasonal approaches produce worse results, this slight miscalibration of the DEM + Pretest model is a disadvantage we have to accept for the moment.Therefore, it appears that the Pretest approach can address the seasonal difficulties of postprocessing ensemble precipitation forecasts.

Conclusion and perspectives
The aim of the presented work was to enable improved probabilistic precipitation forecasts at any place in Switzerland by postprocessing COSMO-E ensemble forecasts enhanced with topographical and seasonal information.During the elaboration of the methodology, a censored nonhomogeneous Logistic regression model was extended step by step.The final model combines two approaches: Firstly, the training data used to fit the regression coefficients is weighted with respect to the similarity between its location of origin and the prediction location.This similarity is measured with a distance based on the topographical variable DEM in a resolution of 31km.Secondly, a seasonal Pretest ensures that the model only postprocesses the raw ensemble forecast when a gain is expected from it -as assessed in the training sample.This final area-covering model is able to outperform a local version of the cNLR model and reduces the mean CRPS of the raw ensemble (depending on the lead time) by up to 4.8%.
The Pretest, which decides whether a postprocessing is worthwhile in a given setting has the disadvantage that the calibration of the resulting forecast is not guaranteed.Yet, although numerous alternative seasonal approaches have been tested, the CRPS of the Pretest model could not be levelled.Therefore, the issue with the slight miscalibration of the final model could not be solved conclusively in this paper.In addition, making a Pretest means that the model must be adjusted twice, which is computationally expensive.But the strength of this approach is that it is fairly universally applicable also to problems outside meteorology -given that one is willing to sacrifice model calibration for accuracy.
There are various directions in which the model could be further expanded: The determination of the number of most similar stations with which the weighted models are trained depends strongly on the dataset.The models use a weighted CRPS-estimator to fit the regression coefficients.The clarification of the asymptotic behaviour of such an estimator would help to determine the ideal number of stations to train with and would make the elaboration of the methodology less sensitive to the data in use.More meteorological information could be added such as covariates describing the large-scale flow.Further meteorological knowledge could also be incorporated by supplementing the DEM-distance with a further distinction between north and south of the Alps, for instance.The scale estimation of the final model is based only on the standard deviation of the ensemble.This estimation could be further extended with additional predictors to ensure that the ensemble dispersion is adjusted with respect to the prediction setting as well (as done for the location estimation in the alternative extension approaches).

Censored GEV and censored SG distribution
The censored GEV and censored SG models are implemented with the R package ensembleMOS by Yuen et al. (2018), whose functions calculate several pieces of the moment estimation internally.For this reason, most parts of the parametrization are given by the structure of the ensembleMOS function, which is used to fit the regression models.For the GEV distribution, we need to estimate the location, the mean and the shape.To link the parameters of the predictive distribution to suitable predictor variables, Scheuerer (2014) (and the ensembleMOS function) add the fraction of zero members in the location, and the ensemble mean difference in the scale, respectively.They are defined as follows: The ensembleMOS function calculates these two quantities internally.To get reasonable results, it is therefore necessary to pass the whole ensemble to the function.To reduce the number of coefficients, the ensemble members x 2 , x 3 , ..., x 21 are specified as exchangeable and summarized by their mean.This leads to the following models for the location and scale: The shape parameter ξ is estimated independently of the ensemble as an intercept.Again, to fit the coefficients the mean CRPS is minimized over the training dataset.
For the censored SG distribution, we need to estimate the shape, the scale and the shift parameter.The shape κ and the scale θ of the distribution are related to the location and the variance in the following way: The value range of the Gamma distribution is non-negative.For this reason, the parameter δ is introduced to shift the distribution to the left and to make the censoring feasible.While this shift parameter is fitted as an intercept, the location and the scale of the distribution are linked to the ensemble in the following way (Scheuerer and Hamill 2015): The ensembleMOS function used to implement the model calculates the ensemble mean x internally.In this case, a reduction of the ensemble to the control member and the ensemble mean would not lead to a nonsense estimator of the ensemble mean.Despite this, the whole ensemble is passed to the function to get comparable conditions as for the censored GEV model.Again, the coefficients are fitted by minimizing the mean CRPS over the training dataset.

Extended Logistic regression
Conventional Logistic regression is dealing with binary data and can therefore not be applied in the postprocessing setting of this paper.Only if an event rather than a measured quantity is predicted, Logistic regression could be applied to postprocess an ensemble forecast.This would be the case if the event "it rains at most q" needs to be described.The probability of this event given that the ensemble forecast, say X, equals x can be modelled as where f (x) is a function of chosen ensemble quantities (Wilks 2009).In what follows, this conditional probability with be denoted P(Y ≤ q|x) for simplicity, by a slight (but common) abuse of notation.Using an appropriate optimization based on the training data, the coefficients of this function f (x) can be determined as for the regression models presented before.The coefficients have to be fitted separately for every q of interest -a large number of parameters emerges.Wilks (2009) suggested to unify the regression functions by using one regression simultaneously for all quantiles.He adds a nondecreasing function g(q) of the quantile to the function f (x) and gets a single equation for every quantile: The coefficients in the functions f (x) and g(q) have to be fitted for a selected set of observed quantiles first, but afterwards, the formula can be applied to any value of q.Wilks (2009) uses g(q) = α √ q, a choice based on empirical tests of different functions.Messner et al. (2014) reformulate Equation ( 29) in the following way: where f (x) = f (x)/α.It follows that the predictive distribution of √ Y is Logistic with location − f (x) and scale 1/α.To add information from the ensemble spread, Wilks (2009) replace 1/α by exp(h(x)) such that the variance of the Logistic distribution can be modelled through a function h(x) of the ensemble as well.
The Extended Logistic Regression (HXLR) model can be implemented with the function hxlr out of the R-package crch by Messner, Mayr and Zeileis (2016).Again, the location of the Logistic distribution of Y is assumed to depend linearly on x 1 and x, the variance on SD(x).The square root transformation of the model presented above is omitted as the whole data has been transformed in this way at the beginning.This leads to the following formula for the probability of Y being smaller or equal than a quantile q: In this paper, the coefficients β 0 , β 1 , β 2 , γ 0 and γ 1 are fitted by minimizing the mean CRPS of the quantiles q 0.1 , q 0.2 , ..., q 0.9 from the training data.
As precipitation is the quantity of interest, we have to make the model feasible by censoring the Logistic distribution after the fitting.

A.2 Bayesian Model Averaging
Bayesian Model Averaging (BMA; Raftery et al. 2005) generates a predictive PDF which is a weighted average of PDFs centred on the single ensemble member forecasts, which have already been bias-corrected (Sloughter et al. 2007).For the estimation of the predictive PDF, every ensemble member is associated with a conditional PDF f k (y|x k ) which can be imagined as the PDF of Y given x k under the assumption that x k is the best prediction of the ensemble.The weighted average is defined as: Sloughter et al. ( 2007) specify the non-negative weight w k as the posterior probability of member k being the best forecast (based on the performance of this ensemble member in the training period).The sum of the weights over all k ∈ K is one, where K is the number of ensemble members.
When modelling f k (y|x k ), the properties of the precipitation amount have to be considered.For this reason, Sloughter et al. (2007) apply a function consisting of two parts: One describes the probability of precipitation as a function of x k , the second specifies the PDF of the amount of precipitation given that it is non-zero.
The implementation of the BMA model is done in the following way: Firstly, the ensemble members have to be bias-corrected with a linear regression: where a k and b k are fitted by minimizing the average square difference between the observations and the corrected members in the training period.
In the following aligns, x k is treated as the corrected version x cor k .The conditional PDF f k (y|x k ) is modelled as: (34) Sloughter et al. (2007) estimate the probability of no precipitation with Logistic regression: where an indicator term for the case that the ensemble member x k is equal to zero is added.To specify the PDF g k (y) of the precipitation amount given that it is non-zero, Sloughter et al. (2007) use a Gamma distribution.The BMA model used in this paper fits the location and the scale of the Gamma distribution as follows: (36) Sloughter et al. (2007) figured out that the variance parameters did not vary much between the ensemble members of their dataset (precipitation) and assumed therefore that they are constant over all members.This is adopted in this paper.
Bayesian Model Averaging is implemented with the R-package ensembleBMA by Fraley et al. (2007).The eponymous function first estimates the coefficients of p k with a Logistic regression over the training data.Then, the coefficients in µ k are estimated by Linear Regression.At last, the coefficients in σ k are fitted by maximum likelihood estimation (Sloughter et al. 2007).To get a model which has a comparable number of coefficients as the other basic models of this paper, the ensemble is reduced to the control member x 1 and the ensemble mean x before applying the BMA model.

A.3 Results
The performance of the basic models is assessed with Dataset 1.A Cross-Validation over the months is applied.Of the 31 months available, one month is removed from the training dataset and the model is trained with the remaining 30 months.After that, the performance is evaluated with the month previously removed.At this point, it is still accepted that the model is trained with future data.The performance of the basic models is evaluated with the CRPS and the PIT histogram.In addition, Skill Scores are used to compare the average CRPS of the models with the one of the raw ensemble.The Skill Scores are averaged per month and per station to get a spatio-temporal impression of the model performance.In this paper, we use the square root of the predictions and the observations for the postprocessing.The assessment here is done with this transformed values such that the magnitude is not the same as in the result section of the main paper, where we reversed the transformation.
The basic models are tested in two versions: A global fit where all stations get the same coefficients and a local fit with separate coefficients per station.For the global version, the model is trained with the data of all stations allowing the later application of the model to all stations simultaneously.The local version requires that each station is trained individually with the past data pairs of this station.To get an idea of the seasonal performance, we calculate the Skill Scores of the postprocessing models for every month m.In our dataset, the months from January 2016 to July 2018 are available.The CRPS serves as measure of accuracy and the raw ensemble is used as reference forecast.We define I(m) = {i ∈ {1, 2, ..., n} : data pair i originated in month m} with cardinality M. Let (x i , y i ) be a forecast-observation pair from month m. x i is denoting a raw ensemble forecast.(F i , y i ) is a same data pair but with a postprocessed forecast.The Skill Score of the postprocessing model for month m is calculated as: 1 In Figure 12, the monthly skills for four of the basic models are depicted, the ones of the remaining three look similarly.We notice that all the postprocessing models have a lack of skill during summer and early autumn: The skill is negative what indicates that the raw ensemble performs better.This will be one of the difficulties to deal with in the extension of the models.The spatial performance of the models can be assessed by having a look at the maps of the station-wise Skill Scores.This time, the mean CRPS values of the postprocessing models respectively the raw ensemble are calculated per station.In Figure 13, the plots of the cNLR, the BMA and the HXLR model are printed.The maps of the other censored nonhomogeneous regression models look similarly.It is evident that the postprocessing performs better in the Alps than in the lowland.For the local version of the cNLR model, the average skill in the lowland is around zero, while the other two local models and all the global models have negative skill there.We will make use of this spatial structure in the later extensions.The PIT histograms of the models are shown in Figure 14.Note that a randomized version of the PIT values has to be used for the censored nonhomogeneous regression models.This is due to the fact that the CDF of a censored distribution has a discrete component.The regression models fitted with the R-package crch provide the flattest histograms (cNLR, cNGR, cNSR and HXLR).The censored Logistic, Gaussian and Student distribution seem to fit distinctly better than the censored GEV (wet bias) and slightly better than the censored Shifted Gamma distribution.The PIT histogram of the BMA model indicates a dry bias in both versions.
Fig. 14: PIT histograms of the basic postprocessing models.

B Extension approaches
This appendix section introduces the alternative extensions for the cNLR model and gives more detail on the approaches presented in the paper.We first present the two basic procedures which have been used to extend the model.Section B.1 deals with analyses of the topographical and seasonal structure of the ensemble prediction error and the models that result from it.In Section B.2, some more basic extension approaches are presented and in Section B.3, all the approaches are compared.
The aim here is to extend the cNLR model by including topographical, seasonal and prediction situation specific information.The term prediction situation describes the by the ensemble given forecasting prerequisites (as for example a classification into wet and dry forecasts based on the ensemble mean and the ensemble standard deviation).In contrast, we use the term prediction setting when referring to all three conditions (topography, season and prediction situation) under which a forecast was created.Two extension procedures are pursued: Location estimation The first approach integrates additional predictor variables in the moment estimation of the Logistic distribution.The location does not longer depend only on the control member x 1 and the ensemble mean x, but also on additional variables describing the topography, the season and the prediction situation of the prediction setting.A location estimator of the following form emerges: where p j are interactions and main effects of the additional variables.The ζ j 's are coefficients and d is the number of additional predictors in use.

Weights
In the second approach, the regression coefficients are fitted by minimizing a weighted version of the mean CRPS which has the following form: where CRPS(F i , y i ; ψ) refers to the CRPS value of data pair (F i , y i ) where the predictive distribution F i depends on the coefficient vector ψ.The weight w i (q) of training data pair (F i , y i ) depends on the similarity between the conditions in which this forecast-observation pair originated and the prediction setting q.Similarities in topography, season and prediction situation have been considered.Lerch and Baran (2018) use such an approach as well, but they have limited themselves to the selection of similar prediction locations calling it a semi-local model.In this study, a time weighting of the training data is added.This is done by attaching weight to data from training days which are similar in seasonality and prediction situation.
The idea behind the weights called NN-weights is the following: The model is trained with a certain number W of the data pairs which originated under conditions most similar to the prediction setting (the nearest neighbours, NN).These data pairs are weighted equally among each other.The similarity of the conditions of origin is measured with a distance d, which depends on the topography, the season or the prediction situation.Let D 1 d (q) ≤ D 2 d (q) ≤ ... ≤ D n d (q) be the ordered distances d(q, q i ) between the actual prediction setting q and the n conditions q i , under which the training data pairs originated.Then, the NN-weights with respect to the distance d are defined as: These weights are a function of the prediction setting q and the conditions of origin of the whole training data set, as the remaining conditions determine, if a setting q i is under the W nearest neighbours.If several training data pairs originated under conditions with the same distances d(q, q i ), then all the corresponding data pairs are used.
The extensions of the cNLR model are compared with the local and the global fit of this very model, the assessment is based on Dataset 1.The target of this study is to develop an area-covering postprocessing method.Therefore, the extended models are trained with data where the predicted month and the predicted station are left out.This simulates the situation where postprocessing must be done outside a weather station, i.e. without past local measurements.The training period is the last year (12 months) before the prediction month.Consequently, the model performance can only be assessed with the data from 2017 and 2018.

B.1 Structure ensemble prediction error
This section presents dependencies of the ensemble prediction error on the topography, season and prediction situation.For this purpose, the precipitation amount (Y s,t ) s∈D,t∈T at location s on day t is expressed as a function f (x s,t , s,t) of the location, the day and the corresponding ensemble forecast x s,t .D ⊆ R 2 is a set with the coordinates of Switzerland and T is a set of dates.To find a suitable function f (x s,t , s,t), the ensemble prediction error is described by approximating the bias of the ensemble mean as a function τ(x s,t , s,t) : R × D × T → R, such that f (x s,t , s,t) = x s,t + τ(x s,t , s,t). (41)

B.1.1 Topographical structure
To get an idea of the topographical structure of the ensemble mean bias, the station-wise means of y s,t − x s,t are illustrated on a map (Figure 3 in the main paper, we use the data from 2016 out of Dataset 1).It seems reasonable to assume that the bias of the ensemble mean depends on the distance between the prediction location and the Alps.For this reason, an "Alp-line" is defined such that it is possible to quantify the distance between a prediction site and the Alps.To define a line, the latitude of the 20 highest stations of Dataset 1 is modelled as a linear function of the longitude.
A linear regression illustrated in Figure 15 (left) provides the coefficients such that the following estimator of the "Alp-line" a emerges (l is the longitude): â : R → R, l → 45.78010 + 0.08678 • l. (42) The signed distance to the Alps of location s is then expressed through: with p A (s) the orthogonal projection of s onto the "Alp-line" â. Figure 15 (right) depicts a scatter-plot of the station-wise means of y s,t − x s,t versus the distance to the Alps of these same stations.To find a function which describes the dependency between those two, the performances of different polynomial fits are compared with F-Tests.There is a significant improvement until the fifth degree, such that the topographical dependency of τ is expressed as The fitted τ(s) is illustrated with the red line in Figure 15 (right).
Fig. 15: Left: Estimator â of the "Alp-line", Right: Scatter-plot of the station-wise means of y s,t − x s,t versus the distance to the Alps of these same stations.
An approach using the distance to the Alpline in the location estimation depends on a lot of coefficients: The number of highest stations used in the fitting of the "Alp-line" has to be determined and the "Alp-line" itself depends on two coefficients.The final location estimation needs seven coefficients more.For this reason, the alternative approach based on the DEM variable has been introduced, the motivating steps are shown in the main paper.
On the one hand, the variable DEM in the resolutions 31km and 15km were added as predictors p j in the location estimation of the censored Logistic distribution, resulting in the following four models: Fig. 17: Station-wise means of y s,t − x s,t versus the distance to the Alps of these same stations after splitting the data into summer and winter months such as dry and wet forecasts.
In the definition of the τ function in ( 53), a first approach to deal with the seasonal difficulties has been introduced.The seasonal extension approach called Sine is based on this, the location of the censored Logistic distribution is fitted by where m is the by π 12 multiplied number of the predicted month and x : sin(m) is the interaction between sin(m) and x.While the Sine model extends the location estimation of the cNLR model, weighted extensions based on the season and the prediction situation have been tested as well.The first weighted approach (called Sit) is another attempt to integrate the observations from Section B.1.2into the model.For this, the data is split into four parts.The allocation Sit of a prediction setting q with ensemble forecast x q to one of these four parts depends on the season and prediction situation of q: Dry summer for q in May-October and x q − SD(x q ) ≤ 0.1, Wet summer for q in May-October and x q − SD(x q ) > 0.1, Dry winter for q in November-April and x q − SD(x q ) ≤ 0.1, Wet summer for q in November-April and x q − SD(x q ) > 0.1. (55) The Sit model puts weight only on training data originating from the same Sit as the prediction setting.Let Sit(q i ) be the allocation of training data pair (F i , y i ) and Sit(q) the one of the predicted setting.The following weights are used in (39): This weighting does not use NN-weights as defined in ( 40), but the next does.The approach named Ensemble characteristics is inspired by the paper of Lerch and Baran (2018).Here, the model training is done with data from days which had a similar ensemble prediction situation.This is quantified using the ensemble mean and the ensemble standard deviation.These two ensemble quantities vary temporally and spatially, which is why the training data cannot be combined into groups.The ensemble similarity between the prediction setting q and the conditions q i , under which training data pair i emerged, is measured with the following distance: The weights used in the Ensemble characteristics approach are defined as in (40) with d(q, q i ) = d enschar (q, q i ).
Finally, the Pretest approach of the main paper is combined with the Sit approach, resulting in Pretest and wet/dry situation.This approach combines the Pretest with the wet/dry part of the Sit approach.It works like the Pretest approach, but first splits the whole data into wet and dry prediction situations.

B.2 Basic extensions
The extension approaches from Chapter B.1 are compared with some more basic or general models presented in the following.The first two approaches extend cNLR by adding topographical predictors to the location estimation, the third is a weighted approach based on the euclidean distance.The fourth approach is a more basic way to introduce some seasonality into the models.
Location estimation: Principle Component Regression (PCA) When extending the location estimation of the cNLR model by topographical variables, two issues have to be faced: The number of available variables is large and the variables are not independent of each other.One possible solution approach is Principle Component Regression (see Maitra and Yan 2008).The idea is to reduce the matrix of possible predictor variables by summarizing the variables to p Principle Components (PC), which are linearly independent of each other.The components correspond to the eigenvectors with the p largest eigenvalues from the spectral expansion of the variance-covariance matrix of the predictor variables, each is a linear combination of the predictors.In this paper, four Principle Component Regressions are applied to the data, they differ in the number of principle components in use: -PCA95: Use principle components explaining 95% of the variance between the topographical variables.This needs p = 34 components.
Scores associated with the p components are added to the covariables for location estimation: with pcs i denoting the score associated with the i-th principle component.
Location estimation: Add interactions with the glinternet algorithm (GLI) Another approach to extend the cNLR model is by adding topographical variables both as additive factors and combined as interactions.To decide which of the variables should be used, the glinternet algorithm is applied -a method for learning linear pairwise interactions satisfying strong hierarchy introduced by Lim and Hastie (2013).
The workhorse of glinternet is a group-lasso regularization, which is a more general version of the lasso.The lasso regularization was introduced by Tibshirani (1996) for a regression set up with the data pairs (x i , y i ), where x i = (x 1 i , x 2 i , ..., x d i ) are the regressor variables and y i the response for observation i with i = 1, 2, ..., n.The ordinary least square (OLS) estimates, which are obtained by minimizing the residual squared errors, often lead to estimates with a low bias but large variance.By reducing the number of predictors, the variance can be decreased and a model which is easier to interpret arises.In a group-lasso regularization, groups of regressor variables are allowed to be selected into or out of a model together.We assume that there are g groups of predictor variables and denote the feature matrix for group k as x k .y denotes the vector of observations.The definition of the group-lasso estimates ( α, β ) with β = ( β1 , β2 , ..., βg ) T by Lim and Hastie (2013) where ||x|| 2 of the vector x 2 i .The parameter λ controls the amount of regularization.Decreasing λ reduces the regularization and increases the amount of predictor variables in use.The γ's allow to penalize the groups of variables to different extents.
The glinternet algorithm of Lim and Hastie (2013) fits a group-lasso on a candidate set of interactions and their associated main effects.For the regularization parameter λ , a grid of values is used.The algorithm starts with the regularization parameter λ = λ max for which all estimates are zero.Then, λ is decreased with the result that more variables and their interactions are added to the model.The model is obeying strong hierarchy such that an interaction can only be present if both of its main effects are present as well.
Implementation: The first argument of the R-function glinternet is a matrix with possible predictor variables, here specified as the topographical variables.The algorithm examines the influence of these variables on a response, which is specified as a second argument of the function.Since we want to investigate the influence of the topography on the prediction error, we select the observed precipitation amount minus the ensemble mean (y s,t − x s,t ) as response variable.The number of variables to find is specified as 20.Again, only the data from 2016 out of Dataset 1 is used to find the predictors.The implemented model tries first to use all main effects and interactions proposed by glinternet as p j .If it runs into numerical optimization problems, λ is increased such that some variables fall out.This procedure is repeated until a model fit is achieved.The 9 first sets of the output of glinternet are used, they are presented in Figure 18.It stands out, that glinternet also attaches the most importance to DEM 31km and DEM 15km , which supports the DEM-variable approach introduced in the main paper.

Weights: Euclidean distance (3DIM and SPA)
This weighted approaches select the training data depending on the euclidean distance between its station of origin and the prediction location.The models are fitted with the training data from the L closest stations (as in the weighted DEM models).The spatial euclidean distance between the prediction location s and the location s j of station j is defined as: d spatial (s, s j ) = (lat(s) − lat(s j )) 2 + (long(s) − long(s j ))  The NN-weights for the SPA and the 3DIM approach are defined as for the weighted DEM models (see Definition (50)) using the distances d spatial (s, s j ) and d 3DIM (s, s j ).Such an approach based on the euclidean distance is also proposed by Lerch and Baran (2018).

Basic seasonality
Principally, the extension models are fitted monthly with the previous year of training data.This would refer to the illustration top left in Figure 19.For comparison, the models are also trained with only one or two months from the same season.These basic seasonality approaches are applied in three versions: The first trains with the month before the prediction month (Figure 19, bottom left), the second with the same month as the prediction month in the year before (top right) and the third with both of these months (bottom right).

B.3.2 Topographical extensions with advanced seasonality
This section compares the advanced seasonal extensions of the DEMw31 and the SPA model.Both models are using NN-weights depending on the prediction location.The regression coefficients are fitted with the training data from a given number L of stations, which are most similar to the prediction location.The following list explains how the different combinations of the DEMw31 model and the advanced seasonal approaches are implemented.
Ensemble characteristics: For a model which combines the DEMw31 weights with the ensemble characteristic weights, a new distance depending on d DEM31 (s, s j ) and d enschar (q, q i ) has to be defined.For this reason, both distances are ranked: -The station j of the training data set with the r-smallest distance d DEM31 (s, s j ) to the prediction location s gets rank R d DEM31 (s, s j ) = r.If some distances are equal, middle ranks are used.-The training data pair i originating under conditions with the r-smallest distance d enschar (q, q i ) to the prediction setting q gets rank R d enschar (q, q i ) = r.
In the case of equal distances, middle ranks are used.
This two ranks are summed to build the following distance: d DEM31+enschar (q, q i ) = R d DEM31 (s, s i ) + R d enschar (q, q i ), where s is the location of the predicted setting q and s i the one where the training data pair i originated.This distance is used to generate NN-weights as in (40).The model, which has to be fitted for every day separately, uses training data from the W = 1 000, 2 000, ..., 6 000 most similar conditions of origin.
Sit and Pretest wet/dry: These two models have to be fitted daily as well.First, the training data is reduced to the data pairs from the L = 10, 20, ..., 80 stations with the smallest distances d DEM31 (s, s j ) to the predicted location.Afterwards, the seasonal approaches are applied as described in Chapter B.1.2.
Sine and Pretest: The implementation of these models works analogous but the fit can be done per month.
Generally, the advanced seasonal extensions of the SPA model are implemented in the same way where d DEM31 (s, s j ) is replaced by d SPA (s, s j ).
Since several best results were obtained with the boundary numbers of the parameters L and W , the variety of the numbers in use had to be widened.2 The table in Figure 21    To compare the monthly performances of the advanced seasonal extensions, the monthly Skill Score is depicted in Figure 22.Again, the CRPS is used as measure of accuracy and the raw ensemble as reference forecast.The figure is limited to the seasonal extensions of DEMw31.On the left, the DEMw31 model without seasonal extension is given as a baseline.We note that the Pretest model is the only approach without negative skill during summer.In return, the timeline of this approach seems to lose the peaks in other months.

Fig. 1 :Fig. 2 :
Fig. 1: The 140 stations of Dataset 1 (points) and the additional 327 stations of Dataset 2 (triangles) Fig. 3: The station-wise means of the observations minus the ensemble means for the data from 2016 of Dataset 1 Figure 4 shows the monthly skill of the global and the local cNLR model.We use the mean over all the stations from Dataset 1 and depict the months between January 2017 and July 2018 such that we have 12 months of training data in each case.A positive skill of 10% means for example that the mean CRPS value of the raw ensemble is reduced by 10% through postprocessing, a negative skill indicates analogously that the mean CRPS value is higher after the postprocessing.The global model has negative skill in February and March 2017 and between June and October 2017.The values are especially low in July and August 2017.The local model has a positive skill in almost all months, the postprocessing with this model decreases the forecasting performance only between June and September 2017.We use these results as a first indication that the postprocessing of ensemble precipitation forecasts is particularly challenging in summer and early autumn.

Fig. 4 :
Fig. 4: The monthly skill of the local and the global cNLR model which compares the monthly mean CRPS value of the model with the one of the raw ensemble, the values describe the reduction or the increase of the mean CRPS value in percent

Fig. 5 :
Fig. 5: The station-wise skill of the global cNLR model for July 2017 and February 2018 which compares the mean CRPS value of the model with the one of the raw ensemble, the values describe the change in the mean CRPS value in percent 2017 to July 2018 (seasonally balanced) from Dataset 1 and choose the number resulting in the lowest mean CRPS.For the cNLR DEM model (no Pretest) we determine L = 40.For the cNLR DEM+PT model, we combine the different pretesting splits with the same numbers for L. The cNLR DEM+PT model with the lowest mean CRPS value uses Pretest 3 and L = 40.

Fig. 6 :
Fig. 6: The mean CRPS values for the cNLR DEM (+ Pretest) models comparing different numbers for L and the different pretesting splits Figure 7 depicts the mean CRPS values for the different postprocessing approaches and lead times.For lead time 1, the global cNLR model is able to reduce the mean CRPS value by 2.3%.A further improvement is achieved by the local and the DEM model, which show equivalent performances and reduce the mean CRPS value by 4.5% compared to the raw ensemble.Even slightly better results are delivered by the DEM + Pretest model which reduces the mean CRPS value by 4.8%.The skill of the global model decreases with increasing lead time.While the skill is still positive for lead times 1, 1.5 and 2, the model performs roughly equally as the raw ensemble for lead time 2.5.From lead time 3, the mean CRPS value of the global model is even higher than the one of the raw ensemble.The local and the DEM model perform about the same for lead times between 1 and 2.5, for lead times above 3 the DEM model performs slightly better.The DEM + Pretest model performs best for all lead times.It reduces the mean CRPS value between 4.8% for lead time 1 and 2.0% for lead time 4. It is noticeable that the DEM+Pretest model achieves a near constant improvement in the mean CRPS of approx.0.07 for all lead times.Relatively -i.e. as a Skill Score -this corresponds to less and less of the total forecast error.We note additionally, that the improvement which is achieved through the extension of the DEM model with the Pretest depends on the lead time.While the Pretest reduces the mean CRPS of the DEM model only by 0.4% for lead time 1, the obtained reduction corresponds to a proportion of 1.7% for lead time 4.

Fig. 7 :
Fig. 7: Mean CRPS values for the raw ensemble and the different postprocessing models dependent on the lead time, the assessment is based on the data from June 2017 to May 2019 of Dataset 2

Fig. 8 :
Fig. 8: Reduction and increase (in %) of the monthly mean CRPS value of the raw ensemble by the different postprocessing approaches, the assessment is done with the data from June 2017 to May 2019 of Dataset 2 Additionally, we have to use a randomized version of the PIT as our predictive distribution has a discrete component (Thorarinsdottir and Schuhen 2018): lim y↑Y F(y) +V F(Y ) − lim y↑Y F(y) ,

Fig. 9 :
Fig. 9: The PIT histograms for the DEM and DEM + Pretest models and the verification rank histogram for the raw ensemble forecasts (the lead time in use is 1)

Fig. 10 :
Fig. 10: Maps depicting the acceptance behaviour of the DEM+Pretest model.The model evaluates for each of the 327 stations and 24 months of the test dataset if a postprocessing the raw ensemble seems worthwhile, the maps show when and where the model does (light point) and does not (dark point) postprocess the raw ensemble (the lead time in use is 1)

First
, the overall performances of the local and global versions of the basic models are compared.As proposed by Thorarinsdottir and Schuhen (2018), a bootstrapped mean of the CRPS values is calculated per model.That means that repeatedly a sample of the same size as the test dataset is drawn (with replacement) from the CRPS values of the test dataset.In this paper, such a sample is drawn 250 times, such that 250 values for the mean CRPS emerge.These values are represented in the Boxplot of Figure 11.The green boxes represent local model fits, the orange global ones and the blue box represents the bootstrapped mean CRPS values of the raw ensemble.It clearly appears that the local versions of the basic models perform better in terms of the mean CRPS value.The censored nonhomogeneous regression models perform best in both fits.Especially for the local version, the cNLR, cNGR and cNSR models perform better than the models using a GEV and a Shifted Gamma distribution.The BMA model has the highest mean CRPS value, which could possibly be due to the fact that the model only provides two ensemble quantities with densities.The HXLR model has the second highest mean CRPS value, it seems that it performs worse than the full version of cNLR.

Fig. 11 :
Fig. 11: Boxplot of the bootstrapped mean CRPS values for the basic models.

Fig. 12 :
Fig. 12: Monthly Skill Score.Compares the monthly mean CRPS values of the basic postprocessing models with the ones of the raw ensemble.

Fig. 13 :
Fig. 13: Station-wise Skill Score.Compares the station-wise mean CRPS values of the basic postprocessing models with the ones of the raw ensemble.

Fig. 19 :
Fig. 19: Illustration of the basic seasonal approaches for January 2018.
compares the overall performances of the advanced seasonal extensions of the DEMw31 and the SPA model.The best performances are achieved by the DEMw31 model combined with the Pretests 12 (pretesting with the month before, Pretest 2 in main paper) and Pretest 1+12 (pretesting with the month before and the same month of the year before, Pretest 3 in main paper).These models level with the local version of the cNLR model and provide an important outcome: We are able to reproduce the mean CRPS value of the local cNLR model with an area-covering method.Again, we assess here with Dataset 1 and the square root transformation of the precipitation amount.In the later validation with Dataset 2 and after inverting the square root, the DEMw31+Pretest model was able to outperform the local model.It was possible to show the same for Dataset 1 after inverting the square root transformation.

Fig. 21 :
Fig. 21: Mean CRPS values of the advanced seasonal extensions of the DEMw31 and the SPA model, based on test data from 2017 and 2018.The number in brackets is denoting with which W (Ensemble characteristics) respectively L (other approaches) the models perform best.

Fig. 22 :
Fig. 22: Monthly means of the Skill Score comparing the different seasonal extensions of the DEMw31 model.

Table 1 :
The properties of Dataset 1 and Dataset 2 For this purpose, the year of training data is first reduced to the data pairs from topographically similar stations, whereby the similarity is measured with the distance d DEM defined in Equation (16).Afterwards, this training dataset is split into two parts: Traintrain and Traintest.The model is adapted a first time with the dataset Traintrain.Afterwards, the performance of this model is assessed with the second part (Traintest) by comparing the mean CRPS of the model with the mean CRPS of the raw ensemble.