Assessing uncertainties in the building of ensemble RCMs over Spain based on dry spell lengths probability density functions
DOI: 10.1007/s00382-012-1381-5
- Cite this article as:
- Giraldo Osorio, J.D. & García Galiano, S.G. Clim Dyn (2013) 40: 1271. doi:10.1007/s00382-012-1381-5
Abstract
Spain is one of the European countries with the most environmental problems related to water scarcity and droughts. Additionally, several studies suggest trends of increasing temperature and decreasing rainfall, mainly for the Iberian Peninsula, due to climate variability and change. While Regional Climate Models (RCM) are a valuable tool for understanding climate processes, the causes and plausible impacts on variables and meteorological extremes present a wide range of associated uncertainties. The multi-model ensemble approach allows the quantification and reduction of uncertainties in the predictions. The combination of models (RCM in this case) generally increases the reliability of the predictions, although there are different weighting methodologies. In this paper, a strategy is presented for building non-stationary PDF (probability density function) ensembles, with the aim of evaluating the spatial pattern of future drought risk for an area. At the same time, the uncertainty associated with the metric used in the construction of the PDF ensembles is assessed. A comparative study of methodologies based on the Reliability Ensemble Averaging (REA) method is proposed, in which the REA factors are assessed using two performance measures: the Perkins score and the Kolmogorov-Smirnov test. Evaluating the sensitivity of the methodologies used in the construction of ensembles, as proposed in this paper, does not completely eliminate uncertainty, but it allows a better understanding of the sources and magnitude of the uncertainties involved. Despite the differences between the spatial distribution results from each metric (which can be in the order of 40 % in some areas), both approaches point to a plausible, significant and widespread increase throughout continental Spain of the mean value of annual maximum dry spell lengths (AMDSL) between the years 1990 and 2050.
Finally, the more parsimonious approach for building ensemble PDF, based on AMDSL in peninsular Spain, is identified.
Keywords
Climate change, Regional climate models, Maximum dry spell length, Non-stationary analysis, Ensemble PDF
1 Introduction
Climate variability and change present deep impacts over both human socioeconomic activities and ecosystems. More severe and frequent hydro-meteorological extreme events suggest that several hydrological variables are reaching critical thresholds, responsible for sudden and negative impacts rather than a gradual change. Impacts on economic activities, biologic and human health are related with climate variability and change; these impacts span from a damaging effect on farm production, the increases of climate refugees, deep disorder to ecosystems, water resources scarcity and faster spread of vector diseases (Räisänen and Palmer 2001; Tebaldi and Sansó 2009; WHO 2009).
From observed datasets, a rainfall increase was observed in the north of Europe while there was a decrease in the West Mediterranean Basin (Paredes et al. 2006). These rainfall trends have been confirmed by climate model projections (Räisänen and Palmer 2001; Giorgi and Mearns 2002; Giorgi et al. 2004; Tebaldi et al. 2005). Considering temperature data from Global Climate Models (GCM) for the reference period 1961–1990 and the projection for 2071–2100, several authors (Giorgi et al. 2004; Tebaldi et al. 2005) have identified a generalized rise of temperatures over the Iberian Peninsula, especially in summer. Furthermore, a plausible decrease of rainfall greater than 15 % over the Iberian Peninsula for the wet season was identified (Giorgi et al. 2004). There have been several efforts to inter-compare the results of Regional Climate Models (RCMs) over Europe. Among them are the works of Jacob et al. (2007) and Christensen and Christensen (2007), which assessed the ability of RCMs, within the context of the European project PRUDENCE, to simulate the long-term mean climate and the inter-annual variability of near-surface temperature and precipitation.
GCM are considered the only tools that can take into account the complex set of processes which control the climate. Nowadays, GCM are the most valuable source of data regarding future climate change at global scale, and about changes in the frequency and severity of extreme events (Murphy et al. 2004; Sánchez et al. 2009). However, the models are subject to errors, now and in the future, and this gives rise to uncertainties. The uncertainties identified in climate modelling are usually associated with the initial conditions, boundary conditions, parameterization and, finally, structural uncertainties (Tebaldi and Knutti 2007). They are principally caused by unreliable projections of greenhouse gases (GHG), which are closely tied to doubts regarding world population growth, future economic and technological development, and progress in international cooperation agreements, as well as by a lack of understanding of the climate system, the intrinsic randomness of the processes involved and current modelling constraints, among other causes (Sánchez et al. 2009; Tebaldi and Sansó 2009). Fitting a particular model to reproduce the mean values, the variability and the trends in observed data makes sense in order to reduce its own uncertainties. Although confidence in model simulations has notably increased, a suitable simulation of observed data is no guarantee of the reliability of the model projections (Tebaldi and Knutti 2007). Consequently, a strategy is needed to assess the uncertainty of climate projections, because stakeholders often face several simulations of unknown modelling quality. Given these uncertainties, a probabilistic forecast seems more valuable than a deterministic approach. Therefore, assessing the uncertainty of projections with a multimodel ensemble approach provides useful tools for planning adaptation and mitigation strategies.
If a qualitative definition of drought from a social point of view is used, it would be possible to say that drought is a recurrent phenomenon in Spain. The sentence is supported by the fact that drought threatens the supply and irrigation systems, the main water user in Spain. Nevertheless, natural cycles of drought should not seriously affect the native vegetation, which has already adapted to the complex environment. In this context, it could be questioned if the “social perception” of more frequent and stronger droughts is caused by a real trend to increase, or is a misunderstanding caused by the increasing human pressures on water resources. If the RCM predict real changes in droughts, the next question should be to ask if the results depend on weights of members in the ensemble. This work tries to provide answers to both questions.
In accordance with the works of several authors (Sánchez et al. 2009; Tebaldi and Knutti 2007), a multimodel ensemble approach is needed for the evaluation of uncertainties. Tebaldi and Knutti (2007) suggest that model combination generally increases the projection reliability. The main assumption of the ensemble approach is the independence between models; under it, the uncertainty should diminish as the number of available models increases. Nevertheless, this assumption may not hold, because various models share processes or methodologies (spatial resolution, parameterization, observed dataset used to fit the model, numerical methods and their deficiencies, etc.). The assumption of independence between models implies that the random errors tend to cancel out; however, systematic biases shared by various models will be inherited by the ensemble. Another important issue is the “opportunity ensemble”: non-scientific aspects define the ensemble members and their characteristics, giving rise to a sample of models that is neither random nor systematic (Tebaldi and Knutti 2007). Finally, scientific groups usually try to fit their models to reproduce the observed datasets; therefore the models are not really designed to span the whole uncertainty range.
Basically, the combination of members in the multimodel ensemble can be done in two ways. The first one is by neglecting the different reliability of models, and weighing them all equally (Murphy et al. 2004). The other way is to use weighted averages, where the model weight depends on some measure of performance. The main question is to define the metric for measuring model performance, because there is not just one single way of assessing this. Several works have faced the problem from different points of view (Räisänen and Palmer 2001; Giorgi and Mearns 2002, 2003; Dettinger 2005; Tebaldi et al. 2005; Sánchez et al. 2009). But sometimes, the conclusions can be opposite due to the uneven spread of results (Tebaldi and Knutti 2007).
The selection of a particular metric is pragmatic and mainly subjective. The Reliability Ensemble Averaging (REA) method (Giorgi and Mearns 2002), has been selected in the present work to compute the models weight using the empirical probability distributions from observed data. The REA method rewards with higher weight both models with great skill to simulate the observed data distribution, and models closest to the “ensemble consensus” in the future; while the models farthest from these two criteria are penalized. Weigel et al. (2010) highlight that equally weighted multimodels on average outperform the single models, and that projection errors can be further reduced by applying model weights according to some measure of performance.
Xu et al. (2010) updated the definition of the REA method, dropping the convergence criterion and including multiple variables and statistics in the formulation. The criticism of the convergence criterion was that it could artificially narrow the ensemble PDF, so that some tails and extreme values would be lost. The criticism of using a single variable (for example precipitation) was that it could give a weak assessment of the model’s reliability. The present work overcomes this artificial narrowing by using future empirical PDF, built from all RCM data, to compute the convergence factor. Finally, the annual maximum dry spell lengths (AMDSL) are extreme values, which should be fitted separately from other variables, because they are an independent population.
The issues previously exposed encourage the use of RCM to build ensemble probability distributions for studying the drought phenomenon. The GCM are not able to recognize regional heterogeneities of climate, therefore they are not suitable for building small scale projections, which are needed in impact studies (Paeth et al. 2011). The dynamical downscaling provided by RCM could be used to undertake this task at basin scale (Karambiri et al. 2011). Sánchez et al. (2009) used data from RCM forced by ERA40, to build ensemble CDF (Cumulative Distribution Functions) of seasonal rainfall, considering a regional approach over Europe. García Galiano and Giraldo Osorio (2010) presented a good example of the application of RCM data at basin level to study impacts on extreme events of rainfall in the Senegal River Basin (West Africa).
In the present work, the chosen hydrologic variable is the AMDSL, considering a dry spell as the number of consecutive days with rainfall below a threshold. The threshold was set at 1 mm/day. The dry spells have great interest because they are directly related to pronounced dryness of Spain’s landscapes and rainfall zone gradient, which have affected the historic decisions and discussions about the nationwide planning of water resources. A pioneering work about dry spell analysis in Spain was presented by Martín-Vide and Gómez (1999), who describe the regionalization of Spain’s territory based on fitted Markov chains to time series of dry spell. Even though the fit was not entirely satisfactory (it was not good enough on southeast of Spain), the analysis identified a clear zonal gradient of dry spell lengths, which is increasing southward. Sánchez et al. (2011) have performed a dry spell analysis using observed and simulated precipitation grids from the Iberian Peninsula, with both RCM and GCM. According to that work, the drought periods will increase throughout almost all of Spain’s territory. Moreover, the change will be greater in the south of the Peninsula, which will increase the latitudinal gradient compared with the current climate.
In contrast to Sánchez et al. (2011), the current work focuses on ensemble PDF building in order to analyze plausible trends of maximum dry spell. The first objective in this work is to outline the spatial distribution of several statistics estimated from the ensemble PDF, so they are built on all sites in the study area. The dynamic assessment of trends in the study variable (in this case of AMDSL) is enabled by fitting non-stationary probabilistic models whose parameters change over time. This is the main difference from other works, where the time series are split in time-windows to compute the change between slices, or are directly managed assuming stationary parameters.
The definition of REA (Giorgi and Mearns 2002) has been used as the metric to compute the member weights in the ensemble, in order to combine the fitted non-stationary distributions. The REA criteria were estimated from empirical probability distributions of AMDSL in 1961–1990 (the model performance criterion), and 2021–2050 (the model convergence criterion) time periods.
The second objective of the work is to quantify the influence over results, due to the selected metric to compute the member weights in the ensemble. Thus, each REA component has been computed taking into account two performance measures: the Perkins score (S_{SCORE}, Perkins et al. 2007), and the p value from the two-sample Smirnov-Kolmogorov (TSSK) goodness of fit test. The proposed methodology tries to span the whole range of uncertainty, through the building of ensemble PDF. The empirical distributions of AMDSL are used to compute the member weights in the REA method. The aim is to find the suitable tuning that enables the ensemble PDF to be able to better reproduce the observed distribution of data, instead of only trying to simulate their mean or the standard deviation. It must be highlighted that the proposed methodology enables a dynamic performance of ensemble PDF, because it inherits the simulated variability from each ensemble member, through the fitting of non-stationary distributions to AMDSL time series.
2 Study area and datasets
2.1 Study area
The climate of most of continental Spain is characterized by a dry period in July and August, which is particularly intense in the southern half of the Iberian Peninsula, and a period of rainfall during the winter months (DJF), mainly on the Cantabrian Coast (Autonomous regions of Galicia, Asturias, Cantabria and Basque Country). The Levante area (Autonomous regions of Murcia and Valencia) presents a bimodal rainfall cycle, with high values in April–May and October–November, and dry periods in winter and especially in summer. Several authors (Paredes et al. 2006) explore the causes of the spatiotemporal variability of precipitation in Spain.
2.2 Datasets
Datasets of daily rain: observed data (Spain02/v2.1) and selected RCMs from ENSEMBLES project
Name | Institute | GCM | RCM | Temporal cover |
---|---|---|---|---|
Spain02/v2.1 | UC^{a} | Observed data | | 1950–2007 |
C4IRCA3 | C4I^{b} | HadCM3Q16 | RCA3 | 1951–2099 |
CNRM/RM5.1 | CNRM^{c} | ARPEGE RM5.1 | Aladin | 1950–2100 |
DMI/ARPEGE | DMI^{d} | ARPEGE | HIRHAM | 1951–2100 |
DMI/BCM | DMI | BCM | DMI-HIRHAM5 | 1961–2099 |
DMI/ECHAM5-r3 | DMI | ECHAM5-r3 | DMI-HIRHAM5 | 1951–2099 |
ETHZ/CLM | ETHZ^{e} | HadCM3Q0 | CLM | 1951–2099 |
METO_HC/HAD | HC^{f} | HadCM3Q0 | HadRM3Q0 | 1951–2099 |
ICTP/RegCM3 | ICTP^{g} | ECHAM5-r3 | RegCM3 | 1951–2100 |
KNMI/RACMO2 | KNMI^{h} | ECHAM5-r3 | RACMO | 1950–2100 |
METNO/BCM | METNO^{i} | BCM | HIRHAM | 1951–2050 |
METNO/HadCM3Q0 | METNO | HadCM3Q0 | HIRHAM | 1951–2050 |
MPIM/REMO | MPI^{j} | ECHAM5-r3 | REMO | 1951–2100 |
OURANOS/MRCC4.2.1 | OURANOS^{k} | CGCM3 | CRCM | 1951–2050 |
SMHI/BCM | SMHI^{l} | BCM | RCA | 1961–2100 |
SMHI/ECHAM5-r3 | SMHI | ECHAM5-r3 | RCA | 1951–2100 |
SMHI/HadCM3Q3 | SMHI | HadCM3Q3 | RCA | 1951–2100 |
UCLM/PROMES | UCLM^{m} | HadCM3Q0 | RRCM | 1951–2050 |
3 Overview
The empirical PDF for 2021–2050 were used to compute the model convergence criterion (R_{D}) through the convergence analysis. Using both R_{B} and R_{D}, the reliability factor (R) and the normalized reliability factor (Pm) were obtained. Following the non-stationary analysis in the flow chart, the time series between 1961 and 2050 were used to fit non-stationary PDF for each RCM using GAMLSS. These non-stationary PDF were later used to build the on-site ensemble PDF, with the normalized reliability factor Pm as the weighting factor.
3.1 Time series of annual maximum dry spell length (AMDSL)
Considering all the sites defined in continental Spain (Fig. 1a), the time series of length of dry spells (for P < 1 mm/day), were obtained from the daily observed rainfall dataset. In the study area, maximum lengths of dry spells greater than 1 year (365 days) were not identified. Therefore, a maximum dry spell length for each year (or annual maximum dry spell length, AMDSL), could be considered.
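The extraction of an AMDSL series from daily rainfall can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy record are hypothetical, and it assumes that spells are counted within each calendar year (the text does not say how spells spanning two years were handled).

```python
from itertools import groupby

def annual_max_dry_spell(daily_rain_mm, threshold=1.0):
    """Longest run of consecutive days with rainfall below `threshold` (mm/day)."""
    longest = 0
    for is_dry, run in groupby(daily_rain_mm, key=lambda p: p < threshold):
        if is_dry:
            longest = max(longest, sum(1 for _ in run))
    return longest

def amdsl_series(rain_by_year, threshold=1.0):
    """Map {year: [daily rainfall in mm]} to {year: AMDSL in days}.

    Assumption: each year is processed on its own, so a spell that
    crosses 31 December is split between the two years.
    """
    return {year: annual_max_dry_spell(days, threshold)
            for year, days in sorted(rain_by_year.items())}

# Toy record with hypothetical values (not taken from Spain02)
toy = {
    1961: [0.0, 0.2, 5.1, 0.0, 0.0, 0.9, 3.0],
    1962: [2.0, 0.0, 1.0, 0.0],
}
print(amdsl_series(toy))  # {1961: 3, 1962: 1}
```

A day with exactly 1 mm counts as wet here, because the text defines a dry day by P < 1 mm/day.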
4 Computation of reliability factors
The computation of the weighting factors R_{B} and R_{D} is based on empirical cumulative distribution functions (e-CDF) and on two quantitative measures of the agreement between probability functions. The Weibull plotting position formula was used to compute the empirical quantiles of the e-CDF. The first metric is the p value from the well-known two-sample Smirnov-Kolmogorov goodness of fit test (hereafter TSSK test; Sheskin 2000; Gibbons and Chakraborti 2003). The second metric is the skill score (S_{SCORE}) proposed by Perkins et al. (2007), which measures the common area under the PDF curves.
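For reference, the Weibull plotting position assigns the i-th smallest of n values the non-exceedance probability i/(n + 1). A minimal sketch follows; the function name and the sample values are hypothetical.

```python
def weibull_ecdf(sample):
    """Return the sorted values and their Weibull plotting positions i / (n + 1)."""
    x = sorted(sample)
    n = len(x)
    p = [i / (n + 1) for i in range(1, n + 1)]
    return x, p

# Four hypothetical AMDSL values (days)
x, p = weibull_ecdf([42, 17, 30, 55])
for value, prob in zip(x, p):
    print(value, round(prob, 2))
```

With this formula no observation receives probability 0 or 1, which keeps both tails of the e-CDF open.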
For the model performance criterion R_{B}, the e-CDF were built from observed data and from RCM data over the 1961–1990 time period. The PDF from the observed data represents the “reference” in this period. For the model convergence criterion R_{D}, the difficulty is that there is no known reference PDF for the future climate. Following Giorgi and Mearns (2002), an iterative process is used to obtain the estimated PDF and, from it, R_{D}. The estimated PDF was built using bootstrapping techniques with N = 10,000 data, considering the simulated series for the models between 2021 and 2050 (30 years). Initially, the reference PDF is built assigning equal weights to all RCM (that is, each model contributes 10,000/17 ≈ 588 data, obtained by sampling with replacement from its simulated series of 30 years). Then, the distance of each RCM to the estimated PDF is calculated and the assigned weights are readjusted accordingly. This procedure converges after a few iterations. It should be noted that the PDF built in this way is only an estimate of the distribution of the AMDSL for the future climate projection. In accordance with Giorgi and Mearns (2002, 2003), the REA average does not represent the actual climate response to the climate forcing scenarios; however, it represents the best estimate of this response.
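The iterative estimation of R_{D} can be sketched as below. This is an illustration under stated assumptions, not the published procedure: the agreement measure is a Perkins-type histogram overlap, the weights are reset in proportion to that overlap, the number of iterations is fixed, and the three input series are hypothetical. The text only states that the weights are readjusted from the distance of each RCM to the estimated PDF.

```python
import numpy as np

def overlap(a, b, n_bins=20):
    """Common area of two relative-frequency histograms on shared bins."""
    span = (min(a.min(), b.min()), max(a.max(), b.max()))
    fa, _ = np.histogram(a, bins=n_bins, range=span)
    fb, _ = np.histogram(b, bins=n_bins, range=span)
    return float(np.minimum(fa / a.size, fb / b.size).sum())

def convergence_weights(series, n_total=10_000, n_iter=10, seed=0):
    """Iterate: pool a weighted bootstrap sample, score each model, reweight."""
    rng = np.random.default_rng(seed)
    series = [np.asarray(s, dtype=float) for s in series]
    weights = np.full(len(series), 1.0 / len(series))  # equal weights to start
    for _ in range(n_iter):
        counts = np.round(weights * n_total).astype(int)
        # Bootstrap (sampling with replacement) from each model's series
        pooled = np.concatenate(
            [rng.choice(s, size=int(c), replace=True) for s, c in zip(series, counts)]
        )
        r_d = np.array([overlap(s, pooled) for s in series])
        weights = r_d / r_d.sum()  # assumed update rule
    return r_d, weights

# Three hypothetical 30-year AMDSL series (days); the third is an outlier
model_a = np.arange(10, 40)
model_b = np.arange(12, 42)
model_c = np.arange(210, 240)
r_d, weights = convergence_weights([model_a, model_b, model_c])
print(np.round(weights, 3))
```

In this toy case the outlying series loses weight at every pass, which also illustrates the narrowing effect attributed to the convergence criterion.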
In this work, the TSSK p value and the Perkins S_{SCORE} have been used to compute the reliability factors (R_{B} and R_{D}) and, consequently, R and Pm, giving different measures of reliability that will be discussed later.
4.1 Two-sample Smirnov-Kolmogorov goodness of fit test
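As a minimal sketch, the TSSK p value for two AMDSL samples could be obtained with SciPy's `ks_2samp`. The use of SciPy is an assumption made for illustration (the text does not name the software used for this test), and the samples are hypothetical.

```python
from scipy.stats import ks_2samp

def tssk_p_value(sample_a, sample_b):
    """p value of the two-sample Kolmogorov-Smirnov test."""
    return ks_2samp(sample_a, sample_b).pvalue

# Hypothetical 30-year AMDSL samples (days)
observed = [float(v) for v in range(10, 40)]
close_model = [v + 0.5 for v in observed]    # nearly the same distribution
biased_model = [v + 100.0 for v in observed]  # strongly shifted distribution

p_close = tssk_p_value(observed, close_model)
p_biased = tssk_p_value(observed, biased_model)
print(p_close, p_biased)
```

The p value falls very quickly as the two distributions separate, which is consistent with the description of the TSSK p value as the steeper of the two metrics.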
4.2 Perkins skill score
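The skill score of Perkins et al. (2007) sums, over common bins, the minimum of the two relative frequencies: S_{SCORE} = Σ min(Z_model, Z_obs). A minimal sketch follows; the number of bins and the sample values are assumptions, since the binning used in the study is not given in this text.

```python
import numpy as np

def perkins_score(model, obs, n_bins=20):
    """Sum over shared bins of the minimum relative frequency.

    Returns 1.0 for identical histograms and 0.0 for disjoint ones.
    """
    model = np.asarray(model, dtype=float)
    obs = np.asarray(obs, dtype=float)
    span = (min(model.min(), obs.min()), max(model.max(), obs.max()))
    freq_model, _ = np.histogram(model, bins=n_bins, range=span)
    freq_obs, _ = np.histogram(obs, bins=n_bins, range=span)
    return float(np.minimum(freq_model / model.size, freq_obs / obs.size).sum())

# Hypothetical 30-year AMDSL samples (days)
observed = np.arange(10, 40)
print(perkins_score(observed, observed))        # identical samples
print(perkins_score(observed + 100, observed))  # disjoint samples
```

Because the score is bounded between 0 and 1 and changes gradually with the overlap, its values stay on a comparable scale across models.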
4.2.1 Non-stationary analysis of AMDSL time series
The stationarity of hydrometeorological time series cannot be guaranteed in the target area; therefore, a methodology that models the time variation of the PDF parameters is called for. In the present work, GAMLSS tools are applied, assuming parametric probability distributions for the response variable (in this case, Y = AMDSL). The PDF parameters have been modeled as functions of the explanatory variable (time t), using cubic splines as smoothing functions. Rigby and Stasinopoulos (2005) and Stasinopoulos and Rigby (2007) present a detailed discussion of the selection and fitting of statistical models using GAMLSS tools.
The number of parameters used to fit statistical models depends on the chosen distribution, but it is usually no more than four (the first parameter for location, the second for scale, and the third and fourth for shape). In the present work, distributions with more than two parameters are not justified, because the short length of the records (90 annual observations for the 1961–2050 time period) hampers the fitting of shape parameters. Therefore, four two-parameter distributions widely used for statistical modeling of hydrologic series were taken into account: Gamma (GA), Gumbel (GU), Lognormal (LN), and Weibull (WEI). The relationship of the first and second distribution parameters with E[Y] and Var[Y] is explained by Stasinopoulos et al. (2008).
In accordance with the procedure suggested by Stasinopoulos and Rigby (2007), the models were fitted considering the Schwarz Bayesian Criterion (SBC), which uses the penalty k = log(n), limiting the effective degrees of freedom to λ ≤ 4. The value of λ is obtained for each distribution considered at every site. The best distribution was selected according to the minimum value of SBC. The independence and normality of the randomized quantile residuals are used to check that the selected model adequately describes the data, by estimating their mean, variance, skewness, kurtosis, and Filliben correlation coefficient (Filliben 1975). Additionally, visual inspections of the qq-plot and the worm plot (not shown), the latter proposed by van Buuren and Fredriks (2001), were performed to verify the normality of the residuals.
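The selection by minimum SBC can be illustrated with a simplified, stationary stand-in for the GAMLSS fit. The sketch below uses SciPy rather than the GAMLSS R package, fits constant parameters (no splines in time), and maps GU to `scipy.stats.gumbel_l` on the assumption that it matches the GAMLSS parameterization; the sample is synthetic.

```python
import numpy as np
from scipy import stats

# Two-parameter candidates; the location is fixed at 0 where SciPy adds a third one
CANDIDATES = {
    "GA": (stats.gamma, {"floc": 0}),
    "GU": (stats.gumbel_l, {}),  # assumed equivalent of GAMLSS "GU"
    "LN": (stats.lognorm, {"floc": 0}),
    "WEI": (stats.weibull_min, {"floc": 0}),
}

def sbc_table(sample, n_params=2):
    """SBC = -2 log-likelihood + log(n) * (number of fitted parameters)."""
    sample = np.asarray(sample, dtype=float)
    penalty = np.log(sample.size) * n_params
    table = {}
    for name, (dist, fixed) in CANDIDATES.items():
        params = dist.fit(sample, **fixed)  # maximum-likelihood fit
        loglik = dist.logpdf(sample, *params).sum()
        table[name] = float(-2.0 * loglik + penalty)
    return table

# Synthetic series of 90 annual values (hypothetical, positive, in days)
rng = np.random.default_rng(1)
amdsl = rng.gamma(shape=6.0, scale=8.0, size=90)

scores = sbc_table(amdsl)
best = min(scores, key=scores.get)
print(best, {k: round(v, 1) for k, v in scores.items()})
```

In the non-stationary case the penalty would multiply the effective degrees of freedom of the fitted smoothers instead of a fixed parameter count.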
4.2.2 On site ensemble PDF and interpolated maps of AMDSL
Non-stationary PDF were fitted for each RCM. Afterwards, ensemble PDF were built at each grid site, using the information provided by the maps of normalized reliability factors Pm. Each on-site ensemble PDF was built from a sample of 10,000 values drawn from the non-stationary PDF of the RCMs. For example, if Pm_{i} = 0.25, then the non-stationary PDF fitted to RCM_{i} contributed 2,500 values to the final ensemble PDF. Since the ensemble PDF was built using distributions with non-stationary parameters, the procedure was repeated to define the final PDF every 10 years (1961, 1970, 1980, and so on until 2050).
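The weighted pooling for one site and one target year can be sketched as follows. The three member distributions and their parameters are hypothetical stand-ins for the fitted non-stationary PDF evaluated at that year, and `ensemble_sample` is an illustrative name.

```python
import numpy as np

def ensemble_sample(samplers, pm, n_total=10_000, seed=0):
    """Pool round(Pm_i * n_total) draws from each member distribution."""
    rng = np.random.default_rng(seed)
    pm = np.asarray(pm, dtype=float)
    pm = pm / pm.sum()  # make sure the weights add up to 1
    counts = np.round(pm * n_total).astype(int)
    parts = [draw(rng, int(c)) for draw, c in zip(samplers, counts)]
    return np.concatenate(parts), counts

# Hypothetical members: each callable draws n values from one fitted PDF
members = [
    lambda rng, n: rng.gamma(shape=5.0, scale=8.0, size=n),
    lambda rng, n: rng.lognormal(mean=3.6, sigma=0.4, size=n),
    lambda rng, n: 45.0 * rng.weibull(2.0, size=n),
]

sample, counts = ensemble_sample(members, pm=[0.25, 0.50, 0.25])
print(counts.tolist(), sample.size)  # [2500, 5000, 2500] 10000
```

With 17 members the rounded counts may add up to slightly more or fewer than 10,000; repeating the call with the parameters of each target year gives the sequence of ensemble PDF.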
From the ensemble PDF several spatial distributions of statistics with their respective 95 % confidence intervals (CI) were computed, using bootstrapping techniques (Efron and Tibshirani 1993).
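A percentile bootstrap is one common way to obtain such intervals. The sketch below assumes that variant, since the text does not specify which one was used; the number of resamples and the input sample are hypothetical.

```python
import numpy as np

def bootstrap_ci(sample, stat=np.mean, n_boot=1000, level=0.95, seed=0):
    """Point estimate and percentile-bootstrap confidence interval of `stat`."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    # Each row of idx selects one resample with replacement
    idx = rng.integers(0, sample.size, size=(n_boot, sample.size))
    estimates = np.array([stat(row) for row in sample[idx]])
    alpha = (1.0 - level) / 2.0
    lower, upper = np.quantile(estimates, [alpha, 1.0 - alpha])
    return float(stat(sample)), float(lower), float(upper)

# Hypothetical ensemble sample of AMDSL values (days)
values = np.arange(1, 101)
mean, lower, upper = bootstrap_ci(values)
print(mean, round(lower, 1), round(upper, 1))
```

Passing `stat=np.std`, or a function that returns a centile, gives the interval for the other statistics mapped in the study.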
5 Results analysis
5.1 Relationship between TSSK p value and Perkins skill score
The median values of R_{B} of every RCM, computed from the box-plot, were used to build Fig. 4c. The picture reveals the main difference between the metrics used: the TSSK p value is a “steeper” metric than the S_{SCORE}, so it has values through various scales (the R_{B}–TSSK axis is in logarithmic scale), while the S_{SCORE} values are contained in a comparable scale. Figure 4c shows a strong relationship between R_{B}–TSSK p value and R_{B}–S_{SCORE}. The scatter plot of R_{B} median values indicates that the RCM with better skill to simulate the observed data are DMI/ARPEGE and CNRM/RM5.1, while the models with the worst performance are DMI/BCM, SMHI/BCM and METNO/BCM.
RCM | R_{B} S_{SCORE} | R_{B} TSSK p value | R_{D} S_{SCORE} | R_{D} TSSK p value | R S_{SCORE} | R TSSK p value | Pm S_{SCORE} | Pm TSSK p value |
---|---|---|---|---|---|---|---|---|
C4IRCA3 | 0.840 | 2.94E−07 | 0.914 | 2.54E−03 | 0.761 | 1.25E−11 | 0.061 | 0.0052 |
CNRM/RM5.1 | 0.858 | 5.44E−03 | 0.882 | 1.15E−07 | 0.754 | 3.3E−11 | 0.062 | 0.0013 |
DMI/ARPEGE | 0.863 | 5.71E−02 | 0.872 | 2.06E−09 | 0.749 | 8.00E−12 | 0.061 | 0.0117 |
DMI/BCM | 0.725 | 9.24E−30 | 0.887 | 3.43E−11 | 0.642 | 8.16E−42 | 0.051 | 0.0147 |
DMI/ECHAM5-r3 | 0.773 | 4.30E−19 | 0.919 | 3.70E−03 | 0.710 | 8.38E−23 | 0.057 | 0.1750 |
ETHZ/CLM | 0.847 | 2.40E−04 | 0.886 | 4.51E−07 | 0.748 | 1.13E−11 | 0.061 | 0.0174 |
METO HC/HAD | 0.832 | 9.08E−08 | 0.919 | 7.93E−03 | 0.761 | 3.12E−12 | 0.061 | 0.0975 |
ICTP/RegCM3 | 0.790 | 2.64E−15 | 0.927 | 3.62E−02 | 0.735 | 1.14E−17 | 0.059 | 0.0014 |
KNMI/RACMO2 | 0.839 | 1.22E−05 | 0.912 | 1.78E−02 | 0.764 | 2.04E−08 | 0.062 | 0.0830 |
METNO/BCM | 0.751 | 3.07E−23 | 0.901 | 2.82E−06 | 0.684 | 3.63E−29 | 0.054 | 0.0163 |
METNO/HadCM3Q0 | 0.818 | 1.19E−10 | 0.928 | 1.06E−01 | 0.755 | 1.14E−12 | 0.060 | 0.0129 |
MPIM/REMO | 0.836 | 1.24E−08 | 0.930 | 5.97E−01 | 0.775 | 1.28E−09 | 0.062 | 0.1820 |
OURANOS/MRCC4.2.1 | 0.814 | 2.32E−11 | 0.931 | 4.23E−01 | 0.758 | 3.21E−13 | 0.061 | 0.0454 |
SMHI/BCM | 0.738 | 7.57E−25 | 0.889 | 3.12E−09 | 0.657 | 1.70E−34 | 0.053 | 4.70E−05 |
SMHI/ECHAM5-r3 | 0.818 | 1.47E−10 | 0.934 | 4.88E−01 | 0.762 | 2.42E−11 | 0.061 | 0.0114 |
SMHI/HadCM3Q3 | 0.783 | 5.10E−17 | 0.925 | 2.87E−02 | 0.727 | 1.39E−18 | 0.058 | 0.0117 |
UCLM/PROMES | 0.816 | 2.37E−11 | 0.924 | 7.15E−02 | 0.753 | 2.02E−13 | 0.060 | 0.1490 |
Σ | | | | | | | 1.003 | 0.8360 |
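The Pm–S_{SCORE} column can be approximately recovered from the R–S_{SCORE} column if Pm is taken as each model's share of the total, Pm_i = R_i / Σ R_j. This form is an assumption made for the sketch, because the defining equation is not reproduced in this text. The tabulated values are medians over sites, so only approximate agreement is expected.

```python
# R (S_SCORE) values copied from the table above
r_sscore = {
    "C4IRCA3": 0.761, "CNRM/RM5.1": 0.754, "DMI/ARPEGE": 0.749,
    "DMI/BCM": 0.642, "DMI/ECHAM5-r3": 0.710, "ETHZ/CLM": 0.748,
    "METO HC/HAD": 0.761, "ICTP/RegCM3": 0.735, "KNMI/RACMO2": 0.764,
    "METNO/BCM": 0.684, "METNO/HadCM3Q0": 0.755, "MPIM/REMO": 0.775,
    "OURANOS/MRCC4.2.1": 0.758, "SMHI/BCM": 0.657, "SMHI/ECHAM5-r3": 0.762,
    "SMHI/HadCM3Q3": 0.727, "UCLM/PROMES": 0.753,
}

total = sum(r_sscore.values())
# Assumed normalization: each model's share of the summed reliability
pm = {name: r / total for name, r in r_sscore.items()}

for name in ("C4IRCA3", "DMI/BCM", "MPIM/REMO"):
    print(name, round(pm[name], 3))
```

An equal-weight ensemble of 17 members would give 1/17 ≈ 0.059 to each model, so the S_{SCORE} weights depart only slightly from equal weighting.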
Two metrics to compute the member weights in the ensemble PDF have been presented, whose results are appreciably different: the TSSK p value is a steeper metric which only takes into account the better models to build the ensemble PDF, while the S_{SCORE} uses virtually all members, with slightly greater weights for the better models. The following sections will analyze the effects on computation of the mean, standard deviation and some centile values, using the previously presented metrics to build the ensemble PDF for every site.
5.2 Spatial distributions
5.2.1 Reliability factor maps
The upper frames in the picture (Fig. 8a), show the maps built with the S_{SCORE}. The R_{B}–S_{SCORE} map presents higher values (blue in color) in Galicia and on the Central Plateau of the Iberian Peninsula (Autonomous regions of Castilla La Mancha, Extremadura, and Madrid), which indicates that the set of models exhibits a lesser bias in this area. This bias increases eastward, with the lesser values of R_{B}–S_{SCORE} for the Levante area. The R_{D}–S_{SCORE} map does not present a meaningful variation of spatial distribution, in accordance with the large convergence in the future of every model in the study area. The lesser values of R_{D}–S_{SCORE} are located along the Mediterranean Sea coastline. The R–S_{SCORE} map is strongly influenced by the spatial changes on the R_{B}–S_{SCORE} map, so the lesser values of R–S_{SCORE} are located in the southeast of Spain too. All the previous maps present values of comparable magnitude scale.
The bottom frames (Fig. 8b) show the maps built with the TSSK p value. The R_{B}–TSSK p value map shows a spatial variation similar to the distribution presented by the R_{B}–S_{SCORE} map. However, the values range from less than 0.001 (northeast of Spain) to greater than 0.1 (West Andalusia). Because the values are spread over several orders of magnitude, the areas where the set of models has, in general, a good or deficient performance are easier to discern. As a result, there is a poor agreement between the observed data and the set of models for the north and northeast of Spain, while the mean bias is lower in west Galicia and in the southwest of Spain. The R_{D}–TSSK p value map reveals a zonal gradient, decreasing eastward. Nevertheless, the future convergence of the set of models is large across the study area when compared against the significance level α = 0.05 (generally accepted as indicating a good fit between probability distributions). The R_{D}–TSSK p value map does not have scale variations as large as those of the R_{B}–TSSK p value map, but they are greater than the variations in the R_{D}–S_{SCORE} map. Lastly, the R–TSSK p value map shows lower values in the upper half of the Iberian Peninsula, and meaningful scale variations (from less than 0.001 to greater than 0.01). This last feature will be important when the normalized reliability factor is computed: the poor performance of the set of models in the north of Spain could lead to misleadingly large Pm values, associated with models which are not important in other areas.
5.2.2 Pm factor maps
From the results summarized in Table 2, the models with best performance, according to Pm–S_{SCORE}, are identified (CNRM/RM5.1, DMI/ARPEGE, KNMI/RACMO2, MPIM/REMO and SMHI/ECHAM5-r3). These RCMs exhibit spatial distributions without a clear spatial pattern (Fig. 10), with values ranging between 0.055 and 0.065. The models with worst performances, from Table 2 (DMI/BCM, METNO/BCM and SMHI/BCM), present heterogeneous spatial distributions, with the value range spread between 0.040 and 0.060. The lowest values of these three models are located on the northeast quadrant of Spanish territory. According to Figs. 7a and 9, every model contributes with 4–7 % to the ensemble PDF in the analysis sites.
The maps of Pm–TSSK p value (Fig. 10) exhibit greater spatial variation than the Pm–S_{SCORE} maps. The better models according to Pm–TSSK p value in Table 2 (DMI/ECHAM5-r3, KNMI/RACMO2, METO HC/HAD, MPIM/REMO and UCLM/PROMES) show significant spatial variations. Of these five models, only two (KNMI/RACMO2 and MPIM/REMO) show an outstanding performance according to the R factor (see Table 2, R–TSSK p value column). Of the remaining three, two could be considered models with average performance (METO HC/HAD and UCLM/PROMES), and one is even among the poorer performing models (DMI/ECHAM5-r3). The explanation for this situation is the uneven spatial distribution of the REA value of the set of models (R–TSSK p value in Fig. 8). From Fig. 8, every RCM performs badly in the northern half of Spain. Consequently, the high Pm values computed for some models (particularly DMI/ECHAM5-r3, METO HC/HAD and UCLM/PROMES) are not an indication of outstanding reliability of those particular models. On the contrary, they reveal the lack of reliability of the set of models in these areas of Spanish territory.
5.2.3 Maps of interpolated statistics from ensemble PDF
Once Pm maps are calculated for each RCM (whatever metric is used, S_{SCORE} or TSSK p value), the ensemble PDF for each site of the study area was built.
It has been shown that, depending on the metric chosen, the contribution of each RCM to the final ensemble PDF can differ significantly. Therefore, the maps constructed for different statistics are expected to reflect these differences. The selected statistics and their 95 % confidence intervals were calculated using bootstrapping techniques.
The right column of Fig. 11 presents the calculated change maps (difference 2050–1990). From them and for both metrics, significant increases in the average value of the future horizon AMDSL are predicted for virtually all the Spanish territory, except for some areas of the central plateau; northeast of the Iberian Peninsula; and the Cantabrian Coast. The maps of differences between the metrics (Fig. 11c) show a similar spatial distribution for the selected years.
The positive values (values computed with higher S_{SCORE}) are concentrated in the northeast quadrant of the Iberian Peninsula; the Cantabrian Coast; and the central plateau, while negative values (the higher values from the TSSK p value outcome) are located in the South and Southeast of the Iberian Peninsula. Despite the differences between the maps for each metric (which can be up to 40% in some areas), both approaches concluded that there will be a significant and widespread increase throughout continental Spain of the mean value of AMDSL between 1990 and 2050.
6 Discussion of results
Two metrics (S_{SCORE} and TSSK p value) have been used to compute the model weighting factors in the ensemble PDF (Pm in Eq. 2) through the REA method. The Pm value should always be in the [0,1] interval. However, the values computed with the S_{SCORE} metric share the same order of magnitude (0.35–0.90, as presented in Fig. 7a), while the values computed with the TSSK p value metric span several orders of magnitude (from lower than 0.001 to greater than 0.1, Fig. 7b).
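The contrast in scale between the two metrics can be reproduced with a small, self-contained sketch. It assumes the usual definition of the Perkins-type skill score (the summed overlap of two binned relative-frequency histograms) and an asymptotic approximation of the two-sample Kolmogorov-Smirnov p value with a common small-sample correction. The function names, the bin edges and the three samples are hypothetical, and the binning and test implementation used in the study may differ.

```python
import bisect
import math

def perkins_score(model, obs, bin_edges):
    """Summed overlap of two binned relative-frequency histograms (0 to 1)."""
    def rel_freq(data):
        counts = [0] * (len(bin_edges) - 1)
        for x in data:
            i = bisect.bisect_right(bin_edges, x) - 1  # right-open bins
            if 0 <= i < len(counts):  # values outside the bins are ignored
                counts[i] += 1
        return [c / len(data) for c in counts]
    return sum(min(m, o) for m, o in zip(rel_freq(model), rel_freq(obs)))

def ks_two_sample(a, b):
    """Two-sample KS statistic D and an asymptotic (approximate) p value."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    # Largest distance between the two empirical CDFs.
    d = max(
        abs(bisect.bisect_right(a, x) / n - bisect.bisect_right(b, x) / m)
        for x in a + b
    )
    n_eff = n * m / (n + m)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    if lam < 0.3:  # the series below is numerically 1 in this range
        return d, 1.0
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2) for k in range(1, 101)
    )
    return d, min(1.0, max(0.0, p))

# Hypothetical dry spell samples (days); not data from the paper.
obs = list(range(20, 60))
good = [x + 2 for x in obs]    # model close to the observations
poor = [x + 25 for x in obs]   # strongly biased model
edges = list(range(0, 101, 10))

s_good, s_poor = perkins_score(good, obs, edges), perkins_score(poor, obs, edges)
(d_good, p_good), (d_poor, p_poor) = ks_two_sample(good, obs), ks_two_sample(poor, obs)
print(s_good, s_poor, p_good, p_poor)
```

In this illustrative case the two scores differ by a factor of less than three, whereas the two p values differ by several orders of magnitude, which is the behaviour described for Fig. 7.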
Several scatter plots have been presented in order to analyse the relationship between the S_{SCORE} and the TSSK p value. From these scatter plots, the models were ranked according to their bias with respect to the observed data (R_{B}, Fig. 4c), their future convergence (R_{D}, Fig. 5c), and both criteria together (R, Fig. 6c). According to R_{B}, the least biased models are DMI/ARPEGE and CNRM/RM5.1 (both driven by the GCM ARPEGE), whereas DMI/BCM, SMHI/BCM and METNO/BCM are the most biased. However, the R_{D} factor shows the DMI/ARPEGE RCM as the farthest from the general future agreement. Therefore, the future convergence of the models does not guarantee that their biases with respect to the observed data are small. Finally, the reliability factor R indicates that KNMI/RACMO2 and MPIM/REMO are, in general, the most suitable models to simulate AMDSL in Spain, whereas DMI/BCM, SMHI/BCM and METNO/BCM (the same models identified by the R_{B} factor) show, in general, the least skill.
Despite the different magnitude scales of the results, the metrics agree that the set of models exhibits less bias in the areas located in the Southwest quadrant of Spain and in Galicia (where the R_{B} factor is higher, left column of Fig. 8), while the observed data are simulated worse by the set of models in the North and Northeast. As for the future convergence maps, the R_{D}–S_{SCORE} map (middle column, Fig. 8a) is not steeper than the R_{D}–TSSK p value map (middle column, Fig. 8b), whose values show a meridional gradient increasing eastward. Finally, the R–S_{SCORE} map (right column, Fig. 8a) inherits the main characteristics of the R_{B}–S_{SCORE} map: low values for the Northeast and the Spanish Levante area, and high values in the West of Spain. However, the R–TSSK p value map (right column, Fig. 8b) exhibits lower simulation skill in the northern half of Spain.
If the number of ensemble members is high (in this work, seventeen models were considered), the S_{SCORE} metric appears unsuitable, since it tends to produce nearly uniform weighting factors. The TSSK p value seems to solve this problem because it penalizes the less skilful models with very small R values (R → 0). The better models should have R_{B} and/or R_{D} values greater than the significance level α, so the reliability factor R must be greater than α^{2}. Nevertheless, a problem arises when, at a particular site, the reliability factor of all ensemble members tends to zero (R ≪ α^{2}). In this case, it has been shown that a high Pm value is not a consequence of the high performance of one particular model; instead, it reflects the poor performance of all models, so that “the least bad model” prevails. This shows the need for a compromise between the metrics. The compromise should reward the models with better results according to the statistical test, giving them a greater specific weight in the ensemble PDF, as the TSSK p value metric does. However, where very small R factor values have been computed using the TSSK p value metric, it should tend to level the model weights, as the S_{SCORE} metric does. For example, assume that the TSSK p value metric is used and a significance level α = 0.01 is set. Then R_{B} > 0.01 means that the model data are statistically unbiased with respect to the observed data, and R_{D} > 0.01 means that the model data agree with the reference distribution in the future. Hence, a potential critical value for the reliability factor could be R_{CRIT} = R_{B-CRIT}·R_{D-CRIT} = α^{2} = 1E−4. The models with reliability factor values lower than the critical value (R < 1E−4) should be evenly weighted in the distribution.
The most extreme case is when all RCM have reliability factor values lower than the critical value: in this case, all members of the ensemble should have almost the same weight.
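One possible reading of this proposal is sketched below. It assumes that Pm is obtained by dividing each R by the sum of R over all models, and it implements the levelling by raising every R below R_{CRIT} = α^{2} to R_{CRIT} before normalising. Both the floor rule and the function name `normalised_weights` are assumptions made for illustration, and the R values are hypothetical.

```python
def normalised_weights(r_values, alpha=None):
    """Return Pm = R / sum(R) for each model.

    Assumption: if `alpha` is given, every R below R_CRIT = alpha**2 is
    raised to R_CRIT first, so that those models share the same weight.
    """
    if alpha is not None:
        r_crit = alpha ** 2
        r_values = [max(r, r_crit) for r in r_values]
    total = sum(r_values)
    return [r / total for r in r_values]

# Hypothetical reliability factors R at two sites; not values from the paper.
site_mixed = [2e-2, 5e-3, 3e-6, 8e-9]     # two skilful and two poor models
site_all_poor = [4e-7, 3e-8, 2e-8, 1e-9]  # every model below R_CRIT

plain = normalised_weights(site_all_poor)                 # "least bad" model dominates
levelled = normalised_weights(site_all_poor, alpha=0.01)  # equal weights
mixed = normalised_weights(site_mixed, alpha=0.01)        # skilful models still rewarded

print([round(w, 3) for w in plain])
print([round(w, 3) for w in levelled])
print([round(w, 3) for w in mixed])
```

With plain normalisation, the first model at the all-poor site receives most of the weight even though its R is far below α^{2}. With the floor, the weights at that site become equal, while at the mixed site the two skilful models keep almost all of the weight.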
The change maps interpolated from the ensemble PDF show, for both metrics, significant changes in the mean AMDSL for the predicted horizon in almost all the Spanish territory. Nevertheless, the difference between the maps of the two metrics can be almost 40 %. Significant positive values (values computed with S_{SCORE} are higher) are located in the Spanish Northeast, the Cantabrian Coast, and the Central Plateau, while significant negative values (values from the TSSK p value are higher) are located in the South and Southeast of the Iberian Peninsula.
In summary, although the metrics agree on the general increase of the standard deviation (SD) of AMDSL in Spain, the spatial distribution of significant changes is slightly different: the SD–S_{SCORE} change map (right column, Fig. 12a) shows a significant change for the regions of Galicia and Extremadura, while the change is not significant in the central Iberian Peninsula, Andalusia, and the Levante area. The SD–TSSK p value change map agrees on the significant changes for Galicia and Extremadura, and also shows significant changes for areas of Andalusia and Aragon. The differences between the metrics are significant and positive (values computed with S_{SCORE} are higher) in almost all the eastern half of the Iberian Peninsula, except for the Levante area, where the differences are negative but not significant.
7 Conclusions
This paper presents two different metrics to compute the RCM weighting factors (the normalized reliability factor Pm), which allow ensembles to be built from dry spell length PDFs. The weighting factors have been computed using the REA methodology. The REA factors (bias with respect to observations and future convergence) have been computed using two metrics (S_{SCORE} and TSSK p value). The sensitivity of the ensemble PDF to both metrics was assessed. The S_{SCORE} metric produces values of comparable magnitude, so all members make a meaningful contribution to the final ensemble PDF. On the other hand, the TSSK p value is a steeper metric, so models with worse performance may make a very low contribution to the ensemble PDF.
Using different statistics calculated from the ensemble PDF, interpolated maps were constructed to analyze the maximum dry spells for p < 1 mm/d in mainland Spain, and conclusions were drawn about the differences according to the metric considered.
The main difference between the results of the metrics lies in the scale on which the weighting scores are assigned.
For R, the relationship between the values calculated with S_{SCORE} and with TSSK p value is strong, but it weakens considerably when the normalized reliability factor Pm is estimated. In some parts of the study area, high Pm values correspond to low values of R. In this case, the high value of Pm does not reflect a remarkable quality of fit, but is the result of the poor fit of the whole set of RCMs, especially those considered the best in other areas of the territory.
The S_{SCORE} approach presents problems when the number of models participating in the ensemble is large (in this study, seventeen), because the approach tends to compute similar weighting factors.
In conclusion, the TSSK p value approach is more parsimonious than the S_{SCORE} approach in the building of ensemble RCMs based on AMDSL PDF.
Acknowledgments
This work has been developed in the framework of the R&D Project CGL2008-02530/BTE, financed by the State Secretary of Research of the Spanish Ministry of Science and Innovation (MICINN). The funding received is gratefully acknowledged.