Climate Dynamics, Volume 40, Issue 5, pp 1271–1290

Assessing uncertainties in the building of ensemble RCMs over Spain based on dry spell lengths probability density functions


    • Facultad de Ingeniería, Departamento de Ingeniería Civil, Grupo de Investigación Ciencia e Ingeniería del Agua y el Ambiente, Pontificia Universidad Javeriana
    • Department of Civil Engineering, R&D Group of Water Resources Management, Universidad Politécnica de Cartagena

DOI: 10.1007/s00382-012-1381-5

Cite this article as:
Giraldo Osorio, J.D. & García Galiano, S.G. Clim Dyn (2013) 40: 1271. doi:10.1007/s00382-012-1381-5


Spain is one of the European countries with most environmental problems related to water scarcity and droughts. Additionally, several studies suggest trends of increasing temperature and decreasing rainfall, mainly for the Iberian Peninsula, due to climate variability and change. While Regional Climate Models (RCM) are a valuable tool for understanding climate processes, the causes and plausible impacts on variables and meteorological extremes present a wide range of associated uncertainties. The multi-model ensemble approach allows the quantification and reduction of uncertainties in the predictions. The combination of models (RCM in this case) generally increases the reliability of the predictions, although there are different weighting methodologies. In this paper, a strategy is presented for the building of non-stationary PDF (probability density function) ensembles, with the aim of evaluating the spatial pattern of future risk of drought for an area. At the same time, the uncertainty associated with the metric used in the construction of the PDF ensembles is assessed. A comparative study of methodologies based on the application of Reliability Ensemble Averaging (REA) is proposed, assessing its factors using two performance measures: the Perkins score method on the one hand, and the Kolmogorov-Smirnov test on the other. The evaluation of the sensitivity of the methodologies used in the construction of ensembles, as proposed in this paper, although without completely eliminating uncertainty, allows a better understanding of the sources and magnitude of the uncertainties involved. Despite the differences between the spatial distribution results from each metric (which can be in the order of 40 % in some areas), both approaches point to a plausible, significant and widespread increase throughout continental Spain of the mean value of annual maximum dry spell lengths (AMDSL) between the years 1990 and 2050.
Finally, the more parsimonious approach for building ensemble PDFs, based on AMDSL in peninsular Spain, is identified.


Climate change · Regional climate models · Maximum dry spell length · Non-stationary analysis · Ensemble PDF

1 Introduction

Climate variability and change have deep impacts on both human socioeconomic activities and ecosystems. More severe and frequent hydro-meteorological extreme events suggest that several hydrological variables are reaching critical thresholds, responsible for sudden and negative impacts rather than a gradual change. Impacts on economic activities, ecosystems and human health are related to climate variability and change; these impacts span from damaging effects on farm production and increases in climate refugees, to deep disruption of ecosystems, water resources scarcity and faster spread of vector-borne diseases (Räisänen and Palmer 2001; Tebaldi and Sansó 2009; WHO 2009).

From observed datasets, a rainfall increase was observed in the north of Europe while there was a decrease in the West Mediterranean Basin (Paredes et al. 2006). These rainfall trends have been confirmed by climate model projections (Räisänen and Palmer 2001; Giorgi and Mearns 2002; Giorgi et al. 2004; Tebaldi et al. 2005). Considering temperature data from Global Climate Models (GCM) for the reference period 1961–1990 and the prediction for 2071–2100, several authors (Giorgi et al. 2004; Tebaldi et al. 2005) have identified a generalized rise of temperatures over the Iberian Peninsula, especially in summer. Furthermore, a plausible decrease of rainfall greater than 15 % over the Iberian Peninsula for the wet season was identified (Giorgi et al. 2004). There are several efforts for the inter-comparison of Regional Climate Model (RCM) results over Europe. Among them are the works of Jacob et al. (2007) and Christensen and Christensen (2007), assessing the ability of RCMs, within the context of the European project PRUDENCE, to simulate the long-term mean climate and the inter-annual variability. These authors worked with near-surface temperature and precipitation.

The GCM are considered the only tools that can take into account the complex set of processes which control the climate. Nowadays, the GCM are the most valuable source of data regarding future climate change at global scale, and about the change in frequency and severity of extreme events (Murphy et al. 2004; Sánchez et al. 2009). However, the models are subject to errors, now and in the future, and this gives rise to uncertainties. The uncertainties identified in climate modelling are usually associated with the initial conditions, boundary conditions, parameterization and, finally, structural uncertainties (Tebaldi and Knutti 2007). The uncertainties are principally caused by unreliable projections of greenhouse gases (GHG), highly related to doubts regarding world population growth, future economic and technological development and progress in international cooperation agreements, as well as by a lack of understanding of the climate system, the intrinsic randomness of the processes involved and current modelling constraints, among other causes (Sánchez et al. 2009; Tebaldi and Sansó 2009). Fitting a particular model to reproduce the mean values, the variability and the trends in observed data makes sense in order to reduce its own uncertainties. Although confidence in model simulations has notably increased, a suitable simulation of observed data is no guarantee of the quality of the model projections (Tebaldi and Knutti 2007). Consequently, a strategy is needed to assess the uncertainty of climate projections, since stakeholders often face several simulations of unknown modelling quality. Given these uncertainties, a probabilistic forecast seems more valuable than a deterministic approach. Therefore, the assessment of uncertainty from projections using a multimodel ensemble approach provides useful tools for planning adaptation and mitigation strategies.

If a qualitative definition of drought from a social point of view is used, it would be possible to say that drought is a recurrent phenomenon in Spain. This statement is supported by the fact that drought threatens the supply and irrigation systems, the latter being the main water user in Spain. Nevertheless, natural cycles of drought should not seriously affect the native vegetation, which has already adapted to the complex environment. In this context, it could be questioned whether the "social perception" of more frequent and stronger droughts is caused by a real increasing trend, or is a misunderstanding caused by the increasing human pressures on water resources. If the RCM predict real changes in droughts, the next question is whether the results depend on the weights of the members in the ensemble. This work tries to provide answers to both questions.

In accordance with the works of several authors (Sánchez et al. 2009; Tebaldi and Knutti 2007), a multimodel ensemble approach is needed for the evaluation of uncertainties. Tebaldi and Knutti (2007) suggest that model combination generally increases the projection reliability. The main assumption of the ensemble approach is the independence between models; therefore the uncertainty should diminish as the number of available models increases. Nevertheless, this assumption may not hold in practice, because there are shared processes or methodologies between various models (spatial resolution, parameterization, observed dataset used to fit the model, numerical methods and their deficiencies, etc.). The assumption of independence between models implies that the random errors tend to cancel out; however, systematic biases shared by various models will be inherited by the ensemble. Another important issue is the "opportunity ensemble", which means there are non-scientific aspects that define the ensemble members and their characteristics, giving rise to a non-random and non-systematic sample of models (Tebaldi and Knutti 2007). Finally, the scientific groups usually try to fit their models to reproduce the observed datasets; therefore the models are not really designed to span the whole uncertainty range.

Basically, the combination of members in the multimodel ensemble can be done in two ways. The first is to neglect the different reliability of the models and weight them all equally (Murphy et al. 2004). The other is to use weighted averages, where the model weight depends on some measure of performance. The main question is how to define the metric for measuring model performance, because there is not just one single way of assessing this. Several works have faced the problem from different points of view (Räisänen and Palmer 2001; Giorgi and Mearns 2002, 2003; Dettinger 2005; Tebaldi et al. 2005; Sánchez et al. 2009). But sometimes the conclusions can be opposite due to the uneven spread of results (Tebaldi and Knutti 2007).

The selection of a particular metric is pragmatic and mainly subjective. The Reliability Ensemble Averaging (REA) method (Giorgi and Mearns 2002), has been selected in the present work to compute the models weight using the empirical probability distributions from observed data. The REA method rewards with higher weight both models with great skill to simulate the observed data distribution, and models closest to the “ensemble consensus” in the future; while the models farthest from these two criteria are penalized. Weigel et al. (2010) highlight that equally weighted multimodels on average outperform the single models, and that projection errors can be further reduced by applying model weights according to some measure of performance.

Xu et al. (2010) updated the definition of the REA method, abandoning the convergence criterion and including multiple variables and statistics in the formulation. The criticism of the convergence criterion was that it could produce an artificial narrowing of the ensemble PDF, so that tails and extreme values would be lost. On the other hand, the criticism regarding the use of a single variable (for example precipitation) arises because it could produce a weak assessment of the model's reliability. The present work overcomes this artificial narrowing by using future empirical PDF, which are built from all RCMs' data, to compute the convergence factor. Finally, the annual maximum dry spell lengths (AMDSL) are extreme values, which should be fitted separately from other variables, because they are an independent population.

The issues previously exposed encourage the use of RCM to build ensemble probability distributions for studying the drought phenomenon. The GCM are not able to recognize regional heterogeneities of climate, therefore they are not suitable for building small scale projections, which are needed in impact studies (Paeth et al. 2011). The dynamical downscaling provided by RCM could be used to undertake this task at basin scale (Karambiri et al. 2011). Sánchez et al. (2009) used data from RCM forced by ERA40, to build ensemble CDF (Cumulative Distribution Functions) of seasonal rainfall, considering a regional approach over Europe. García Galiano and Giraldo Osorio (2010) presented a good example of the application of RCM data at basin level to study impacts on extreme events of rainfall in the Senegal River Basin (West Africa).

In the present work, the chosen hydrologic variable is the AMDSL, considering a dry spell as the number of consecutive days with rainfall below a threshold. The threshold was set at 1 mm/day. The dry spells are of great interest because they are directly related to the pronounced dryness of Spain's landscapes and to the rainfall zonal gradient, which have affected historic decisions and discussions about the nationwide planning of water resources. A pioneering work on dry spell analysis in Spain was presented by Martín-Vide and Gómez (1999), who describe a regionalization of Spain's territory based on Markov chains fitted to dry spell time series. Even though the fit was not entirely satisfactory (it was not good enough in the southeast of Spain), the analysis identified a clear zonal gradient of dry spell lengths, increasing southward. Sánchez et al. (2011) have performed a dry spell analysis using observed and simulated precipitation grids for the Iberian Peninsula, with both RCM and GCM. According to that work, the drought periods will increase throughout almost all of Spain's territory. Moreover, the change will be greater in the south of the Peninsula, which will increase the latitudinal gradient compared with the current climate.

In contrast to Sánchez et al. (2011), the current work focuses on ensemble PDF building in order to analyze plausible trends of maximum dry spells. The first objective of this work is to outline the spatial distribution of several statistics estimated from the ensemble PDF, so they are built at all sites in the study area. The dynamic assessment of trends in the study variable (in this case AMDSL) is enabled by fitting non-stationary probabilistic models whose parameters change over time. This is the main difference from other works, where the time series are split into time windows to compute the change between slices, or are directly managed assuming stationary parameters.

The definition of REA (Giorgi and Mearns 2002) has been used as the metric to compute the member weights in the ensemble, in order to combine the fitted non-stationary distributions. The REA criteria were estimated from empirical probability distributions of AMDSL in 1961–1990 (the model performance criterion), and 2021–2050 (the model convergence criterion) time periods.

The second objective of the work is to quantify the influence of the metric selected to compute the member weights in the ensemble on the results. Thus, each REA component has been computed taking into account two performance measures: the Perkins score (SSCORE, Perkins et al. 2007), and the p value from the two-sample Smirnov-Kolmogorov (TSSK) goodness of fit test. The proposed methodology tries to span the whole range of uncertainty through the building of ensemble PDF. The empirical distributions of AMDSL are used to compute the member weights in the REA method. The aim is to find the suitable tuning that enables the ensemble PDF to better reproduce the observed distribution of the data, instead of only trying to reproduce its mean or standard deviation. It must be highlighted that the proposed methodology enables a dynamic performance of the ensemble PDF, because it inherits the simulated variability from each ensemble member, through the fitting of non-stationary distributions to the AMDSL time series.

2 Study area and datasets

2.1 Study area

The target zone of continental Spain is presented in Fig. 1a, with the administrative boundaries (autonomous regions in Spain) indicated. Spain's climate is characterized by an increasing gradient of rainfall in the south-east to north-west direction, from values of less than 250 mm/year in the south-east to the highest values in the western Pyrenees, the Cantabrian Coast and especially the Galicia region, where the mean annual rainfall is above 1,600 mm/year, as presented in Fig. 1b.
Fig. 1

Zone of study: a Location of 906 sites and administrative limits (autonomous regions) of continental Spain; and b Mean annual rainfall for time period 1950–2007, from Spain02/v2.1 dataset

The climate of most of continental Spain is characterized by a dry period in July and August, which is particularly intense in the southern half of the Iberian Peninsula, and a period of rainfall during the winter months (DJF), mainly on the Cantabrian Coast (autonomous regions of Galicia, Asturias, Cantabria and Basque Country). The Levante area (autonomous regions of Murcia and Valencia) presents a bimodal rainfall cycle, with high values in April–May and October–November, and dry periods in winter and especially in summer. Several authors (Paredes et al. 2006) explore the causes of spatiotemporal variability of precipitation in Spain.

2.2 Datasets

The daily rainfall was obtained from the RCM data for Europe, provided by ENSEMBLES Project RT2B (Christensen et al. 2009). The selected RCM, presented in Table 1, correspond to those with a spatial resolution of ~0.25° (~25 × 25 km²) and data in the time period 1961–2050 over continental Spain. Finally, seventeen RCM simulations were selected, conducted with different RCMs driven by different GCMs (Global Climate Models) for scenario A1B. Grids of daily rainfall for Spain (Spain02/v2.1 dataset) with a spatial resolution of 0.2° (~20 × 20 km²), presented by Herrera et al. (2010), were considered for the bias analysis of the maximum simulated rainfall of the RCM in the control period (1961–1990). Based on the grid provided by the RCMs dataset, 906 sites for analysis were set up (Fig. 1a).
Table 1

Datasets of daily rain: observed data (Spain02/v2.1) and selected RCMs from ENSEMBLES project





[Table body not recoverable from the extracted source: only the column header "Temporal cover", the row label "Observed data", and the institution footnotes below survive.]

aUniversidad de Cantabria, Spain

bCommunity Climate Change Consortium for Ireland

cMétéo-France, Centre National de Recherche Météorologiques

dDanish Meteorological Institute

eSwiss Federal Institute of Technology Zurich

fHadley Centre, UK

gInternational Centre for Theoretical Physics, Italy

hRoyal Netherlands Meteorological Institute

iNorwegian Meteorological Institute

jMax-Planck-Institut für Meteorologie, Germany

lSwedish Meteorological and Hydrological Institute

mUniversidad de Castilla La Mancha, Spain

3 Overview

On-site ensemble PDF were built, following the flow chart in Fig. 2. First, the time series of AMDSL were obtained from the RCM and the observed data. Afterward, the bias analysis outcome is the model performance criterion (RB), which was computed using the empirical PDF of AMDSL over 1961–1990. For the estimation of RB, the cumulative distribution functions (CDFs) were built from the observed dataset and the RCMs for 1961–1990. Figure 3 presents an example of CDFs for selected sites.
Fig. 2

Flow diagram of methodology at each site for building ensemble PDF. The several procedures are: a Reading data, b bias analysis, c convergence analysis, d REA computing, e non stationary time series analysis, and f ensemble PDF building
Fig. 3

CDFs of AMDSL from both observed dataset (in black) and RCMs (in colour), for 1961–1990 period

The empirical PDF on 2021–2050 were used to compute the model convergence criterion (RD) through the convergence analysis. Using both RB and RD, the reliability factor (R) and the normalized reliability factor (Pm) were obtained. Following the non-stationary analysis in the flow chart, the time series between 1961 and 2050 were considered to fit non-stationary PDF for each RCM using GAMLSS. These non-stationary PDF were later used to build the on-site ensemble PDF, using the normalized reliability factor Pm as the weighting factor.

3.1 Time series of annual maximum dry spell length (AMDSL)

Considering all the sites defined in continental Spain (Fig. 1a), the time series of dry spell lengths (for P < 1 mm/day) were obtained from the daily observed rainfall dataset. In the study area, maximum dry spell lengths greater than 1 year (365 days) were not identified. Therefore, a maximum dry spell length for each year (the annual maximum dry spell length, AMDSL) could be considered.
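As an illustration, the extraction of an AMDSL series from a daily rainfall record can be sketched as below. The function name is ours, and cutting spells at year boundaries is a simplifying assumption not specified in the text (the text only fixes the 1 mm/day threshold):

```python
import numpy as np

def annual_max_dry_spell(daily_rain, years, threshold=1.0):
    """Annual maximum dry spell length (AMDSL) per year.

    daily_rain : 1-D array of daily rainfall (mm/day)
    years      : 1-D array of the same length giving the year of each day
    Spells are counted within each calendar year (runs crossing 1 January
    are split -- a simplifying assumption).
    """
    amdsl = {}
    for year in np.unique(years):
        dry = daily_rain[years == year] < threshold  # dry day: P < 1 mm/day
        max_run = run = 0
        for is_dry in dry:                # longest run of consecutive dry days
            run = run + 1 if is_dry else 0
            max_run = max(max_run, run)
        amdsl[int(year)] = max_run
    return amdsl
```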

4 Computation of reliability factors

The weighting factors of each RCM to build the ensemble PDF were estimated. In the REA methodology, proposed by Giorgi and Mearns (2002), the model reliability factor R for the RCM is defined as follows:
$$ R = \left[ \left( R_{B} \right)^{b} \times \left( R_{D} \right)^{d} \right]^{1/\left( b \times d \right)} $$
where RB is a measure of the model reliability as a function of the model bias in the simulation of present day AMDSL (1961–1990 period), so the greater the bias, the lesser the confidence of the model. RD is a factor for measuring model reliability in terms of distance to REA future average change, with the reliability of the model being lower at a greater distance. The parameters b and d are the criterion weights. It is assumed b = d = 1, giving equal weight to both criteria. The RB is considered a measure of the model performance criterion, and RD a measure of the model convergence criterion (Giorgi and Mearns 2002).

The computation of the weighting factors RB and RD is based on empirical cumulative distribution functions (e-CDF), and on two quantitative measures used to compare the agreement among the probability functions. The Weibull plotting position formula was used to compute the empirical quantiles of the e-CDF. The first metric is the p value from the well-known two-sample Smirnov-Kolmogorov goodness of fit test (hereafter TSSK test; Sheskin 2000; Gibbons and Chakraborti 2003). The second metric corresponds to the skill score (SSCORE) proposed by Perkins et al. (2007), measuring the common area under the PDF curves.

In the case of the estimation of the model performance criterion RB, the e-CDF were built from observed data and from the RCM over the 1961–1990 time period. The PDF from the observed data represents the "reference" in this period. For the model convergence criterion RD, the difficulty is that there is no known reference PDF for future climate. Following Giorgi and Mearns (2002), an iterative process is used to obtain the estimated PDF and therefore to estimate RD. The estimated PDF was built using bootstrapping techniques with N = 10,000 data, considering the simulated series for the models between 2021 and 2050 (30 years). Initially, the reference PDF is built assigning equal weights to all RCM (that is, each model contributes 10,000/17 ≈ 588 data, obtained from sampling with replacement from the simulated series of 30 years). Then, the distance of each RCM to the estimated PDF is calculated and the assigned weights are readjusted accordingly. This procedure converges quickly after a few iterations. It should be noted that the PDF built in this way is only an estimate of the distribution of the AMDSL of the future climate projection. In accordance with Giorgi and Mearns (2002, 2003), the REA average does not represent the actual climate response to the climate forcing scenarios; however, the REA average represents the best estimate of this response.
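The iterative re-weighting described above can be sketched as follows. This is an illustration only: the function name, the fixed iteration count, and the use of the TSSK p value as the distance measure for re-weighting are our assumptions (the text leaves the distance measure to the metrics discussed below):

```python
import numpy as np
from scipy import stats

def converged_reference_sample(rcm_series, n_boot=10_000, n_iter=10, seed=0):
    """Iteratively re-weighted bootstrap estimate of the future reference
    distribution: each model contributes to the pooled sample in proportion
    to its current weight, and weights are refreshed from each model's
    agreement (here, two-sample KS p value) with the pooled sample."""
    rng = np.random.default_rng(seed)
    m = len(rcm_series)
    weights = np.full(m, 1.0 / m)          # start with equal weights
    for _ in range(n_iter):
        counts = np.round(weights * n_boot).astype(int)
        pooled = np.concatenate([
            rng.choice(s, size=c, replace=True)
            for s, c in zip(rcm_series, counts) if c > 0
        ])
        # agreement of each RCM with the pooled reference sample
        pvals = np.array([stats.ks_2samp(s, pooled).pvalue for s in rcm_series])
        if pvals.sum() == 0:
            break
        weights = pvals / pvals.sum()      # readjust the assigned weights
    return pooled, weights
```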

Finally, the likelihood associated with a simulated change for the RCM is proportional to the model reliability factor R (Giorgi and Mearns 2002, 2003). The normalized reliability factors, Pm in Eq. (2), can be interpreted as this likelihood associated with each RCM. It has been shown that the normalized reliability factors are analogous to the accuracy factors defined by Bayesian approaches (Tebaldi et al. 2005; Tebaldi and Knutti 2007). The likelihood Pm for each RCM is defined as follows (Giorgi and Mearns 2003):
$$ Pm_{i} = \frac{R_{i}}{\sum\nolimits_{j = 1}^{N} {R_{j}}} $$

In this work, the TSSK p value and the Perkins SSCORE have been used to compute the reliability factors (RB and RD) and, consequently, R and Pm, giving different measures of reliability that will be discussed later.
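Combining the two criteria, R and the normalized weights Pm follow directly from the two equations above. A minimal array-based sketch (function name ours; b = d = 1 as assumed in the text):

```python
import numpy as np

def reliability_factors(r_b, r_d, b=1.0, d=1.0):
    """REA reliability factor R and normalized reliability factor Pm,
    for arrays of per-model performance (r_b) and convergence (r_d) factors."""
    r = (r_b ** b * r_d ** d) ** (1.0 / (b * d))
    pm = r / r.sum()          # normalized weights; sum(pm) == 1
    return r, pm
```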

4.1 Two-sample Smirnov-Kolmogorov goodness of fit test

The TSSK test is a non-parametric test which evaluates the equality between the probability functions associated with two independent samples, quantifying the maximum distance between the e-CDF built from the samples (the time series of AMDSL). The two-sided TSSK statistic is:
$$ D_{n_{1},n_{2}} = \mathop {\max }\limits_{y} \left| {S_{n_{1}} \left( y \right) - S_{n_{2}} \left( y \right)} \right| $$
where Sn1 (y) and Sn2 (y) are the empirical distribution functions of the independent samples, and n1 and n2 are the sample sizes. For the asymptotic null distribution, the following has been proved (Gibbons and Chakraborti 2003):
$$ \mathop {\lim }\limits_{{n_{1} ,n_{2} \to \infty }} P\left[ {\sqrt {\frac{{n_{1} n_{2} }}{{n_{1} + n_{2} }}} D_{{n_{1} ,n_{2} }} \le d} \right] = L\left( d \right){\text{ with}}\quad L\left( d \right) = 1 - 2\sum\limits_{i = 1}^{\infty } {\left( { - 1} \right)^{i - 1} \exp \left( { - 2i^{2} d^{2} } \right)} $$
Finally, the p value of the goodness of fit test could be computed using:
$$ p\left( d \right) = 1 - L\left( d \right) = 2\sum\limits_{i = 1}^{\infty } {\left( { - 1} \right)^{i - 1} \exp \left( { - 2i^{2} d^{2} } \right)} $$
In general, if the p value is greater than α (the statistical significance level), the fit is accepted as successful. Common values of α are 0.01 or 0.05.
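The asymptotic p value series above can be evaluated directly; the statistic D itself is conveniently taken from `scipy.stats.ks_2samp`. Note, as a caveat, that scipy computes an exact p value for small samples, so its p value may differ slightly from the asymptotic series (the sample data below are stand-ins, not from the paper):

```python
import numpy as np
from scipy import stats

def tssk_pvalue(d, n1, n2, terms=100):
    """Asymptotic two-sided p value, p(d) = 2 * sum_i (-1)^(i-1) exp(-2 i^2 d'^2),
    with d' = sqrt(n1*n2/(n1+n2)) * d, truncated to `terms` terms."""
    dp = np.sqrt(n1 * n2 / (n1 + n2)) * d
    i = np.arange(1, terms + 1)
    p = 2.0 * np.sum((-1.0) ** (i - 1) * np.exp(-2.0 * i ** 2 * dp ** 2))
    return float(np.clip(p, 0.0, 1.0))

rng = np.random.default_rng(1)
obs = rng.gamma(2.0, 15.0, size=30)   # stand-in for an observed AMDSL series
sim = rng.gamma(2.0, 15.0, size=30)   # stand-in for one RCM's AMDSL series
d = stats.ks_2samp(obs, sim).statistic
p_asym = tssk_pvalue(d, 30, 30)
```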

4.2 Perkins skill score

Perkins et al. (2007) developed a method to evaluate the similarity between two PDFs through a skill score (SSCORE) that calculates the common area under the two curves defined by the PDFs. If the distributions overlap perfectly, the SSCORE will be one. As the relationship between the e-CDF (used by the TSSK test) and the empirical PDF is straightforward, the SSCORE is easy to compute. The SSCORE, applied by several authors to assign weights to the climate models of an ensemble for several hydrometeorological variables (Perkins and Pitman 2009; Boberg et al. 2009, 2010; Smith and Chandler 2010; Giraldo Osorio and García Galiano 2011), is estimated as:
$$ S_{SCORE} = \int\limits_{ - \infty }^{{x_{0} }} {f_{1} \left( x \right){\text{d}}x} + \int\limits_{{x_{0} }}^{ + \infty } {f_{2} \left( x \right){\text{d}}x} $$
where x0 is the crossing point of the two PDFs: for x < x0, f1(x) < f2(x); likewise, for x > x0, f1(x) > f2(x).
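In practice the common area is usually computed from relative-frequency histograms on shared bins; a minimal sketch (the bin count is an arbitrary choice of ours; Perkins et al. use fixed-width bins on the native variable):

```python
import numpy as np

def perkins_sscore(x1, x2, bins=20):
    """Perkins et al. (2007) skill score: common area under two empirical
    PDFs, from relative-frequency histograms on shared bins."""
    lo, hi = min(x1.min(), x2.min()), max(x1.max(), x2.max())
    f1, _ = np.histogram(x1, bins=bins, range=(lo, hi))
    f2, _ = np.histogram(x2, bins=bins, range=(lo, hi))
    p1, p2 = f1 / f1.sum(), f2 / f2.sum()
    return float(np.minimum(p1, p2).sum())   # 1.0 for identical histograms
```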

4.2.1 Non-stationary analysis of AMDSL time series

The stationarity of hydrometeorological time series cannot be guaranteed in the target area; therefore a methodology for modeling the time variation of the PDF parameters is needed. In the present work, GAMLSS tools are applied, assuming parametric probability distributions for the explained variable (in this case, Y = AMDSL). The PDF parameters have been modeled as functions of the explanatory variable (time t), using cubic splines as smoothing functions. Rigby and Stasinopoulos (2005) and Stasinopoulos and Rigby (2007) present a detailed discussion regarding the selection and fitting of statistical models using GAMLSS tools.

The number of parameters used to fit statistical models depends on the chosen distribution, but it is usually less than four (the first parameter for location, the second for scale, and finally the third and fourth are shape parameters). In the present work, distributions with more than two parameters are not justified due to the short length of records (90 annual observations for 1961–2050 time period), hampering the fitting of shape parameters. Therefore, four distributions of two parameters, widely used for statistical modeling of hydrologic series, were taken into account: Gamma (GA), Gumbel (GU), Lognormal (LN), and Weibull (WEI). The relationship between both the first and second distribution parameters with E[Y] and Var[Y] is explained by Stasinopoulos et al. (2008).

In accordance with the procedure suggested by Stasinopoulos and Rigby (2007), the models were fitted considering the Schwarz Bayesian Criterion (SBC), which uses the penalty k = log(n), limiting the effective degrees of freedom to λ ≤ 4. The value of λ is obtained for each distribution considered at every site. The best distribution was selected according to the minimum value of SBC. The independence and normality of the randomized quantile residuals are used to ensure that the selected model adequately describes the data, estimating the mean, variance, skewness, kurtosis, and the Filliben correlation coefficient (Filliben 1975). Additionally, visual inspections of qq-plots and worm plots (not shown), proposed by van Buuren and Fredriks (2001), were performed to verify the normality of the residuals.
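The paper performs this selection with GAMLSS (an R package) and non-stationary, spline-smoothed parameters. As a rough stationary analogue only, the SBC-based choice among the four two-parameter families can be sketched with scipy maximum-likelihood fits (the time covariate and spline terms are omitted, so this illustrates the selection criterion, not the paper's actual fits):

```python
import numpy as np
from scipy import stats

def select_distribution(y):
    """Pick among the four two-parameter families by minimum SBC,
    SBC = -2 log L + k log(n), with k = 2 free parameters per family.
    Stationary fits only (no time covariate), unlike the paper's GAMLSS fits."""
    n = len(y)
    dists = {"GA": stats.gamma, "GU": stats.gumbel_r,
             "LN": stats.lognorm, "WEI": stats.weibull_min}
    sbc = {}
    for name, dist in dists.items():
        # fix the location at zero for the positive-support families
        params = dist.fit(y) if name == "GU" else dist.fit(y, floc=0)
        loglik = np.sum(dist.logpdf(y, *params))
        sbc[name] = -2.0 * loglik + 2.0 * np.log(n)
    best = min(sbc, key=sbc.get)
    return best, sbc
```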

4.2.2 On site ensemble PDF and interpolated maps of AMDSL

Non-stationary PDF, associated with each RCM, were fitted. Afterwards, ensemble PDF were built at each grid site, using the information provided by the maps of normalized reliability factors Pm. For building the on-site ensemble PDF, samples of 10,000 values drawn from the non-stationary PDF of each RCM were considered. For example, if Pmi = 0.25, then the non-stationary PDF fitted to RCMi contributed 2,500 values to the final ensemble PDF. Since the ensemble PDF was built using distributions with non-stationary parameters, the procedure was repeated to define the final PDF every 10 years (1961, 1970, 1980, and so on until 2050).
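The weighted sampling can be sketched as below, where `rcm_samplers` is a hypothetical interface of ours: one callable per RCM, returning k draws from that model's fitted PDF at the evaluation year:

```python
import numpy as np

def ensemble_sample(rcm_samplers, pm, n=10_000):
    """On-site ensemble sample: each RCM contributes round(Pm_i * n) values
    drawn from its fitted PDF (e.g. Pm_i = 0.25 -> 2,500 of 10,000 values)."""
    counts = np.round(np.asarray(pm) * n).astype(int)
    draws = [sampler(c) for sampler, c in zip(rcm_samplers, counts) if c > 0]
    return np.concatenate(draws)
```

With scipy, a sampler could be, for instance, `lambda k: stats.gamma(a, scale=s).rvs(k)` built from the parameters fitted at a given year.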

From the ensemble PDF several spatial distributions of statistics with their respective 95 % confidence intervals (CI) were computed, using bootstrapping techniques (Efron and Tibshirani 1993).
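A percentile bootstrap in the spirit of Efron and Tibshirani (1993) is enough to sketch how such confidence intervals are obtained (function name and replicate count are illustrative choices):

```python
import numpy as np

def bootstrap_ci(sample, stat=np.mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap (1 - alpha) confidence interval for a statistic
    of the ensemble sample: resample with replacement, recompute the
    statistic, and take the empirical quantiles of the replicates."""
    rng = np.random.default_rng(seed)
    reps = np.array([stat(rng.choice(sample, size=len(sample), replace=True))
                     for _ in range(n_boot)])
    return np.quantile(reps, [alpha / 2, 1 - alpha / 2])
```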

5 Results analysis

5.1 Relationship between TSSK p value and Perkins skill score

The metrics used to assess the PDF agreement (SSCORE and TSSK p value) give values in [0,1], with unity meaning perfect agreement between the empirical PDFs. The box-plots in Fig. 4a show the distribution of the RB factor for each RCM over all grid sites, computed with SSCORE, while Fig. 4b shows the box-plots of the RB factor computed with the TSSK p value.
Fig. 4

Comparative of RB factor from RCMs over all grid sites: a Box-plots of RB factor computed with SSCORE; b Box-plots of RB factor computed with TSSK p value; and c dispersion plot of median values of TSSK p value versus SSCORE for each RCM. In this last figure, the dotted lines show the location of median value computed for all RCM

The median values of RB for every RCM, computed from the box-plots, were used to build Fig. 4c. The figure reveals the main difference between the metrics used: the TSSK p value is a "steeper" metric than the SSCORE, so its values span several orders of magnitude (the RB–TSSK axis is in logarithmic scale), while the SSCORE values are contained within a comparable scale. Figure 4c shows a strong relationship between RB–TSSK p value and RB–SSCORE. The scatter plot of the RB median values indicates that the RCM with the best skill in simulating the observed data are DMI/ARPEGE and CNRM/RM5.1, while the models with the worst performance are DMI/BCM, SMHI/BCM and METNO/BCM.

Figure 5a and b show the box-plots built from RD values. The values of RD are, in general, greater and show less spread than the RB values. Figure 5c shows a weaker relationship between RD–TSSK p value and RD–SSCORE than for the RB factor, but it is noteworthy that the model with the weakest convergence is DMI/ARPEGE, which is one of the best for RB. This shows that convergence among models in the future does not guarantee a small bias with respect to the observed data, as was stated by Giorgi and Mearns (2002).
Fig. 5

Similar to Fig. 4, but for RD

Figure 6a and b present the box-plots built for R, while Fig. 6c shows a strong relationship between the median values of R–TSSK p value and RSSCORE. Figure 6c identifies KNMI/RACMO2 and MPIM/REMO as high performing models. On the other hand, as well as for the RB factor, the R factor displays that DMI/BCM, SMHI/BCM and METNO/BCM are less skilful models. It is remarkable that these models are driven by the same GCM (BCM).
Fig. 6

Similar to Fig. 4, but for R

Finally, Fig. 7a and b show the box-plots of Pm for both metrics, and Fig. 7c displays the scatter plot of the median values of Pm–TSSK p value versus PmSSCORE extracted from the box-plots. The weak relationship between Pm–TSSK p value and PmSSCORE in Fig. 7c should be highlighted. This can be explained through the R values computed with the TSSK p value: at some sites every RCM shows poor reliability (all RCMs have R–TSSK p value → 0), so the normalized Pm values become misleading with respect to R. In other words, a high Pm value for a particular RCM does not correspond to high reliability R; instead, it reflects the lack of reliability of every model, including the models considered the best at other sites of the study area, where high Pm values do correspond to high R values.
Fig. 7

Similar to Fig. 4, but for Pm

The effect of the scale difference between the two metrics on the results can be anticipated from the values in Table 2. For each RCM, the pairs of median values (SSCORE, TSSK p value) of RB, RD, R and Pm, used to build the scatter plots in Figs. 4, 5, 6 and 7, are presented in Table 2. The Pm columns do not add up to one because they do not come from a particular site; instead, they are the median values from the box-plots in Fig. 7a and b. However, if the Pm column values are divided by their sum (1.003 for the SSCORE and 0.8360 for the TSSK p value), a measure of the "average contribution" of every RCM to the final ensemble PDF is obtained. On this basis, if the Pm values are computed using the TSSK p value, the five RCMs with the highest Pm median values contribute 82.1 % of the final ensemble PDF according to Table 2, with MPIM/REMO being the most heavily weighted RCM (18.2 %). However, if the Pm values are computed using the SSCORE, all RCMs contribute almost equally to the ensemble PDF: the five RCMs with the highest SSCORE contribute 30.6 % of the final PDF, while the five lowest contribute 27.2 %.
Table 2

Median values of RB, RD, R and Pm using both the SSCORE and TSSK p value (as presented in box-plots of Figs. 4 and 7)
[Table body not recoverable from the extracted text]
In the figures, the model RegCM3 is presented as REGCM3

The five highest Pm values for each metric have been highlighted in italics. The Pm columns do not add up to one because they do not come from one particular site

Two metrics to compute the member weights in the ensemble PDF have been presented, and their results are appreciably different: the TSSK p value is a steeper metric which takes into account only the better models to build the ensemble PDF, while the SSCORE uses virtually all members, with slightly greater weights for the better models. The following sections analyse the effects on the computation of the mean, the standard deviation and some centile values when each metric is used to build the ensemble PDF at every site.
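The two weighting behaviours can be contrasted with the REA normalization Pm = Rm / ΣRm. The reliability values below are invented for illustration only; they merely mimic the "single scale" versus "several scales" behaviour described above:

```python
import numpy as np

def normalized_reliability(R):
    """REA normalization: Pm = R_m / sum(R_m), so the weights sum to one."""
    R = np.asarray(R, dtype=float)
    return R / R.sum()

# Invented reliability factors for a six-member ensemble (illustrative only):
R_sscore = np.array([0.60, 0.55, 0.50, 0.48, 0.45, 0.42])  # one comparable scale
R_tssk   = np.array([0.20, 0.05, 0.01, 1e-3, 1e-4, 1e-5])  # several magnitude scales

Pm_sscore = normalized_reliability(R_sscore)  # near-even weights for all members
Pm_tssk   = normalized_reliability(R_tssk)    # weight concentrated in a few members
```

With the comparable-scale factors no member dominates, whereas with the multi-scale factors the single best member captures most of the total weight, which is precisely the "steepness" contrast between the SSCORE and the TSSK p value.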

5.2 Spatial distributions

5.2.1 Reliability factor maps

The spatial distributions of the mean values of the reliability factors are presented in Fig. 8. These maps give a general idea of the skill of the RCM set in modelling the AMDSL. The different scales of values obtained with the two metrics (SSCORE and TSSK p value), explained previously, should be taken into account.
Fig. 8

Mean value maps of reliability factor RB (left), RD (centre); and R (right), assessed using: a the SSCORE; and b the TSSK p value

The upper frames of the figure (Fig. 8a) show the maps built with the SSCORE. The RBSSCORE map presents its highest values (blue) in Galicia and on the Central Plateau of the Iberian Peninsula (autonomous regions of Castilla-La Mancha, Extremadura and Madrid), which indicates that the set of models exhibits a smaller bias in this area. The bias increases eastward, with the lowest values of RBSSCORE in the Levante area. The RDSSCORE map does not present a meaningful variation in spatial distribution, in accordance with the strong future convergence of every model across the study area. The lowest values of RDSSCORE are located along the Mediterranean coastline. The RSSCORE map is strongly influenced by the spatial changes in the RBSSCORE map, so the lowest values of RSSCORE are also located in the southeast of Spain. All of these maps present values of comparable magnitude.

The bottom frames of Fig. 8b show the maps built with the TSSK p value. The RB–TSSK p value map shows a spatial variation that resembles the distribution of the RBSSCORE map. However, its range of values extends from less than 0.001 (northeast of Spain) to greater than 0.1 (western Andalusia). The fact that the values span several orders of magnitude makes it easier to discern the areas where, in general, the set of models performs well or poorly. As a result, there is poor agreement between the observed data and the set of models in the north and northeast of Spain, while the mean bias is lower in western Galicia and the southwest of Spain. The RD–TSSK p value map reveals a zonal gradient, decreasing eastward. Nevertheless, the future convergence of the set of models is large across the study area when compared against the confidence value α = 0.05 (generally accepted as a good fit between probability distributions). The RD–TSSK p value map does not show scale variations as large as those of the RB–TSSK p value map, but they are greater than the variations in the RDSSCORE map. Lastly, the R–TSSK p value map shows lower values in the upper half of the Iberian Peninsula, together with meaningful scale variations (from less than 0.001 to greater than 0.01). This last feature becomes important when the normalized reliability factor is computed: the poor performance of the set of models in the north of Spain could lead to misleadingly large Pm values, associated with models which are not important in other areas.

5.2.2 Pm factor maps

The spatial distributions of the Pm factor are presented in Fig. 9 for the SSCORE and in Fig. 10 for the TSSK p value. These maps represent the contribution of each RCM to the ensemble PDF at every analysis site. In both Figs. 9 and 10, a darker blue colour indicates a higher Pm factor. In both cases, the sum of the seventeen maps for each metric should be unity at all sites.
Fig. 9

Maps of Pm for each RCM, using the REA values estimated with the Perkins methodology (SSCORE)
Fig. 10

Maps of Pm for each RCM, using the REA values estimated with the Smirnov-Kolmogorov test (TSSK p value)

From the results summarized in Table 2, the models with the best performance according to PmSSCORE are identified (CNRM/RM5.1, DMI/ARPEGE, KNMI/RACMO2, MPIM/REMO and SMHI/ECHAM5-r3). These RCMs exhibit spatial distributions without a clear spatial pattern (Fig. 9), with values ranging between 0.055 and 0.065. The models with the worst performance according to Table 2 (DMI/BCM, METNO/BCM and SMHI/BCM) present heterogeneous spatial distributions, with values spread between 0.040 and 0.060. The lowest values of these three models are located in the northeast quadrant of the Spanish territory. According to Figs. 7a and 9, every model contributes 4–7 % to the ensemble PDF at the analysis sites.

The maps of Pm–TSSK p value (Fig. 10) exhibit greater spatial variation than the PmSSCORE maps. The better models according to Pm–TSSK p value in Table 2 (DMI/ECHAM5-r3, KNMI/RACMO2, METO HC/HAD, MPIM/REMO and UCLM/PROMES) show significant spatial variations. Of these five models, only two (KNMI/RACMO2 and MPIM/REMO) show an outstanding performance according to the R factor (see Table 2, R–TSSK p value column). Two of the others could be considered models of average performance (METO HC/HAD and UCLM/PROMES), and one is even among the poorer-performing models (DMI/ECHAM5-r3). The explanation for this situation lies in the uneven spatial distribution of the REA value of the set of models (R–TSSK p value in Fig. 8). According to Fig. 8, every RCM performs poorly in the northern half of Spain. Consequently, the high Pm values computed for some models (particularly DMI/ECHAM5-r3, METO HC/HAD and UCLM/PROMES) are not an indication of outstanding reliability of those particular models. On the contrary, they reveal the lack of reliability of the whole set of models in these areas of the Spanish territory.

5.2.3 Maps of interpolated statistics from ensemble PDF

Once the Pm maps were calculated for each RCM (with either metric, SSCORE or TSSK p value), the ensemble PDF was built for each site of the study area.
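As a sketch of this step, assuming each member's PDF is available as a function and the Pm weights are known, the ensemble PDF at a site is the Pm-weighted mixture of the member PDFs. The Gaussian member shapes and the parameter values below are placeholders, not the fitted AMDSL distributions used in the paper:

```python
import numpy as np

def gaussian_pdf(mu, sigma):
    """Placeholder member PDF; the paper fits non-stationary AMDSL distributions."""
    return lambda x: np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def ensemble_pdf(x, member_pdfs, weights):
    """Pm-weighted mixture of the member PDFs at one site."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # guard: weights must sum to one
    return sum(wi * f(x) for wi, f in zip(w, member_pdfs))

x = np.linspace(0.0, 120.0, 2001)  # dry spell length grid (days)
pdf = ensemble_pdf(x, [gaussian_pdf(40.0, 8.0), gaussian_pdf(55.0, 12.0)], [0.7, 0.3])
area = float(np.sum(0.5 * (pdf[1:] + pdf[:-1]) * np.diff(x)))  # trapezoidal integral
```

Because the weights are normalized and each member integrates to one, the mixture integrates to one as well, so any statistic (mean, SD, centiles) can then be read off the ensemble PDF directly.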

It has been shown that, depending on the metric chosen, the contribution of each RCM to the final ensemble PDF can differ significantly. Therefore, the maps constructed for the different statistics are expected to reflect these differences. The selected statistics and their 95 % confidence intervals were calculated using bootstrapping techniques.
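A minimal sketch of the percentile bootstrap behind such confidence intervals follows; the resampling size, the synthetic gamma sample and the seed are assumptions for illustration:

```python
import numpy as np

def bootstrap_ci(sample, stat=np.mean, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap (1 - alpha) confidence interval for a statistic."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    reps = np.array([stat(rng.choice(sample, size=sample.size, replace=True))
                     for _ in range(n_boot)])
    return np.quantile(reps, [alpha / 2.0, 1.0 - alpha / 2.0])

rng = np.random.default_rng(7)
amdsl = rng.gamma(shape=4.0, scale=10.0, size=300)  # synthetic draws from an ensemble PDF
lo, hi = bootstrap_ci(amdsl, stat=np.mean)          # 95 % CI of the mean AMDSL
```

The same routine serves for the standard deviation or any centile by swapping the `stat` argument; a change between 1990 and 2050 is then flagged significant when the two intervals do not overlap.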

The maps of the AMDSL mean of the ensemble PDF for both metrics (SSCORE and TSSK p value), for the years 1990 and 2050, are presented in Fig. 11. The 1990 maps for both metrics show a gradient increasing towards the south, with maximum values of mean AMDSL in the southeast and southwest of the Spanish territory. A similar spatial distribution is obtained for 2050, but with generally higher values in the south of Spain.
Fig. 11

Maps of AMDSL mean for years 1990 (left) and 2050 (center), and their percentage change (right), assessed as [100 × (map2050 − map1990)/map1990]. The maps were built using: a the SSCORE; b the TSSK p value to compute the reliability factors; and c the difference maps between metrics, assessed as [100 × (mapa − mapb)/0.5 × (mapa + mapb)]. The shaded areas represent significant change/difference (95 % confidence interval)

The right column of Fig. 11 presents the calculated change maps (difference 2050–1990). From these maps, and for both metrics, significant increases in the mean AMDSL at the future horizon are predicted for virtually all the Spanish territory, except for some areas of the Central Plateau, the northeast of the Iberian Peninsula and the Cantabrian coast. The maps of differences between the metrics (Fig. 11c) show a similar spatial distribution for the selected years.

The positive values (higher values computed with the SSCORE) are concentrated in the northeast quadrant of the Iberian Peninsula, the Cantabrian coast and the Central Plateau, while the negative values (higher values from the TSSK p value) are located in the south and southeast of the Iberian Peninsula. Despite the differences between the maps for each metric (which can reach 40 % in some areas), both approaches conclude that there will be a significant and widespread increase throughout continental Spain in the mean value of AMDSL between 1990 and 2050.

The maps constructed for the standard deviation (SD) for the years 1990 and 2050, and for both metrics, are presented in Fig. 12. These maps also show values increasing towards the south, as do the maps of the mean value. However, the difference maps between the metrics (Fig. 12c) show that the SD–SSCORE maps have significantly higher values in the northeast of the Iberian Peninsula and on the Central Plateau of Spain, while the SD–TSSK p value maps present significantly higher values concentrated in a few places in the southeast of the Peninsula. Although both metrics agree on the general increase in SD throughout the Spanish territory (1990–2050 relative change maps, right column of Fig. 12), the conclusions regarding the significance of this change differ slightly. With the SSCORE, a significant change in SD is estimated mainly in Galicia (north) and Extremadura (southwest), while it is not possible to conclude that there is a significant change in this statistic for the central area of the Peninsula, Andalusia, and the Levante area (southeast). The TSSK p value also indicates a significant change of SD in Galicia and Extremadura, but additionally for zones of Andalusia and Aragon, where the SSCORE does not detect a significant change. The increase in SD is understood as an increase in variability, which implies increases in the highest quantiles of the AMDSL distribution.
Fig. 12

Similar to Fig. 11, but for maps of AMDSL standard deviation (SD)

6 Discussion of results

Two metrics (SSCORE and TSSK p value) have been used to compute the model weighting factors in the ensemble PDF (Pm in Eq. 2) through the REA method. The Pm value should always lie in the [0,1] interval, but the values computed with the SSCORE metric fall within a single scale (0.35–0.90, as presented in Fig. 7a), while the values computed with the TSSK p value metric span several orders of magnitude (from below 0.001 to above 0.1, Fig. 7b).

Several scatter plots have been presented in order to analyse the relationship between the SSCORE and the TSSK p value. From these scatter plots, the models were ranked according to their performance with respect to the bias from the observed data (RB, Fig. 4c), the future convergence (RD, Fig. 5c), and the two previous criteria together (R, Fig. 6c). According to RB, the least biased models are DMI/ARPEGE and CNRM/RM5.1 (both driven by the GCM ARPEGE), while DMI/BCM, SMHI/BCM and METNO/BCM are the most biased models. However, the RD factor shows the DMI/ARPEGE RCM as the farthest from the general future agreement. Therefore, the future convergence of the models does not guarantee that the biases with respect to the observed data are small. Finally, the reliability factor R indicates that KNMI/RACMO2 and MPIM/REMO are, in general, the most suitable models for simulating AMDSL in Spain, while DMI/BCM, SMHI/BCM and METNO/BCM (as for the RB factor) show, in general, the least skill.

In spite of the different magnitude scales of the results, the metrics agree that the set of models exhibits a smaller bias in the southwest quadrant of Spain and in Galicia (where the RB factor is higher, left column of Fig. 8), while the observed data are simulated worse by the set of models in the north and northeast. As for the future convergence maps, the RDSSCORE map (middle column, Fig. 8a) is not as steep as the RD–TSSK p value map (middle column, Fig. 8b), whose values show a zonal gradient decreasing eastward. Finally, the RSSCORE map (right column, Fig. 8a) inherits the main characteristics of the RBSSCORE map: low values in the northeast and the Spanish Levante area, and high factor values in the west of Spain. The R–TSSK p value map (right column, Fig. 8b), however, exhibits lower simulation skill over the upper half of Spain.

If the number of ensemble members is high (seventeen models were considered in this work), the SSCORE metric appears unsuitable, since it tends to compute the weighting factors evenly. The TSSK p value seems to solve this problem because it penalizes the less skilful models with very small R values (R → 0). The better models should have RB and/or RD values greater than the significance level α, so their reliability factor R must be greater than α². Nevertheless, a problem arises when, at a particular site, the reliability factor of all ensemble members tends to zero (R ≪ α²). In this case, it has been shown that a high Pm value is not a consequence of the high performance of a particular model; instead, it reflects the poor performance of all models, so that "the least bad model" prevails. This fact shows the need to reach a compromise between the metrics. The compromise should reward the models with better results according to the statistical test, giving them a greater specific weight in the ensemble PDF, as the TSSK p value metric does. However, if very small R factor values have been computed using the TSSK p value metric, the compromise should tend to level the model weights, as the SSCORE metric does. For example, assume that the TSSK p value metric is used and a significance level α = 0.01 is set. Then RB > 0.01 means that the model data are statistically unbiased against the observed data, and RD > 0.01 means that the model data agree with the reference distribution in the future. Hence, a potential critical value for the reliability factor could be RCRIT = RB-CRIT·RD-CRIT = α² = 1E−4. The models with reliability factor values lower than the critical value (R < 1E−4) should be evenly weighted in the distribution. In the most extreme case, when all RCMs have reliability factor values lower than the critical value, all members of the ensemble should have almost the same weight.
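One way to encode the compromise suggested above is to floor each reliability factor at RCRIT = α² before normalizing: sub-critical members then share a common (equal) weight, and a fully sub-critical ensemble degenerates to even weighting. This is only a sketch of one possible reading of the proposal, not the authors' implementation:

```python
import numpy as np

ALPHA = 0.01
R_CRIT = ALPHA ** 2  # RB > alpha and RD > alpha imply R = RB * RD > alpha**2

def blended_weights(R, r_crit=R_CRIT):
    """Floor sub-critical reliabilities at r_crit, then normalize to get Pm."""
    R = np.asarray(R, dtype=float)
    floored = np.maximum(R, r_crit)  # sub-critical members get one common value
    return floored / floored.sum()

skilful  = blended_weights([0.10, 0.02, 1e-6, 1e-7])   # reliable members dominate
hopeless = blended_weights([1e-6, 1e-7, 1e-8, 1e-9])   # all sub-critical: even split
```

The floor keeps the steep TSSK-style weighting where at least one member is reliable, yet reproduces the level SSCORE-style weighting at the sites where every member fails the test.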

The change maps interpolated from the ensemble PDFs of both metrics present significant changes in the mean AMDSL for the prediction horizon over almost all the Spanish territory. Nevertheless, the difference between the metric maps can reach almost 40 %. Significant positive values (values computed with the SSCORE are higher) are located in the Spanish northeast, the Cantabrian coast and the Central Plateau, while significant negative values (higher values from the TSSK p value) are located in the south and southeast of the Iberian Peninsula.

In summary, although the metrics agree on the general increase of the standard deviation (SD) of AMDSL in Spain, the spatial distributions of significant changes differ slightly: the SD–SSCORE change map (right column, Fig. 12a) shows a significant change for the regions of Galicia and Extremadura, while the change is not significant on the central Iberian Peninsula, in Andalusia, or in the Levante area. The SD–TSSK p value change map not only agrees on the significant changes for Galicia and Extremadura, but also indicates them for areas of Andalusia and Aragon. The differences between the metrics are significant and positive (values computed with the SSCORE are higher) over almost all the eastern half of the Iberian Peninsula, except for the Levante area, where the differences are negative but not significant.

7 Conclusions

This paper presents two different metrics to compute the RCM weighting factors (the normalized reliability factor Pm), which allow the building of ensembles based on dry spell length PDFs. The weighting factors have been computed using the REA methodology, and the REA factors (reliability and future convergence) have been computed using two metrics (SSCORE and TSSK p value). The sensitivity of the ensemble PDF to both metrics was assessed. The SSCORE metric produces values of comparable magnitude, so all members make a meaningful contribution to the final ensemble PDF. On the other hand, the TSSK p value is a steeper metric, so models with worse performance may have a very low contribution to the ensemble PDF.

Using different statistics calculated from the ensemble PDF, interpolated maps were constructed to analyse the maximum dry spell lengths (for p < 1 mm/d) in mainland Spain, and conclusions were drawn about the differences according to the metric considered.

The main difference between the results of the metrics lies in the difference in scale when assigning the weighting scores.

For R, the relationship between the values calculated with the SSCORE and with the TSSK p value is strong, but it weakens considerably when the normalized reliability factor Pm is estimated. In some zones of the study area, high Pm values correspond to low values of R. In this case, the high Pm value does not reflect a remarkable quality of fit; rather, it is the result of the poor adjustment of the whole set of RCMs, especially of those considered the best in other areas of the territory.

The SSCORE approach presents problems when the number of models participating in the ensemble is large (seventeen in this study), because it tends to compute similar weighting factors.

In conclusion, the TSSK p value approach is more parsimonious than the SSCORE approach in the building of ensemble RCMs based on AMDSL PDFs.


This work was developed in the framework of R&D Project CGL2008-02530/BTE, financed by the State Secretariat of Research of the Spanish Ministry of Science and Innovation (MICINN). The funding received is gratefully acknowledged.

Copyright information

© Springer-Verlag 2012