1 Introduction

The application of general circulation models (GCMs) driven by prescribed greenhouse gas (GHG) emission scenarios is currently the most common way to obtain physically based climate projections. Due to their coarse spatial resolution (typically 100–300 km horizontal grid spacing with an effective resolution of about 1,000 km), GCMs fail to properly represent many regional and local climate processes, such as orographic precipitation (McGregor 1997). In order to generate climate simulations with finer horizontal resolution, regional climate models (RCMs) are nested within the GCM large-scale atmospheric circulation over a limited area (Giorgi and Mearns 1991, 1999; McGregor 1997; Wang et al. 2004; Rummukainen 2010). The added value of RCMs in representing regional climate characteristics has been demonstrated in several studies (Jones et al. 1995; Laprise 2003; Castro et al. 2005; Buonomo et al. 2007; Feser et al. 2011).

Regional as well as global climate projections are subject to considerable uncertainties, which can be roughly divided into three components: (1) uncertainty due to natural variability, (2) uncertainty due to unknown future GHG emissions, and (3) uncertainty due to imperfect simulation of the climate system (Collins 2007). In order to analyse these uncertainties in RCM projections, large-scale European projects such as PRUDENCE (http://prudence.dmi.dk/; Christensen and Christensen 2007) and ENSEMBLES (http://ensembles-eu.metoffice.com/; van der Linden and Mitchell 2009) produced coordinated multi-model ensembles, which rapidly improved the knowledge about uncertainties in regional climate projections in the last decade. Furthermore, these simulations provided the basis for most investigations of regional climate change impacts over Europe in recent years.

Due to limited computational capacities, only a limited number of RCM simulations can be realised, and the experimental design determines which uncertainty components are primarily sampled within the ensemble. Missing realisations within RCM ensembles are therefore a common problem, and even simple ensemble estimates such as mean and variability are potentially biased due to unequal sampling of the uncertainty components. In order to avoid such biases, Déqué et al. (2007) introduced an iterative data reconstruction method which assumes additivity between uncertainty components in order to estimate the missing climate change signals (CCSs). This reconstruction method has been applied in several studies in order to obtain a balanced design for the analysis of variance components (Déqué et al. 2007; Heinrich and Gobiet 2011b; Prein et al. 2011; Déqué et al. 2012). However, as the method relies on an implicit formulation of the uncertainty components, it cannot be used to extend the ensemble to experiments outside of the original experimental design (e.g., to GCMs that have not been used as driver for any RCM in the ensemble). For such an extension, scaling techniques are widely applied (Mitchell et al. 1999; Mitchell 2003; Rummukainen et al. 2003; Harris et al. 2006; Hingray et al. 2007; Ruosteenoja et al. 2007). They were originally used to derive regional climate projections for time horizons or emission scenarios that had not been simulated by GCMs, by scaling the global mean temperature change of simple energy balance models with the geographical pattern of the GCM simulations (pattern scaling). In our case, we aim at predicting the RCM response from the driving GCM response by applying a scaling relationship to the regional CCSs. In this respect, it has already been shown that scaling has skill at the GCM grid point scale (Kendon et al. 2010).

The aim of our study is to assess and compensate for the potential GCM sampling bias in expected regional mean climate change and the associated uncertainty of the ENSEMBLES RCM projections by data reconstruction and combination with the much larger GCM ensemble of the third phase of the Coupled Model Intercomparison Project (CMIP3; Meehl et al. 2007). For this purpose, we first assess the skill of different statistical additive and scaling reconstruction methods in reproducing ensemble mean and standard deviation. We then apply the most appropriate reconstruction method to the sparsely filled ENSEMBLES simulation matrix and further extend the matrix to all available GCM simulations of the CMIP3 ensemble forced by the A1B emission scenario (Nakicenovic et al. 2000). As the final result, we present an update of expected regional climate change for air temperature and precipitation amount in eight European subregions until the mid-twenty-first century and reassess its uncertainty in the light of the extended ensemble.

The paper is structured as follows. Section 2 introduces the data and study regions. In Sect. 3 we describe the reconstruction methods. In Sect. 4 we explain the setup of the cross-validation and the statistical analysis. In Sect. 5 we present the cross-validation results. Section 6 provides a revision of expected regional climate change and its uncertainty over Europe, followed by Sect. 7 which sums up the key findings of this study.

2 Climate model data and study region

We use the RCM data from the ENSEMBLES project, which produced a set of 21 high-resolution RCM simulations with a horizontal grid spacing of about 25 km (Table 1). The ensemble consists of 8 GCMs and 16 RCMs, but due to limited computational resources, only a small fraction (16.4 %, i.e., 21 out of 128) of the possible GCM–RCM combinations could be realised. Subsampling mainly addressed uncertainty in boundary conditions (choice of the driving GCM) and RCM model formulation (Christensen et al. 2010). Since the choice of the GHG emission scenario is less important until the mid-twenty-first century (Hawkins and Sutton 2009, 2011; Prein et al. 2011), only the A1B emission scenario was used to force the climate simulations. As suggested by Christensen et al. (2010), we consider the three sensitivity experiments of HadCM3–HadRM3 as different model combinations, as their climate response is highly variable (Collins et al. 2006).

Table 1 The ENSEMBLES simulation matrix of the 25 km runs until 2050

Many of the RCM simulations (10 out of 21) were driven by only two GCMs, namely ECHAM5 and HadCM3Q0. All GCMs, except CGCM3.1 and IPSL-CM4, drive at least two RCMs. Three RCMs were forced by multiple GCMs, namely DMI-HIRHAM, METNO-HIRHAM, and SMHI-RCA. The driving GCM data of ENSEMBLES were obtained either from the ENSEMBLES database (http://ensemblesrt3.dmi.dk/) or from CMIP3 (https://esg.llnl.gov:8443/). As we aim at extending the uncertainty analysis to unknown GCMs, we also take into account all additional available GCM simulations of CMIP3 forced by the A1B emission scenario. Altogether, we have 53 GCM simulations for air temperature and 50 for precipitation amount from a set of 27 GCMs (precipitation is missing in the CMIP3 database for 3 out of 5 simulations of the GISS-ER model). Within this ensemble, 10 GCMs were run with several perturbed initial conditions and therefore sample uncertainty due to natural variability (see Table 1).

In order to be comparable to previous studies conducted within PRUDENCE and ENSEMBLES, we focus on the land grid points of eight European subregions according to Christensen and Christensen (2007): Iberian Peninsula (IP), Mediterranean (MD), France (FR), Middle Europe (ME), Alps (AL), Eastern Europe (EA), British Isles (BI), and Scandinavia (SC). In addition, we also provide information for Europe as a whole (EU). The selected subregions cover the diversity of main climate characteristics in Europe, ranging from arid climate conditions during summer in the southern European regions IP and MD to humid maritime climate characteristics in BI and SC (Heinrich and Gobiet 2011a). The focus is on seasonal mean CCSs, calculated as the difference between the future period 2021–2050 and the baseline period 1961–1990 (except for the GKSS-CLM model, which starts in 1963). The precipitation CCSs are calculated as relative changes with respect to the baseline period.
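The CCS definition above can be illustrated with a minimal Python sketch. The function name `climate_change_signal` and all input values are hypothetical and serve only as an illustration; they are not part of the ENSEMBLES processing chain.

```python
from statistics import fmean

def climate_change_signal(baseline, future, relative=False):
    """Seasonal mean CCS: future-period mean minus baseline-period mean.

    With relative=True the difference is expressed in percent of the
    baseline mean, as done in the text for precipitation amount.
    """
    base_mean = fmean(baseline)
    diff = fmean(future) - base_mean
    return 100.0 * diff / base_mean if relative else diff

# Hypothetical seasonal means (illustrative values, not model output)
t_base, t_fut = [1.0, 2.0, 3.0], [2.5, 3.0, 3.5]              # air temperature
p_base, p_fut = [200.0, 220.0, 180.0], [210.0, 230.0, 190.0]  # precipitation

print(climate_change_signal(t_base, t_fut))                 # absolute change (K)
print(climate_change_signal(p_base, p_fut, relative=True))  # relative change (%)
```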

3 Data reconstruction methods

As we aim at deriving RCM responses based on their associated driving GCMs, we first highlight the effect of RCMs on the CCSs of their driving GCMs. Figure 1 shows the mean difference in the seasonal mean CCSs between RCMs and their driving GCMs for the 21 available ENSEMBLES projections, for air temperature and precipitation amount. The climate model data have been resampled to a common grid of 3.75° longitude by 2.5° latitude, which is the most common grid among the ENSEMBLES GCMs. A more detailed description of the resampling technique can be found in Suklitsch et al. (2008).

Fig. 1 The mean difference in the seasonal CCSs between RCMs and associated driving GCMs of ENSEMBLES for air temperature (a) and precipitation amount (b). The CCSs are calculated between the two periods of 2021–2050 and 1961–1990. In each panel, top-left is winter (DJF), top-right is spring (MAM), bottom-left is summer (JJA), and bottom-right is autumn (SON)

Concerning air temperature, the RCMs feature smaller CCSs than their driving GCMs for most of Europe, with highly pronounced differences in spring and summer, particularly in Eastern and Southern Europe. At the subregional scale, EA and MD show the largest difference with −0.8 K in JJA (see Table S1 in the online supplementary material). For precipitation, the RCMs tend to feature larger CCSs (moister conditions) for large parts of Europe, particularly in JJA, with the most pronounced differences of +6.5 % in FR, +5.4 % in EA, and +5.1 % in ME.

3.1 Additive method

First, we consider the additive reconstruction method introduced by Déqué et al. (2007; hereafter D07). The reconstruction method is embedded in the framework of an analysis of variance (ANOVA) and neglects the highest interaction term in order to reconstruct the missing value. For the ENSEMBLES simulation matrix, the reconstruction algorithm reads as follows:

$$\Updelta X_{ij} = \Updelta X_{i \bullet } + \Updelta X_{ \bullet j} - \Updelta X_{ \bullet \bullet } = \Updelta X_{ \bullet \bullet } + (\Updelta X_{i \bullet } - \Updelta X_{ \bullet \bullet } ) + (\Updelta X_{ \bullet j} - \Updelta X_{ \bullet \bullet } ), $$
(1)

where \(\Updelta X\) denotes the CCS of an RCM for a specific subregion, i is the index of the RCM (\(i = 1,\, \ldots ,17\)), and j the index of the driving GCM (\(j = 1, \ldots ,8\)). The dot operator denotes averaging across the corresponding indices. The reconstruction algorithm can be understood intuitively: consider RCM1 driven by a set of GCMs and RCM2 driven by the same GCMs except one. The missing value of RCM2 is then reconstructed by adding the mean difference between RCM2 and RCM1 (over the GCMs they share) to the value of RCM1 for the missing GCM. Since the reconstruction of the missing values depends on the grand mean of the entire simulation matrix, which in turn depends on the reconstructed values, 30 iterations are performed (Déqué et al. 2007).
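The iteration of Eq. 1 might be sketched as follows in Python. This is a minimal illustration, not the implementation of Déqué et al. (2007): the function name `reconstruct_additive`, the starting value for the missing entries (mean of all known values), and the example matrix are assumptions made here.

```python
def reconstruct_additive(matrix, n_iter=30):
    """Fill None entries with row mean + column mean - grand mean (Eq. 1).

    matrix: list of rows (RCMs), each a list over columns (GCMs);
    None marks a missing GCM-RCM combination.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    known = [v for row in matrix for v in row if v is not None]
    start = sum(known) / len(known)  # assumption: start from mean of known CCSs
    filled = [[start if v is None else v for v in row] for row in matrix]
    missing = [(i, j) for i in range(n_rows) for j in range(n_cols)
               if matrix[i][j] is None]
    for _ in range(n_iter):
        # Means are taken over the currently filled matrix
        row_mean = [sum(r) / n_cols for r in filled]
        col_mean = [sum(filled[i][j] for i in range(n_rows)) / n_rows
                    for j in range(n_cols)]
        grand = sum(row_mean) / n_rows
        for i, j in missing:  # known values are never overwritten
            filled[i][j] = row_mean[i] + col_mean[j] - grand
    return filled

# Hypothetical 3 x 3 matrix of CCSs (K) with one missing combination
ccs = [[1.0, 1.5, 2.0],
       [2.0, 2.5, 3.0],
       [3.0, 3.5, None]]
print(reconstruct_additive(ccs)[2][2])
```

In this example the known values are exactly additive, so the iteration converges towards the value implied by the row and column effects. With real data, the neglected interaction term shows up as reconstruction error.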

3.2 Scaling methods

The main disadvantage of the D07 method is that it cannot be used for an extension to unknown GCMs, as there is no explicit formulation between the CCSs of RCMs and their associated driving GCMs. For such an extension, scaling techniques are widely applied and in this study we assess the capability of different scaling methods in reconstructing the RCM projections.

Scaling can be generally formulated as:

$$\Updelta X_{ij} = \Updelta Y_{j} \cdot \tilde{k} + \tilde{d} + \varepsilon_{ij} , $$
(2)

where \(\Updelta Y_{j}\) denotes the CCS of the driving GCM j, and \(\tilde{k}\) and \(\tilde{d}\) are the slope and intercept of a linear least-squares fit to the data. In order to account for the downscaling effect which is specific to a particular RCM, we introduce the additive term \(\varepsilon_{ij}\). In this study, three different formulations of the RCM-specific downscaling effect \(\varepsilon_{ij}\) are investigated: (1) there is no RCM-specific downscaling effect, i.e., \(\varepsilon_{ij} = 0\) (SCA1), (2) the RCM-specific downscaling effect is the mean RCM-specific residual of the least-squares fit (SCA2), (3) the RCM-specific downscaling effect is randomly sampled from a normal distribution with zero mean and standard deviation estimated from the regression residuals of all RCMs (SCA3).

In the literature, the regression line is typically forced to pass through the origin (intercept \(\tilde{d} = 0\) in Eq. 2), which is argued to be reasonable if global mean temperature change is used as predictor for the regional climate response (e.g., Hingray et al. 2007; Ruosteenoja et al. 2007). However, the relevance of this assumption has not been assessed so far for a scaling relationship at the GCM scale and, therefore, we also consider a scaling relationship without intercept (SCA0). SCA0 can be regarded as a degenerate linear regression with zero intercept, while scaling (Eq. 2) is a full linear regression with varying intercept and slope. We also note that RCM-specific random sampling is not feasible, as the majority of RCMs are driven by only a single GCM (see Table 1). In addition, the insufficient sample size at the RCM-specific level does not allow for applying a hierarchical linear model (e.g., Gelman and Hill 2009) in order to estimate RCM-specific regression lines.

Furthermore, we note that the random sampling approach SCA3 is embedded in the framework of multiple imputation (MI; Rubin 1987; Little and Rubin 2002), which aims at generating a set of multiple plausibly reconstructed data sets. Common statistical analysis is then applied to each individual data set, generating a set of parameter estimates of interest. The final parameter estimate is then achieved by simply taking the average of the individual realizations (Rubin 1987). As we are dealing with a large fraction of missing values, we generate a set of 1,000 reconstructed data sets throughout the study in order to ensure convergence of random sampling.
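The four scaling variants of Eq. 2 can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the names `fit_line` and `predict`, the tuple layout of the training data, the use of the sample standard deviation (n − 1 denominator) for the residuals, and all numbers are hypothetical and not taken from the study.

```python
import random
from statistics import fmean, stdev

def fit_line(x, y, intercept=True):
    """Least-squares slope k and intercept d; d is fixed at 0 if intercept=False."""
    if not intercept:
        return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x), 0.0
    mx, my = fmean(x), fmean(y)
    k = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return k, my - k * mx

def predict(gcm_ccs, rcm, train, method="SCA1", rng=random):
    """Predict an RCM CCS from a GCM CCS.

    train: list of (rcm_name, gcm_ccs, rcm_ccs) tuples.
    method: "SCA0" (no intercept), "SCA1", "SCA2", or "SCA3".
    """
    x = [t[1] for t in train]
    y = [t[2] for t in train]
    k, d = fit_line(x, y, intercept=(method != "SCA0"))
    residuals = [(name, yr - (k * xg + d)) for name, xg, yr in train]
    if method == "SCA2":
        # Mean residual of this RCM; the RCM must appear in the training data
        eps = fmean([r for name, r in residuals if name == rcm])
    elif method == "SCA3":
        # Random draw with the spread of the residuals of all RCMs
        eps = rng.gauss(0.0, stdev([r for _, r in residuals]))
    else:
        eps = 0.0
    return k * gcm_ccs + d + eps

# Hypothetical training data: (RCM name, GCM CCS, RCM CCS) in K
train = [("A", 1.0, 0.9), ("A", 2.0, 1.7), ("B", 1.5, 1.5), ("C", 2.5, 2.1)]
rng = random.Random(0)
draws = [predict(2.0, "B", train, "SCA3", rng) for _ in range(1000)]
print(predict(2.0, "B", train, "SCA1"), fmean(draws))
```

Averaging the 1,000 SCA3 draws mirrors the multiple-imputation idea: each draw yields one plausible reconstruction, and the final estimate is the mean over all draws.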

4 Design of the cross-validation and statistical analysis

4.1 Cross-validation

In order to compare the skill of the different reconstruction methods, we follow a cross-validation (CV) strategy. CV is often used to estimate the predictive skill of statistical models in application and has the further advantage that it can be applied to small sample sizes as it does not rely on asymptotic theory. In this study, we apply a leave-one-out cross-validation (LOOCV) method by fitting the statistical model to a training data set which consists of leaving out a data point from the original sample. This data point is then predicted by the statistical model and the procedure is repeated for each single data point in the sample, generating a completely reconstructed data set. As the focus of our study is on the estimation of expected seasonal mean climate change and its uncertainty, the LOOCV is used to assess the predictive skill of the reconstruction methods in estimating ensemble mean and standard deviation by comparing their match between original and reconstructed sample. Although an unbiased estimation of the ensemble mean is an important prerequisite for the further analysis of this study, the mean errors might compensate among the predictions and, therefore, the root mean square deviations (RMSD) between original and reconstructed CCSs are also assessed.

CV of the methods which are based on RCM-specific information (D07 and SCA2) requires RCMs which are driven by more than one GCM. Unfortunately, only a subset of eight simulations from three RCMs was forced by multiple GCMs in ENSEMBLES (see Table 1), and the corresponding LOOCV results have to be regarded as a rather rough estimate of the skill due to the small sample size. However, the LOOCV based on this subset is the only way to compare the skill of all implemented reconstruction methods. All other reconstruction methods allow for a LOOCV based on the full sample of 21 simulations.

Concerning the LOOCV, we follow two strategies. First, we compute a LOOCV which is based on leaving out the CCS of a single RCM (LOOCV-RCM). As most of the GCMs are used as driving data for at least two RCMs, the entire GCM information is available for predicting the RCM change in most cases. Therefore, this strategy can be regarded as measuring the performance of deriving the missing CCSs of the ENSEMBLES simulation matrix (see Table 1). However, it might be argued that the influence of a single data point decreases as the sample size increases. In order to take this into account and as we also aim at extending the ENSEMBLES simulation matrix to unknown GCMs, we additionally apply a more stringent cross-validation which is achieved by leaving out the CCSs of a driving GCM and all associated RCMs as second LOOCV strategy (LOOCV-GCM). Compared to the LOOCV-RCM, the sample size of the training data sets for each of the LOOCV-GCM predictions is generally smaller. Furthermore, we note that the LOOCV-GCM is asymmetric in the sense that the number of neglected RCM simulations is GCM-dependent (up to almost 25 % of the data is neglected in case of ECHAM5 and HadCM3Q0) which consequently increases the independence among the training samples since the entire information of the driving GCM is removed for each of the RCM predictions.
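The two cross-validation strategies differ only in what is withheld from the training data, which a short Python sketch can make concrete. The record layout, the function names `fit_predict`, `loocv`, and `skill`, the choice of an SCA1-type predictor, and the example values are assumptions made for illustration only.

```python
from math import sqrt
from statistics import fmean, stdev

def fit_predict(train, x_new):
    """SCA1-type prediction from a least-squares line through the training data."""
    x = [r["gcm_ccs"] for r in train]
    y = [r["rcm_ccs"] for r in train]
    mx, my = fmean(x), fmean(y)
    k = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my + k * (x_new - mx)

def loocv(records, by_gcm=False):
    """Leave-one-out predictions, one per simulation.

    by_gcm=False: withhold a single simulation (LOOCV-RCM).
    by_gcm=True:  withhold all simulations of the same driving GCM (LOOCV-GCM).
    """
    preds = []
    for idx, rec in enumerate(records):
        if by_gcm:
            train = [r for r in records if r["gcm"] != rec["gcm"]]
        else:
            train = [r for i, r in enumerate(records) if i != idx]
        preds.append(fit_predict(train, rec["gcm_ccs"]))
    return preds

def skill(records, preds):
    """Differences in ensemble mean and standard deviation, and the RMSD."""
    obs = [r["rcm_ccs"] for r in records]
    return {
        "mean_diff": fmean(preds) - fmean(obs),
        "sd_diff": stdev(preds) - stdev(obs),
        "rmsd": sqrt(fmean([(p - o) ** 2 for p, o in zip(preds, obs)])),
    }

# Hypothetical simulations (CCSs in K)
records = [
    {"rcm": "R1", "gcm": "G1", "gcm_ccs": 1.0, "rcm_ccs": 0.8},
    {"rcm": "R2", "gcm": "G1", "gcm_ccs": 1.0, "rcm_ccs": 1.0},
    {"rcm": "R1", "gcm": "G2", "gcm_ccs": 1.6, "rcm_ccs": 1.3},
    {"rcm": "R3", "gcm": "G2", "gcm_ccs": 1.6, "rcm_ccs": 1.5},
    {"rcm": "R4", "gcm": "G3", "gcm_ccs": 2.2, "rcm_ccs": 1.9},
    {"rcm": "R2", "gcm": "G4", "gcm_ccs": 1.3, "rcm_ccs": 1.2},
]
print(skill(records, loocv(records)))
print(skill(records, loocv(records, by_gcm=True)))
```

Note that in the LOOCV-GCM case the training sample shrinks by the number of RCMs driven by the withheld GCM, which is the asymmetry described above.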

4.2 Statistical analysis

4.2.1 Statistical significance

After reconstructing the missing values of the simulation matrix, we assess the statistical significance of the differences in expected seasonal mean climate change and the associated uncertainty between the original and the reconstructed ensemble. Statistical tests for differences in the mean generally require equal variances (Behrens–Fisher problem). However, this cannot be assumed here, as the reconstruction of the missing CCSs potentially changes the spread of the ensemble. Therefore, we apply a t test for unequal variances (Welch's t test) under the null hypothesis that the ensemble means of the original and the reconstructed ensemble are equal. The application of the t test requires independent and normally distributed samples. As we are dealing with rather small sample sizes, it is difficult to assess the normality of the data. However, the assumption of normality in our study is supported by the central limit theorem, since quantities averaged multiple times are used. The statistical significance of the differences in the ensemble spread is assessed by applying the robust Fligner–Killeen test (Conover et al. 1981) under the null hypothesis that the ensemble variances of the original and the reconstructed ensemble are equal.

In order to eliminate possible dependencies among the RCM simulations due to their driving GCMs, we average across the RCMs according to their driving GCMs. For the same reason, we average across the different GCM runs of the CMIP3 database after reconstructing the RCM changes. Therefore, the sample sizes of original and reconstructed simulation matrices reduce to the number of GCMs indicated in Table 1 (8 for the ENSEMBLES simulation matrix and 27 for the extended matrix).

Significance levels lower than 3 %, from 3 to 5 %, from 5 to 10 %, and greater than 10 % are termed strongly significant, significant, weakly significant, and insignificant, respectively.
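The averaging by driving GCM and the Welch statistic can be sketched as follows. The function names and the example values are hypothetical. The sketch stops at the test statistic and the Welch–Satterthwaite degrees of freedom; obtaining a p value additionally requires the t distribution, for which a statistics library would normally be used (for example, SciPy provides `scipy.stats.ttest_ind` with `equal_var=False` and `scipy.stats.fligner`).

```python
from collections import defaultdict
from math import sqrt
from statistics import fmean, variance

def average_by_gcm(records):
    """Average CCSs over all simulations sharing the same driving GCM.

    records: list of (gcm_name, ccs) pairs; returns one mean per GCM.
    """
    groups = defaultdict(list)
    for gcm, ccs in records:
        groups[gcm].append(ccs)
    return [fmean(v) for v in groups.values()]

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (fmean(a) - fmean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical CCSs (K): original versus reconstructed ensemble
original = average_by_gcm([("G1", 1.2), ("G1", 1.0), ("G2", 1.8), ("G3", 1.4)])
reconstructed = average_by_gcm([("G1", 1.1), ("G2", 1.6), ("G3", 1.3), ("G4", 1.5)])
print(welch_t(original, reconstructed))
```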

4.2.2 Quantification of uncertainty

The uncertainties of the projected changes are quantified by two measures. Firstly, we calculate the ensemble standard deviation. As its calculation is based on rather small sample sizes in this study, we apply a minor bias correction as proposed in Knutti et al. (2010). Secondly, we calculate the percentage of models which agree in the sign of change as a non-parametric uncertainty measure. Applying the confidence terminology defined by the Intergovernmental Panel on Climate Change (IPCC; Solomon et al. 2007; Mastrandrea et al. 2010), very high, high, and medium confidence are obtained if at least 90 %, at least 80 %, and at least 50 % of the models agree in the sign of the multi-model mean change, respectively.
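The sign-agreement measure and the confidence thresholds can be written down directly. The function names and the label "none" for agreement below 50 % are assumptions of this sketch; the thresholds are those stated above.

```python
def sign_agreement(ccs):
    """Percentage of ensemble members whose CCS has the sign of the ensemble mean."""
    mean = sum(ccs) / len(ccs)
    same = sum(1 for c in ccs if c * mean > 0)
    return 100.0 * same / len(ccs)

def confidence(pct):
    """Map the percentage of agreeing models to a confidence label."""
    if pct >= 90:
        return "very high"
    if pct >= 80:
        return "high"
    if pct >= 50:
        return "medium"
    return "none"  # assumed label for agreement below 50 %

# Hypothetical precipitation CCSs (%) of eight ensemble members
ccs = [2.1, 3.4, -0.5, 1.2, 4.0, 0.8, 2.7, 1.9]
pct = sign_agreement(ccs)
print(pct, confidence(pct))
```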

5 Results of the cross-validation

Figures 2 and 3 depict the results of the LOOCV for air temperature and precipitation amount, respectively. The spread of the box-whisker plots displays variability among subregions. For comparative purposes, we only show the results of LOOCV-RCM based on the subset of RCM simulations driven by multiple GCMs which includes all reconstruction methods. All additional LOOCV results based on LOOCV-GCM and the full sample are provided in the online supplementary material, Fig. S1 to Fig. S6.

Fig. 2 Results of the LOOCV-RCM for air temperature based on the subset of RCMs which are driven by multiple GCMs. Displayed are the differences between reconstructed and original ensemble mean (a) and standard deviation (b), and the associated root mean square deviations (c). The spread of the box-whisker plots shows variability among subregions and displayed are the 10th, 25th, 50th, 75th, and 90th percentile

Fig. 3 Same as Fig. 2 but for precipitation amount

For air temperature, the largest differences between reconstructed and original ensemble mean are obtained for D07 concerning all seasons. However, the bias is rather small, showing mostly an underestimation of the ensemble mean. The largest bias of D07 is obtained in DJF with −0.06 K in the median and ranging from −0.13 K to −0.04 K. SCA0 shares this negative bias in DJF, but with a lower magnitude of −0.01 K in the median. For all other methods, biases in both directions are obtained with magnitudes less than 0.02 K in the median. The differences between SCA1 and SCA3 in the ensemble mean can be related to finite sampling of the residuals and converge to zero as the number of random draws increases. For the ensemble standard deviation, D07 generally shows an underestimation which is largest in DJF with −0.23 K in the median. Although the scaling methods generally perform better in reconstructing the ensemble standard deviation, underestimated variability is obtained as expected for the scaling relationships without random sampling of the residuals. Only the randomized scaling approach SCA3 is able to compensate for the underestimated standard deviation as additional variability is introduced due to random sampling of the residuals. The RMSD is generally lower for the scaling methods. The difference in the RMSD between SCA1 and SCA3 can be explained by the fact that additional variability is introduced due to random sampling of the residuals. More specifically, the difference in the squared RMSD between SCA1 and SCA3 can be calculated analytically and is given by the mean variance of the residuals among the LOOCV samples. All reconstruction methods show a seasonal dependency of the RMSD with the best and worst skill in DJF and SON, respectively. The differences between the various scaling methods are generally small (especially for SCA0 and SCA1).
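The analytical relation between the RMSD of SCA1 and SCA3 can be made explicit with a short derivation. It is sketched here under the assumption that each residual draw \(e_{n}\) has zero mean and variance \(\sigma_{n}^{2}\) and is independent of the SCA1 prediction error; the symbols \(\widehat{\Updelta X}_{n}\) (SCA1 prediction for LOOCV sample n), \(e_{n}\), \(\sigma_{n}\), and N (number of LOOCV samples) are introduced only for this illustration. For a single LOOCV sample, the expected squared deviation of the SCA3 prediction is

$$E\left[ (\widehat{\Updelta X}_{n} + e_{n} - \Updelta X_{n})^{2} \right] = (\widehat{\Updelta X}_{n} - \Updelta X_{n})^{2} + \sigma_{n}^{2},$$

because the cross term vanishes for zero-mean \(e_{n}\). Averaging over all N samples gives

$$\text{RMSD}_{\text{SCA3}}^{2} - \text{RMSD}_{\text{SCA1}}^{2} = \frac{1}{N}\sum\limits_{n = 1}^{N} \sigma_{n}^{2},$$

which holds in expectation and is approached as the number of random draws increases.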

For precipitation amount, D07 tends to overestimate (underestimate) the ensemble mean in DJF and JJA (MAM and SON). The biases are rather small with a peak magnitude in SON with −0.3 % in the median. For the scaling methods, the ensemble mean is underestimated in MAM and SON. SCA0 generally shows large biases in combination with rather large ranges among the subregions. The largest bias of SCA0 is found in SON with a median difference of −0.7 % and ranging from −3.0 % to +1.0 %. For the ensemble standard deviation, D07 again tends to underestimate the ensemble spread with the largest difference in the median of −1.8 % in SON. The scaling methods partly show a pronounced underestimation of variability. The largest difference in the median is obtained for SCA0 and SCA1 in JJA with a magnitude of −3.3 and −2.0 %, respectively. As already shown for air temperature, SCA3 again shows the overall best performance in reproducing the ensemble spread as random sampling of the residuals increases variability of the scaling relationship and consequently compensates for the underestimated ensemble standard deviation. For the RMSD, the scaling reconstruction methods consistently reveal the best skill in DJF while D07 shows the lowest RMSD in SON. The differences in the RMSD between SCA0 and SCA1 are again small. The worst skill for all reconstruction methods is obtained in JJA, peaking up to 8.3 % for D07. Here, the rather large difference between SCA1 and SCA3 can be related to large additional variability introduced by random sampling of the residuals and reasonably indicates large variability of the summer precipitation CCSs among the different RCM projections (e.g., Déqué et al. 2012).

The results of the second cross-validation strategy LOOCV-GCM, which is based on leaving out the CCSs of a driving GCM and all associated RCMs, underpin the results reported above. Extending the LOOCV to the entire sample, rather large negative precipitation biases are obtained for SCA0 in all seasons. Concerning the ensemble spread, SCA0 and SCA1 mostly underestimate variability and the superiority of SCA3 in reconstructing the ensemble spread is further confirmed (see online supplementary material, Fig. S1 to Fig. S6).

Based on the results of the LOOCV, we choose SCA3 as the preferred reconstruction method, especially due to its overall best performance in reconstructing ensemble mean and variability. Table 2 summarises the results of the two LOOCV strategies for SCA3 based on the entire sample as further applied in the study. As expected, the results of the more stringent LOOCV-GCM strategy indicate a worse performance than the LOOCV-RCM strategy, which can be related to the smaller size of the training samples and the asymmetric character of LOOCV-GCM. However, both CV strategies reveal differences in the ensemble mean and spread that are mostly well below 0.1 K and 1.0 % for air temperature and precipitation amount, respectively. As these values provide a measure of the predictive skill of our statistical reconstruction method, we expect rather small errors introduced by the statistical model in prediction mode. As already discussed above, the RMSD values are inflated by additional variability due to random sampling of the residuals. The coefficient of determination (\(R^{2}\)) is generally larger for air temperature than for precipitation amount, ranging from below 0.1 for precipitation amount for MD in JJA up to 0.9 for air temperature in several subregions and seasons. However, small \(R^{2}\) values are generally not related to large biases and/or differences in the ensemble spread, reflecting a potential instability of \(R^{2}\) due to the rather small sample size. Figure 4 displays the linear fits between the CCSs of RCMs and associated driving GCMs as further applied in the study. As can be seen, intercept and slope both vary across subregions and seasons, further underpinning that the formulation of the zero-intercept model (SCA0) may not be appropriate. This is generally in agreement with the results of the LOOCV. For example, air temperature reveals a distinct non-zero intercept in MAM for all subregions, which consequently results in poor performance of SCA0 for both LOOCV strategies (see online supplementary material, Fig. S3 and Fig. S5).

Table 2 Results of the cross-validation for SCA3

Fig. 4 Linear least-square fits between the seasonal CCSs of RCMs and associated driving GCMs for air temperature (a) and precipitation amount (b). In each panel, top-left is DJF, top-right is MAM, bottom-left is JJA, and bottom-right is SON

6 Revision of expected regional climate change and its uncertainty over Europe

Table 3a and b show the differences in ensemble mean and standard deviation between original and filled ENSEMBLES simulation matrix for air temperature and precipitation amount, respectively. For air temperature, the differences in the ensemble mean are mostly negative in DJF, JJA, and SON, indicating slightly larger CCSs of the original ensemble, while in MAM larger mean values of the filled ensemble are obtained for all subregions. However, the differences are rather small and insignificant, with maximum differences in the order of ±0.2 K. For the ensemble standard deviation, mostly lower values for the filled ensemble are obtained. Exceptions are EA in JJA and BI in DJF and the differences are again small and insignificant, peaking up to −0.5 K for SC in MAM. For precipitation amount, the differences in the ensemble mean vary across seasons and subregions. As for air temperature, the differences are again small and insignificant, with a maximum difference of +1.1 % for AL in DJF. Concerning the ensemble spread, mostly lower standard deviations for the filled ensemble are obtained. Exceptions are SC in DJF as well as FR and ME in MAM. The differences are again rather small and insignificant, peaking up to −4.4 % for FR in JJA.

Table 3 Ensemble seasonal mean changes (upper rows of each panel) and standard deviations (lower rows of each panel) between 2021–2050 and 1961–1990 of original and filled ENSEMBLES simulation matrix (panel a and b, respectively) and of original and extended ENSEMBLES simulation matrix (panel c and d, respectively)

Table 3c and d show the differences in ensemble mean and standard deviation between original and extended ensemble concerning all available GCM simulations of CMIP3 forced by the A1B emission scenario. For air temperature, the ensemble mean of the extended ensemble is mostly reduced, showing the largest differences in EA in JJA with −0.4 K. However, no statistical significance for the changes is obtained. Concerning the ensemble spread, the extended ensemble generally shows a reduction with the most pronounced changes in EA and SC in MAM with −0.5 K. Again, the changes are mostly insignificant. For precipitation amount, the sign of the difference in the ensemble mean is varying among subregions and seasons (only in SON mostly lower CCSs of the extended ensemble are obtained). The largest difference is −4.2 % in IP in DJF and the changes are again mostly insignificant. Concerning the ensemble spread, mostly lower ensemble standard deviations for the extended ensemble are obtained. The most pronounced change is obtained in AL in JJA with −3.3 % and the changes are again mostly insignificant.

We note that the main characteristics of the geographical pattern of the CCSs are not deteriorated through statistical reconstruction (also see online supplementary material, Fig. S7 and Fig. S8). These patterns include large spatial differences of winter air temperature change, which are explained by moderate warming of the ocean influencing the maritime climate of western Europe in combination with altered snow-albedo feedback mechanisms in northern and eastern Europe (Rowell 2005); high summer air temperatures in the south, which are related to an earlier and more rapid reduction of soil moisture in spring (e.g., Wetherald and Manabe 1995; Gregory et al. 1997); and a dipolar pattern of the precipitation change (decrease in the south and increase in the north), which can be related to a seasonally dependent northward shift of the mid-latitude storm track, identified as the European Climate change Oscillation (ECO) (Giorgi and Coppola 2007). Although the focus of this study is on near-term projections until the mid-twenty-first century, we stress that the projected changes are expected to further increase after the 2050s (e.g., Heinrich et al. 2013). In this respect, the projected precipitation changes are expected to further intensify until the end of the twenty-first century, with the most pronounced intensifications for the hot-spot regions in southern and northern Europe. Furthermore, the confidence of the projected precipitation changes generally increases until the end of the twenty-first century. However, low confidence remains especially along the transition zone from drier conditions in southern Europe to wetter conditions in northern Europe.

Figure 5 depicts maps of the confidence levels of the precipitation changes for the original, filled, and extended ENSEMBLES simulation matrix. Maps for air temperature are not shown, since very high confidence of warming is achieved in all subregions with and without reconstruction (see online supplementary material, Fig. S9). The confidence of the projected precipitation changes is partly reduced for the filled ENSEMBLES simulation matrix in DJF, MAM, and SON, revealing overconfident projections of the original ensemble. Concerning the extended ENSEMBLES simulation matrix, both lower and higher confidence levels are obtained, revealing under- as well as overconfident projections of the original ensemble. For example, higher confidence levels are obtained for the changes in MD in SON, while in BI, EA, IP, and ME the confidence is reduced. However, the differences between the original, filled, and extended ENSEMBLES ensemble are generally small, and the overall picture of high confidence in the northernmost and southernmost European regions with low confidence in between is further underpinned.

Fig. 5

Confidence of the projected precipitation changes between 2021–2050 and 1961–1990 for the original (a), filled (b), and extended (c) ENSEMBLES simulation matrix. Green, yellow, and red colours display very high confidence, high confidence, and medium or no confidence, respectively. The numbers indicate the percentage of models which agree in the sign of the ensemble mean. Percentages lower than 50 % indicate skewed distributions. In each panel, top-left is DJF, top-right is MAM, bottom-left is JJA, and bottom-right is SON
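The sign-agreement percentage given in the caption can be illustrated with a minimal sketch. It assumes only what the caption states, namely that the number is the share of ensemble members whose change has the same sign as the ensemble mean. The function name `sign_agreement` and the input values are hypothetical, and the thresholds that map percentages to the colour classes are not reproduced here.

```python
import numpy as np

def sign_agreement(ccs):
    """Percentage of ensemble members whose climate change signal (CCS)
    has the same sign as the ensemble mean CCS.

    `ccs` is a 1-D sequence holding one CCS per ensemble member for a
    given subregion and season (hypothetical input layout).
    """
    ccs = np.asarray(ccs, dtype=float)
    mean_sign = np.sign(ccs.mean())
    # Boolean mask of members agreeing with the mean sign, averaged to a share
    return 100.0 * np.mean(np.sign(ccs) == mean_sign)

# Hypothetical precipitation changes (%) for four ensemble members
print(sign_agreement([2.0, 3.0, 1.0, -0.5]))    # 75.0

# A skewed case: one large positive outlier sets the sign of the mean,
# so fewer than half of the members agree with it
print(sign_agreement([-1.0, -1.0, -1.0, 10.0]))  # 25.0
```

The second call shows why percentages below 50 % point to skewed distributions: the ensemble mean takes the sign of a few large values, while most members have the opposite sign.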

7 Summary and conclusions

This study assesses and compensates for the GCM sampling bias in the expected regional mean climate change and the associated uncertainty of the ENSEMBLES RCM projections by combining them with the full set of the CMIP3 GCM ensemble. The focus is on eight European subregions, for which the seasonal mean changes of the two key climate variables, air temperature and precipitation amount, until the mid-twenty-first century are assessed.

In order to underpin the importance of reconstructing RCM responses, we first highlighted the impact of RCMs on the CCSs of their driving GCMs and showed that downscaling generally leads to less warming (up to a reduction of 30 % of the GCM CCS) and partly to more precipitation over Europe than projected by GCMs. We introduced and evaluated various statistical data reconstruction methods which mimic these effects and allow for an extension of the RCM ensemble to additional driving GCMs. The most appropriate reconstruction method, a randomized scaling approach with errors in ensemble mean and standard deviation mostly less than 0.1 K and 1.0 % for air temperature and precipitation amount, respectively, was applied to fill the missing values of the ENSEMBLES simulation matrix and to further extend the matrix by all available CMIP3 GCM simulations forced by the A1B emission scenario. Differences between the original, filled, and extended ENSEMBLES simulation matrix were assessed in order to identify potential ensemble biases and improperly estimated uncertainty ranges due to the GCM sampling strategy of ENSEMBLES. The key findings of the intercomparison can be summarised as follows: (1) The estimated mean climate change for air temperature and precipitation amount over Europe is not significantly altered by reconstruction in almost all seasons and subregions. (2) The estimated uncertainty is generally not increased by extension of the ENSEMBLES simulation matrix to the entire CMIP3 ensemble. In some seasons and subregions it is even significantly reduced. We note that no weighting regarding model performance was applied in our study. Although our framework would allow for the incorporation of such weights, the benefits of performance weighting in constraining the uncertainty of future projections could not yet be sufficiently demonstrated (e.g., Knutti et al. 2010; Déqué and Somot 2010).

From the results of our analysis we conclude that neither the expected mean climate change nor the associated uncertainty of the ENSEMBLES RCM projections is underestimated as a result of using only a few driving GCMs. On the contrary, the spread of the extended ensemble is in some cases significantly lower than that of the original ensemble. Therefore, this study substantially adds to the reliability of numerous recent climate change impact studies over Europe which use the full range or a carefully selected subset of the ENSEMBLES projections (e.g., Heinrich and Gobiet 2011a; Finger et al. 2012), since it confirms that the ensemble is not significantly biased and the uncertainty is not underestimated due to GCM sub-sampling.