Streamflow-based evaluation of climate model sub-selection methods

The assessment of climate change and its impact relies on the ensemble of models available and/or sub-selected. However, an assessment of the validity of simulated climate change impacts is not straightforward because historical data is commonly used for bias-adjustment, to select ensemble members, or to define a baseline against which impacts are compared, and, naturally, there are no observations to evaluate future projections. We hypothesize that historical streamflow observations contain valuable information to investigate practices for the selection of model ensembles. The Danube River at Vienna is used as a case study, with EURO-CORDEX climate simulations driving the COSERO hydrological model. For each selection method, we compare observed to simulated streamflow shift from the reference period (1960–1989) to the evaluation period (1990–2014). Comparison against no selection shows that an informed selection of ensemble members improves the quantification of climate change impacts. However, the selection method matters: model selection based on hindcasted climate or streamflow alone is misleading, while methods that maintain the diversity and information content of the full ensemble are favorable. Prior to carrying out climate impact assessments, we propose splitting the long-term historical data and using it to test climate model performance, sub-selection methods, and their agreement in reproducing the indicator of interest, which further provides the expectable benchmark of near- and far-future impact assessments. This test is well-suited to be applied in multi-basin experiments to obtain better understanding of uncertainty propagation and more universal recommendations regarding uncertainty reduction in hydrological impact studies.


Introduction
It is a common practice to analyze multiple ensemble members for climate change impact studies, which is important to account for the variability and uncertainty in the projections (Melsen et al. 2018; Krysanova et al. 2017). Use of a large model ensemble of climate projections is favored for impact modeling with the aim to quantify the inherent uncertainties in the projections (Clark et al. 2016; Pechlivanidis et al. 2017; Samaniego et al. 2017). However, in impact studies, the use of a large number of climate models can be computationally and methodologically intensive, while a smaller number of climate-impact models may be necessary to provide users with a manageable set for decision-making (Evans et al. 2013). Therefore, it is not always possible to present all available ensemble members, e.g., when complex model cascades are used (Kiesel et al. 2019a) or when multiple other sources of uncertainty are of interest as well (Clark et al. 2016). While the practice of sub-selecting large model ensembles has been criticized in the past (Mote et al. 2011; Christensen et al. 2010), it is now a generally accepted approach (Eyring et al. 2019; Herger et al. 2018; Knutti et al. 2017). This is mainly due to the fact that informed model sampling and reduction of informational redundancy in the ensemble are expected to improve climate change impact assessments (Eyring et al. 2019; Pechlivanidis et al. 2018). A number of methods have been proposed to deal with the multi-ensemble problem, e.g., selections based on best-performing climate depiction (Ruane and McDermid 2017); best representation of the target variable, e.g., streamflow (Kiesel et al. 2019b); keeping model diversity and independence (Abramowitz et al. 2019); weighted ensemble members driven by the model performance (Knutti et al. 2017); and trading-off the information content, gain, and redundancy in the ensemble set (Pechlivanidis et al. 2018).
Whether to exclude or weigh the available models according to different criteria still remains a subjective exercise. Eyring et al. (2019) provide a recent overview of suitable measures to assess the climate model performance in order to exclude and/or weigh the available models. However, to date no comparison of these selection methods using historical climate change impacts has been performed. This is significant since the effect on climate impact results depends on the selection method, i.e., exclusion/weighting method, available models, and performance criteria used for setting the weights (Knutti et al. 2010; Evans et al. 2013).
A comparison of selection methods is possible if evaluation data are available. Such data have to include the combined influence of meteorological forcing, spatial heterogeneity of the climate model's performance (Kotlarski et al. 2014), and the long time periods over which change occurs. Streamflow of large and complex (e.g., alpine) river basins integrates climatic factors due to the spatiotemporal interaction between meteorological and hydrological processes. Even though future climate change is expected to be more pronounced than what has already been observed, from a hydrological viewpoint, climate change has already had an observable impact in various basins (Blöschl et al. 2017). We argue that historic streamflow observations can be used as an evaluation criterion to assess (1) hindcasted climate model skill and (2) ensemble selection methods. Such an evaluation requires the simulation of streamflow based on hindcasted temperature and precipitation data from climate models. On the one hand, this simplifies the climate model assessment from two parameters (precipitation, temperature) to one (streamflow), while on the other hand, this also adds an additional layer of uncertainty: the hydrological model and its associated data (Clark et al. 2016). Pechlivanidis et al. (2017) and Thober et al. (2018) show that uncertainty (both related to climate and hydrological models) depends on the climatic conditions and is generally higher in the dry than in the wet regions, while uncertainty in high (low) flows is related more (slightly less) to the climate models compared with the hydrological models. Melsen et al. (2018) come to similar conclusions, but at the same time show that hydrological model choice and parameterization can influence the direction of the projected change.
Therefore, attributing the simulated streamflow change signal clearly to a change in climate either requires the consideration of the hydrological model and parameterization uncertainty in the analysis (Addor et al. 2014; Melsen et al. 2018; Vetter et al. 2017) or a rigorous five-step evaluation of the model's performance (Krysanova et al. 2018).
Considering the above, this paper's objectives are (1) to carry out a rigorous hydrological model evaluation to ensure suitability for climate change impact assessments, (2) to assess how well different climate models are able to reproduce actual climate change impact on streamflow, and (3) to provide an evaluation of eight climate model sub-selection methods.

Study area
The Upper Danube up to the gauge Vienna has a basin area of 101,810 km², where annual precipitation ranges from below 500 mm in the eastern lowlands to up to 3000 mm in the Alps. Dominant land uses are forest, grassland, and agriculture, whereas bare areas and glaciers are located in the high alpine regions. The basin has a high elevation gradient ranging from about 200 to 4000 m a.s.l. Overall, the basin is very heterogeneous, with alpine sub-basins such as the Inn or lowland basins such as the Naab and their distinct seasonal streamflow regimes (Fig. 1). At Vienna, the alpine influence with higher streamflow in summer prevails. In the alpine basin of the Inn (gauges Oberaudorf and Hofkirchen), streamflow in summer has been reduced, while streamflow in winter has increased, attributed to an increase in warm years within the recent decades (Kling et al. 2012b; Stanzel and Kling 2018). Due to this, the Upper Danube's streamflow has undergone a significant shift in seasonality, showing that climate change has already impacted the basin's hydrological processes.

Hydrological model
The model COSERO (Kling et al. 2015) is a conceptual hydrological model that includes the simulation of snowfall and melt processes, positive and negative glacier mass balances, actual evapotranspiration, fast runoff, and baseflow components, as well as reservoir storages and releases. In the application of the model for the Upper Danube (Kling et al. 2012a), the model was forced with HISTALP (Auer et al. 2007; Böhm et al. 2009) climate data, available as monthly time series of precipitation, temperature, and sunshine duration at 30 stations from 1800 to 2014. For precipitation, HISTALP was merged with annual 1-km²-gridded precipitation data (Kling et al. 2007), while temperature was further regionalized by a linear regression model based on elevation. Potential evapotranspiration (PET) was calculated using HISTALP sunshine duration, the Angstrom equation (Angstrom 1924), and the Turc PET method (Turc 1961). The model was discretized into 16 sub-basins, where each sub-basin outlet is located at a streamflow gauge (Fig. 1). The sub-basins were further divided into 61 hydrological response units (HRUs) according to elevation bands of 500 m elevation difference. COSERO was applied on a monthly time step from 1800 to 2014. Model parameters were calibrated for the time period 1961-1990 for 16 gauges, combining manual and automatic calibration methods and using the Kling-Gupta efficiency (KGE', Kling et al. 2012a) and its subcomponents as the objective function. Detailed evaluation of model simulations with observed streamflow data was carried out in the independent evaluation periods 1901-1930, 1931-1960, and 1991-2007 as well as in unusually warm years.
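The objective function can be illustrated with a minimal sketch. It assumes the commonly published form of the modified Kling-Gupta efficiency, built from the correlation r, the bias ratio β (ratio of means), and the variability ratio γ (ratio of coefficients of variation); the authoritative equations are those in Kling et al. (2012a). The function name `kge_prime` and the streamflow values are hypothetical.

```python
import numpy as np

def kge_prime(sim, obs):
    """Return (KGE', r, beta, gamma) for simulated vs. observed streamflow."""
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    r = np.corrcoef(sim, obs)[0, 1]        # linear correlation
    beta = sim.mean() / obs.mean()         # bias ratio (ratio of means)
    # variability ratio: ratio of the coefficients of variation
    gamma = (sim.std() / sim.mean()) / (obs.std() / obs.mean())
    kge = 1.0 - np.sqrt((r - 1) ** 2 + (beta - 1) ** 2 + (gamma - 1) ** 2)
    return kge, r, beta, gamma

# Hypothetical monthly streamflow values (m3/s), for illustration only
obs = np.array([1200.0, 1500.0, 2100.0, 2600.0, 2400.0, 1700.0])
perfect = kge_prime(obs, obs)        # identical series: all components at unity
biased = kge_prime(1.1 * obs, obs)   # 10 % overestimation only affects beta
```

A uniform 10% overestimation leaves r and γ at unity and lowers KGE' only through β, which is why the subcomponents are useful for diagnosing the source of a mismatch.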

Climate change data
Monthly precipitation and monthly temperature were acquired from 16 climate models from the EURO-CORDEX initiative (Table 1) (Jacob et al. 2014). We chose the Representative Concentration Pathway (RCP) 8.5 emission scenario because emissions recorded since 2006 have followed the RCP8.5 trajectory most closely (Sanford et al. 2014). COSERO was forced with both the raw and bias-adjusted EURO-CORDEX data. For the bias-adjustment, linear scaling was applied on both precipitation and temperature data for the time period 1961-1990 prior to forcing the COSERO model with the data. PET was calculated with the Turc PET method for each climate model temperature dataset, assuming global radiation values of the early twenty-first century. From the model runs, hindcasted streamflow from 1960 to 2014 was obtained (note that 1970-2014 was used for the SMHI-RCA GCM family of models). For each hindcasted streamflow time series, the average annual change in streamflow from the 1960-1989 (reference) to the 1990-2014 (evaluation) period (ΔQ in Table 1) was calculated. This ΔQ is used as the target parameter to select the median model from different sub-ensembles.
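The two calculations in this section can be sketched as follows. The sketch assumes the conventional form of linear scaling (a multiplicative factor per calendar month for precipitation and an additive offset per calendar month for temperature) and assumes that ΔQ is an absolute difference of period means; the text does not specify either detail. The function names and all data are hypothetical.

```python
import numpy as np

def linear_scaling(model, obs, years, months, cal=(1961, 1990), additive=False):
    """Adjust a monthly model series so that its calibration-period mean of
    each calendar month matches the observed mean of that month."""
    model = np.asarray(model, dtype=float)
    obs = np.asarray(obs, dtype=float)
    in_cal = (years >= cal[0]) & (years <= cal[1])
    adjusted = model.copy()
    for m in range(1, 13):
        sel = months == m      # all occurrences of calendar month m
        ref = sel & in_cal     # occurrences within the calibration period
        if additive:           # assumed form for temperature
            adjusted[sel] = model[sel] + (obs[ref].mean() - model[ref].mean())
        else:                  # assumed form for precipitation
            adjusted[sel] = model[sel] * (obs[ref].mean() / model[ref].mean())
    return adjusted

def delta_q(q, years, ref=(1960, 1989), ev=(1990, 2014)):
    """Change in mean streamflow from the reference to the evaluation period."""
    q = np.asarray(q, dtype=float)
    in_ref = (years >= ref[0]) & (years <= ref[1])
    in_ev = (years >= ev[0]) & (years <= ev[1])
    return q[in_ev].mean() - q[in_ref].mean()

# Synthetic monthly time axis 1960-2014 and synthetic data (illustration only)
years = np.repeat(np.arange(1960, 2015), 12)
months = np.tile(np.arange(1, 13), 55)
rng = np.random.default_rng(0)
obs_t = 8 + 10 * np.sin(2 * np.pi * (months - 4) / 12) + rng.normal(0, 1, years.size)
mod_t = obs_t + 2.0 + rng.normal(0, 0.5, years.size)   # model with a warm bias
adj_t = linear_scaling(mod_t, obs_t, years, months, additive=True)

obs_p = rng.gamma(2.0, 40.0, years.size)
adj_p = linear_scaling(0.8 * obs_p, obs_p, years, months)  # model with a dry bias

in_cal = (years >= 1961) & (years <= 1990)
step_q = np.where(years >= 1990, 1900.0, 2000.0)   # synthetic step change in flow
```

Because the correction is derived from 1961-1990 only, it removes the mean bias in that period but leaves the model's own change signal between the two periods untouched.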

Hydrological model evaluation
A rigorous model evaluation, if successful, increases the confidence in the results of the climate change impact assessment (Huang et al. (this SI); Gelfan et al. (this SI)). This is especially important since here we only consider one hydrological model. Therefore, the evaluation of the calibrated model is carried out according to the following five-step criteria proposed by Krysanova et al. (2018):
Quality of observational data (step 1): Observed climate input data used to parameterize the model, particularly temperature and precipitation, has to be spatially representative, homogeneous over the whole time period, and of high quality. Streamflow data has to be observed with up-to-date rating curves, be consistent between upstream and downstream gauges, and be available for the full calibration and validation period.
Good performance for periods with different climate (step 2): The hydrological model has to reproduce the streamflow situation in exceptionally dry and wet as well as hot and cold years. Model performance is therefore assessed separately for the 10 warmest, 10 coldest, 10 wettest, and 10 driest years within the period 1901 to 2007 at the gauge Vienna.
Good performance at multiple sites and for multiple variables (step 3): The Upper Danube is a heterogeneous basin ranging from alpine and pre-alpine to lowland regions. Since the climate model's spatial representation of precipitation and temperature is of key importance in mountainous basins such as the Upper Danube, the hydrological model has to perform adequately in all regions so as not to cause a spatial bias in the evaluation. The mass balance of the glaciers in the high alpine region of the Upper Danube is an additional variable that has been evaluated. While the contribution of glacier mass balance is less than 1% of the annual streamflow at Vienna, it is nevertheless an important indicator to assess the model's process depiction in high-alpine regions that generally contribute most of the streamflow. Having a realistic simulation of glacier mass balance ensures that the distinction of precipitation into rainfall and snowfall, as well as the simulation of snow- and ice-melt, is plausible.
Reproducibility of the hydrological indicator of interest (step 4): Due to increasing temperatures within the last 60 years, streamflow seasonality at Vienna has been flattened, with a reduction in summer streamflow and an increase or no change in the remaining seasons. Here, we test the potential of the hydrological model forced with observed climate data to reproduce the long-term inter-annual seasonal shift.
Reproducibility of significant trends in the streamflow (step 5): An ordinary Mann-Kendall trend test (Mann 1945; Kendall 1955; implemented in Hussain and Mahmud 2019) is applied on the observed and simulated streamflow data at Vienna. We then compare whether, and to what extent, significant and insignificant trends match between the observed and simulated series.
Table 1 (caption fragment): [...] (0), model is selected within a subset (> 0), and finally selected models (= median of the subset) for the evaluated models are marked bold: Dem = model democracy, DivG = diversity of GCMs, DivR = diversity of RCMs, MIMR = maximum information minimum redundancy, bCl = best hindcasted climate, bSf = best hindcasted streamflow, sWgt = simple weights, REA = reliability ensemble average
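The trend comparison in step 5 can be sketched with a self-contained version of the ordinary Mann-Kendall test. This is not the package cited above but a minimal re-implementation of the standard test statistic with the usual tie correction and a two-sided normal approximation; the function name, the 5% significance level, and the example series are assumptions for illustration.

```python
import math
from collections import Counter

def mann_kendall(x, alpha=0.05):
    """Ordinary Mann-Kendall test: returns (trend label, p-value, Z)."""
    n = len(x)
    # S counts concordant minus discordant pairs
    s = sum((x[j] > x[i]) - (x[j] < x[i])
            for i in range(n - 1) for j in range(i + 1, n))
    # Variance of S with correction for groups of tied values
    ties = Counter(x).values()
    var_s = (n * (n - 1) * (2 * n + 5)
             - sum(t * (t - 1) * (2 * t + 5) for t in ties)) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value
    if p >= alpha:
        trend = "no trend"
    else:
        trend = "increasing" if z > 0 else "decreasing"
    return trend, p, z

# Hypothetical series: a monotonic rise, its reverse, and an oscillation
rising = mann_kendall(list(range(10)))
falling = mann_kendall(list(range(10, 0, -1)))
flat = mann_kendall([1, 2, 1, 2, 1, 2])
```

Applying the function to the observed and the simulated series of the same gauge and comparing the two trend labels gives the kind of agreement check described in step 5.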

Impact of climate change on precipitation, temperature, and streamflow
We investigate the impact of climate change that occurred from the reference to the evaluation period for observed and hindcasted data. To this end, we calculate the average change as well as the spatial variability (standard deviation of change over the 16 sub-basins) in temperature, precipitation, and streamflow. This analysis provides insights into how strongly temperature, precipitation, or the combination of both are linked to streamflow change.
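As a minimal sketch of these two statistics, the following function takes the period means of one variable for each sub-basin and returns the mean change and its standard deviation across sub-basins. Whether the population or sample standard deviation was used is not stated; the population form (NumPy default) is assumed here, and the function name and numbers are hypothetical.

```python
import numpy as np

def change_statistics(ref_mean, eval_mean, relative=True):
    """Mean change and spatial standard deviation of change over sub-basins.

    ref_mean, eval_mean: one period-mean value per sub-basin.
    relative=True gives percent change (e.g., precipitation, streamflow);
    relative=False gives absolute change (e.g., temperature in degrees C).
    """
    ref_mean = np.asarray(ref_mean, dtype=float)
    eval_mean = np.asarray(eval_mean, dtype=float)
    change = eval_mean - ref_mean
    if relative:
        change = 100.0 * change / ref_mean
    return float(change.mean()), float(change.std())

# Two hypothetical sub-basins: +10 % and -10 % change in streamflow
mean_change, spatial_sd = change_statistics([100.0, 200.0], [110.0, 180.0])
```

The example shows why both statistics are needed: opposite changes in different sub-basins cancel in the mean but appear in the spatial standard deviation.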

Ensemble selection methods
Eight different sub-selection methods are applied using data from the reference period and from the 16 raw (not bias-adjusted) EURO-CORDEX datasets (Table 1):
Democracy (Dem): all models are treated equally (ensemble of opportunity, see e.g., IPCC 2013); the median model is selected from the full ensemble and, if the number of ensemble members is even, is calculated as the mean between the two central members.
Diversity of GCM (DivG): from each GCM group, the model that best represents hindcasted climate seasonality is selected, leading to a subset of seven models from which the median is selected.
Diversity of RCM (DivR): same as for DivG, but here RCMs are used instead of GCMs.
Maximum Information Minimum Redundancy method (MIMR): the MIMR (Li et al. 2012) method is implemented to select the members that maximize independence and minimize redundancy and contain more than 90% of the hydro-climatic spatial information.
Best performing hindcasted climate depiction (bCl): the model that best represents hindcasted climate seasonality is chosen.
Best performing hydrological model (bSf): from all COSERO realizations driven by the raw hindcasted climate data, the model that best represents the hindcasted streamflow seasonality is chosen.
Simple climate model weighing (sWgt): models are ranked according to hindcasted climate performance; the worst model is used once in the ensemble and each subsequently better model one more time than the previous one, leading to an ensemble set of 136 members (1 + 2 + ... + 16) from which the median model is selected.
Reliability Ensemble Average (REA): model weights (see details regarding weighting calculations in the Appendix) are increased depending on historical climate performance of the raw climate model data as well as the consensus of models in the future regarding the hydrological indicator of interest (Tebaldi and Knutti 2007).
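Two of the simpler methods, Dem and sWgt, can be sketched in a few lines. The sketch assumes that climate performance is expressed as an RMSE (lower is better) and that ties in performance do not occur; the function names and input values are hypothetical and do not reproduce Table 1.

```python
import numpy as np

def democracy_median(delta_q):
    """Dem: median of the full ensemble; for an even number of members,
    np.median returns the mean of the two central members."""
    return float(np.median(delta_q))

def simple_weights_median(delta_q, climate_rmse):
    """sWgt: repeat each model according to its climate-performance rank
    (worst model once, best model n times) and take the median."""
    delta_q = np.asarray(delta_q, dtype=float)
    n = delta_q.size
    worst_first = np.argsort(climate_rmse)[::-1]   # highest RMSE first
    counts = np.empty(n, dtype=int)
    counts[worst_first] = np.arange(1, n + 1)      # 1 copy for worst ... n for best
    expanded = np.repeat(delta_q, counts)
    return float(np.median(expanded)), int(expanded.size)

# Hypothetical inputs: 16 models whose change signal grows with climate skill
dq = np.arange(16, dtype=float)         # stand-in for each model's streamflow change
rmse = 16.0 - np.arange(16)             # model 0 is worst, model 15 is best
dem = democracy_median(dq)
swgt, n_members = simple_weights_median(dq, rmse)
```

With 16 models the expanded set has 1 + 2 + ... + 16 = 136 members, matching the ensemble size stated above, and the median shifts toward the models with better climate performance.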

Evaluation of model ensembles
Each of the 16 EURO-CORDEX models, all medians from all possible model combinations, and the eight ensemble selection and weighing methods (Table 1) are evaluated according to how well they capture the difference in simulated streamflow change within the four seasons. The change in simulated streamflow based on the observed climate is used as benchmark to ensure that the evaluated difference is only attributed to the climate change signal, while any hydrological model impacts are excluded. For the evaluation, the change in streamflow simulated with hindcasted climate (henceforth designated "hindcasted") within each season from the reference period (1960-1989) to the evaluation period (1990-2014) is therefore compared with the change in streamflow simulated with observed climate (henceforth designated "observed"). The average RMSE over the four seasons between the observed and simulated change is selected as the measure of agreement.
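One possible reading of this measure is sketched below: the root of the mean squared difference between hindcasted and observed seasonal streamflow change, taken over the four seasons. The exact aggregation is not spelled out in the text, so this form is an assumption, as are the function name and the example values.

```python
import numpy as np

SEASONS = ("DJF", "MAM", "JJA", "SON")

def seasonal_change_rmse(hindcasted, observed):
    """RMSE between hindcasted and observed seasonal streamflow change (m3/s).

    Both arguments map each season to the change in mean streamflow from the
    reference to the evaluation period.
    """
    h = np.array([hindcasted[s] for s in SEASONS], dtype=float)
    o = np.array([observed[s] for s in SEASONS], dtype=float)
    return float(np.sqrt(np.mean((h - o) ** 2)))

# Hypothetical seasonal changes (m3/s), for illustration only
observed = {"DJF": 30.0, "MAM": -20.0, "JJA": -240.0, "SON": 35.0}
hindcasted = {"DJF": 33.0, "MAM": -16.0, "JJA": -240.0, "SON": 35.0}
score = seasonal_change_rmse(hindcasted, observed)
```

Because the benchmark is streamflow simulated with observed climate rather than gauged streamflow, a score of zero would mean that the climate model reproduces the change signal exactly as seen through the same hydrological model.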

Hydrological model performance assessment
Here, we assess the hydrological model performance using the five-step evaluation approach (see section 3.1).
Step 1: Climate data is based on the high-quality HISTALP dataset, whose development started in the early 1990s and involved 33 organizations and data providers from 15 countries. The HISTALP initiative focuses on long time series utilizing all available systematically observed data, with a dense network adequate for the spatial heterogeneity, outliers removed, gaps filled, and changes in measurement techniques accounted for. Regarding streamflow, the 16 gauges used are key gauges of the most important rivers in the Upper Danube basin. Hydrographic Services carry out multiple streamflow measurements per year to keep the rating curve up-to-date. In addition, streamflow values are verified with upstream and downstream gauges to avoid any possible inconsistencies. The period of available data differs between gauges, with data available for all gauges from 1954 to 2007. See also Kling et al. (2012a, b). We therefore conclude that data quality is acceptable and does not hamper achieving good model performance.
Step 2: The model performs consistently well over the different climatic conditions of the 10 driest, 10 warmest, 10 coldest, and 10 wettest years at the gauge Vienna, with the lowest KGE' of 0.85 for the driest years and the highest KGE' of 0.9 for the warmest years (see Table 2).
Step 3: The model performs well across the 16 sub-basins with the KGE' varying from 0.81 to 0.97 (Table 2). The model shows a tendency to perform better for the snow-dominated alpine sub-basins (south of the main stem of the Upper Danube; Fig. 1), with all KGE' values being greater than 0.9, than for the pre-alpine and lowland region (all KGE' values < 0.9). Besides streamflow, the agreement between observed and simulated change in glacier mass is investigated. Observations of glacier mass are available for the Vernagtferner glacier (Weber 2003), which is located in sub-basin 9 (Fig. 1). The COSERO model simulates glacier processes on the full sub-basin scale and, besides the Vernagtferner, includes multiple other glaciers in sub-basin 9. Therefore, a comparison between observed and simulated glacier mass is only indirectly possible. Figure 2a compares the observed and simulated relative change in glacier mass from 1840 to 2000. The underestimation of simulated glacier mass loss can be due to the lumped glacier depiction, which is compared with data from one glacier only. This higher observed mass change compared with other glaciers in the sub-basin seems realistic since the Vernagtferner shows characteristics which increase its vulnerability to melting processes: The Vernagtferner has largely lost its protective and comparably shallow and wide firn-body and has significant areas with south-facing slopes and multiple, rather short and wide, glacier tongues (Braun and Escher-Vetter 2013).
Table 2 Performance statistics for the full time series of the 16 gauges and for the 10 warmest, 10 coldest, 10 wettest, and 10 driest years of the 107-year period at Vienna (KGE' = modified Kling-Gupta efficiency, r = correlation, β = bias ratio, γ = variability ratio; all dimensionless; ideal value at unity, equations given in Kling et al. 2012a)
Step 4: The model is able to reproduce well the observed streamflow seasonal shift from the reference to the evaluation period; however, there is a slight overestimation of streamflow change in the winter period (Dec.-Feb.) and a slight underestimation of the change in the autumn period (Sep.-Nov.; Fig. 2b).
In addition, on the monthly time step, the model is able to adequately reproduce the absolute streamflow as well as the change from the reference to the evaluation period (Fig. 2c).
Step 5 Driven by the above results, we conclude that the Upper Danube COSERO model is suitable for a climate change impact assessment.

Impact of climate change
Here, we investigate the changes in temperature, precipitation, and streamflow using both observed and simulated data (Fig. 3). Temperature increase is consistently shown in all 16 models and the observations across all seasons as well as for the annual average (Fig. 3a). Compared with the observations, most models predict a lower temperature increase in spring (Mar.-May). The magnitude of the temperature increase varies between the models; e.g., model 11 (MPI-REMO + MPIr2) indicates the lowest temperature increase, while model 7 (IPSL-WRF + IPSL) indicates the highest temperature increase across the Upper Danube basin.
Precipitation is subject to a more heterogeneous change from the reference to the evaluation period than temperature. Observed precipitation increases in summer and autumn and decreases slightly in winter and spring. However, no model is capable of fully reproducing this change. In spring, all models predict increased precipitation, while for the remaining seasons and the average annual changes, precipitation both increases and decreases depending on the climate model. Model 8 (KNMI-RACMO + ICHEC) shows a similar pattern in precipitation as the observations (increase in autumn and no significant change over the remaining seasons) and also a consistent increase in temperature during all seasons. Consequently, this model performs best in predicting the streamflow change.
Most climate model-based simulations predict a reduction in streamflow for summer and autumn and an increase in winter and spring. For the annual average, the projected streamflow change varies between climate models. Observed streamflow changes show a strong reduction in summer, a slight reduction in spring (MAM), and increases in the other seasons. The observed annual average streamflow slightly decreases from the reference to the evaluation period.
Reduction in precipitation in winter (e.g., models 4, 10, and 16) does not directly impact winter streamflow, while rising temperatures and subsequently more rainfall and less snowfall in winter cause an increase in winter streamflow. Increasing temperatures mean less accumulation of water in the snowpack and earlier onset of snowmelt in spring, with the overall effect of decreased streamflow in summer. The increasing temperatures and consequently increasing potential evapotranspiration in the summer period cause a further reduction in streamflow.
Fig. 3 Information about the GCM and RCM for the mean statistic of the hydro-climatic variables and for different seasons: a mean relative change (%) from the 16 sub-basins from reference to evaluation period (note for temperature the actual change (°C) × 10 is used to fit the color scale) and b standard deviation in relative change (%) from reference to evaluation period which shows the heterogeneity across the 16 sub-basins (note for temperature the actual standard deviation (°C) × 30 is used to fit the color scale) (Projection ID on x-axis refers to model nr.-column in Table 1; Ob refers to observed data)
Spatial heterogeneity of observed streamflow and precipitation change across the Upper Danube is highest during spring and lowest during summer, while that of observed temperature change is generally low (Fig. 3b). This indicates that temperatures from reference to evaluation period have changed similarly across the basin, while precipitation has changed in a more spatially distributed manner. Modeled temperature change is more variable across the 16 sub-basins during winter and less variable during summer and autumn and also for the annual average. Models 2, 10, and 16 show the lowest temperature variability over the seasons. Apart from model 3 and the observation, there is no clear relationship projected between the change in precipitation and streamflow. Overall, all models have difficulties in reproducing the observed spatial change pattern across the basin.

Hindcasted climate sensitivity
Here, we investigated whether change in precipitation and temperature can be linked to a change in streamflow. The spatial pattern of change for temperature, precipitation, and streamflow in the Upper Danube is complex due to alpine processes and the spatial physiographic heterogeneity. These processes are valuable for the model selection: for instance, for a model to perform well in predicting the observed streamflow change in the Upper Danube, it seems important that the projected increase in winter and spring temperature is correct (models 6 and 8). However, inferring the ability to predict streamflow change from temperature and precipitation change is challenging for the remaining models (models 1-5, 7, 9-16). These results highlight the importance of evaluating streamflow as a target variable directly and may explain why precipitation and temperature alone provide insufficient information to select climate models in a complex basin such as the Upper Danube. We therefore investigated the link between a model's potential to reproduce historical climate seasonality and streamflow change. The RMSE values of the observed against hindcasted climate seasonality for the period 1961-1990 in the Upper Danube for all EURO-CORDEX models are available from Stanzel and Kling (2018). These RMSE values for precipitation and temperature were ranked and compared with the rank of the RMSE of the streamflow change for all GCM-RCM combinations (Table 3) to assess whether models that depict past climate well are more likely to depict the streamflow change. Figure 4 shows that no relation exists between the models that perform well for temperature, precipitation, or both combined, and the models that best predict the streamflow change.
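The rank comparison can be illustrated with a short sketch. The text compares ranks graphically (Fig. 4); summarizing that comparison with a Spearman rank correlation is an addition made here for illustration, not a statistic reported in the study. The function names and RMSE values are hypothetical, and ties are assumed not to occur.

```python
import numpy as np

def ranks(values):
    """Rank values so that the smallest (best RMSE) gets rank 1."""
    order = np.argsort(values)
    r = np.empty(len(values), dtype=int)
    r[order] = np.arange(1, len(values) + 1)
    return r

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

# Hypothetical RMSE values for four models (illustration only)
climate_rmse = [0.8, 1.5, 1.1, 2.0]          # climate seasonality error
flow_rmse_same = [40.0, 150.0, 90.0, 230.0]  # same ordering as climate error
flow_rmse_rev = [230.0, 90.0, 150.0, 40.0]   # opposite ordering

agree = spearman(climate_rmse, flow_rmse_same)
oppose = spearman(climate_rmse, flow_rmse_rev)
```

A coefficient near zero for the real ensemble would correspond to the absence of a relation described above, whereas values near 1 would indicate that good climate depiction goes along with good streamflow change depiction.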

Identification of model subsets
Applying the eight different sub-selection methods leads to a different selection of EURO-CORDEX models for each method (Table 1, bold marked numbers). Overall, average seasonal RMSE values range between 38 and 241 m³/s across all models, which is substantial given that the actual change signal lies in the same order of magnitude (+ 36.2 to − 242.7 m³/s).
While for the methods DivG, DivR, bCl, bSf, and sWgt exactly one median model is selected, the remaining methods lead to a selection of multiple models. The ensemble of opportunity (Dem) method selects model 6 (DMI-HIRHAM + ICHEC) and model 9 (KNMI-RACMO + MOHC), while model 3 (CLMcom-CCLM + MOHC) and model 16 (SMHI-RCA + MPI) are selected by MIMR (note that according to the latter method those two models contain about 90% of the information content of the large ensemble set; see the sum of red dots per column in Fig. 5). Finally, based on the REA method, change is calculated using all models with the corresponding weights (right column in Table 1).
Table 3 RMSE between the observed (simulated streamflow with observed climate) and hindcasted streamflow change, corresponding rank (1 = best) and seasonal streamflow change from the reference to the evaluation period for each climate model and sub-selection method; bold font indicates the correct depiction of the observed direction of change within the season, italic font indicates the correct depiction of the size proportion of the change over all seasons
Fig. 4 Ranks of all GCM and RCM models for RMSE between observed and simulated streamflow change (x-axis) and for RMSE between hindcasted and observed climate seasonality; rank 1 denotes best representation, TMP temperature, PCP precipitation
Streamflow change from the reference to the evaluation period was calculated for all selected subset models (Table 3). Streamflow change results of the sub-selection methods MIMR, model democracy (Dem), and REA are not directly related to one specific GCM-RCM model (first 3 rows in Table 3). The medians of all other methods correspond to results of one model from the subsets of all GCM-RCM models. None of the medians and means selected from the EURO-CORDEX models reproduce the observed change better than model 8 (KNMI-RACMO + ICHEC, Table 3).
Model 9 (KNMI-RACMO + MOHC), model 6 (DMI-HIRHAM + ICHEC), and the median resulting from the democracy (Dem) selection method are the only ones able to correctly reproduce the direction of change in all seasons. Moreover, only model 16 (SMHI-RCA + MPI) is capable of reproducing the correct relative change, with positive (negative) change being highest in autumn (summer), and also the change in winter being higher than the change in spring (marked italics in Table 3).
The streamflow simulations from the EURO-CORDEX models and all their possible medians (colored lines in Fig. 6) show a wide range of simulated streamflow change between the two periods and within the different seasons. Some models are able to generally reproduce the seasonal change pattern. However, most models predict very high positive streamflow changes from winter to summer and very low or even negative streamflow changes in autumn.
All selection methods, apart from the best streamflow (bSf) and best climate (bCl) methods, seem to constrain the simulations to the ones that are capable of reproducing the seasonal change pattern (visual inspection of Fig. 6). The best performing methods are simple weights (sWgt), MIMR, and diversity of RCMs and GCMs (DivG and DivR) (Table 3). The simple weights (sWgt) method selects model 8 (KNMI-RACMO + ICHEC) as the best-performing model with an RMSE of 38 m³/s. The best streamflow (bSf) method led to model 11 (MPI-REMO + MPIr2), which has the second-worst streamflow change depiction (RMSE of 238 m³/s). Between ranks 4 and 8, the RMSE model performance deteriorates from 91 to 99 m³/s, indicating that the DivG, REA, and Dem methods result in similar errors. Finally, RMSE significantly increases for the remaining methods (bCl and bSf) and models.
Fig. 5 (caption fragment): [...] Table 1). Boxes with a red dot indicate membership to a subset representing > 90% of the total information from the large ensemble of 16 members. Shades of gray relate to the ranking of the climate model in terms of information content for the respective parameter and season (from 1 to 16 with 1 being the most important)

Inter-comparison of sub-selection methods
We generally show that it is not recommended to select climate models for impact studies based on the depiction of past climate alone. This is also in agreement with expectations from Knutti et al. (2010) concluding that historically best performing models may not be the best for reproducing the climate change signal. On the contrary, Padron et al. (2018) conclude that simulations that spatially agree better with observations of long-term mean precipitation are likely more reliable for future projections. We found that no relation exists between how well the models are able to reproduce past climate and how well they depict the change signal from the reference to the evaluation period (see also Giorgi and Coppola 2010).
We followed a simplified approach to maintain diversity by selecting different GCMs and RCMs without analyzing common algorithms, boundary conditions, and model assumptions (Abramowitz et al. 2019). Nevertheless, methods that considered multiple models or maintained the diversity of models (weighting all models based on climate seasonality, diversity of RCMs, MIMR) generally performed better than the selection of single models, even if those models reproduce the indicator of interest (in this case streamflow seasonality) well.
The concept of respecting model diversity originates from the key assumption that errors level out if models are diverse, and hence a projection can be better represented through averaging, i.e., taking the mean, median, or weighting (Tebaldi and Knutti 2007). This conclusion is also supported by Thober and Samaniego (2014). Pechlivanidis et al. (2018) discuss this issue in more detail and argue that it is important to consider both diversity and information content in ensemble model sub-selection. In addition, they argue that downweighting or excluding models that introduce redundancy (i.e., due to model structure similarities) can lead to robust climate change projections. Our results support these statements, also considering that the MIMR method performed second-best overall.
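The averaging argument can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function names (`combine_ensemble`, `rmse`), the four-season layout, and all numbers are hypothetical and were chosen so that the model errors partly cancel. The sketch combines per-model seasonal change signals by mean, median, and weighted mean, and compares each result with an assumed observed change.

```python
import numpy as np

def combine_ensemble(changes, weights=None):
    """Combine per-model change signals (models x seasons) in three ways."""
    changes = np.asarray(changes, dtype=float)
    if weights is None:
        # Equal weights reduce the weighted mean to the plain mean.
        weights = np.ones(changes.shape[0])
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    return {
        "mean": changes.mean(axis=0),
        "median": np.median(changes, axis=0),
        "weighted": weights @ changes,
    }

def rmse(sim, obs):
    """Root mean square error between two change signals."""
    diff = np.asarray(sim, dtype=float) - np.asarray(obs, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Hypothetical seasonal streamflow change (m^3/s); values are illustrative only.
observed = np.array([100.0, 50.0, -20.0, -60.0])
# Hypothetical model errors with differing signs, so they partly offset each other.
errors = np.array([
    [80.0, -40.0, 30.0, -10.0],
    [-60.0, 50.0, -30.0, 20.0],
    [10.0, 5.0, -10.0, 5.0],
])
models = observed + errors

combined = combine_ensemble(models)
for i, model in enumerate(models, start=1):
    print(f"model {i}: RMSE = {rmse(model, observed):.1f}")
for name, signal in combined.items():
    print(f"{name}: RMSE = {rmse(signal, observed):.1f}")
```

In this constructed case the ensemble mean has a lower RMSE than any single member because errors of opposite sign offset each other. With strongly correlated model errors that benefit would shrink, which is the redundancy issue raised by Pechlivanidis et al. (2018).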
Being able to reproduce the indicator of interest in the past is no guarantee that the models will predict future changes similarly well, as the inferior performance of the best streamflow (bSf) method shows. Conclusions largely depend on how well past climate change processes, model boundary conditions, and forcing datasets are implemented in the climate models. We believe that the dynamic error between observed and simulated shift in streamflow cannot be attributed to the static biases inherent in the EURO-CORDEX projections (Kotlarski et al. 2014). Our results support this, given that (1) biases were adjusted prior to the streamflow simulation and (2) models with higher biases did not perform worse than others.
In addition, we conclude that models that are unable to reproduce the past shift in streamflow seasonality are not particularly well suited to depict an expected, similar or probably aggravated shift in the future. Consequently, the methods evaluated in this study set boundaries for excluding certain models, and favoring others, in impact assessments.
Overall, we recommend splitting historical observations into a reference and an evaluation period to understand recent impacts of climate change on the indicator of interest. The observed change in the indicator of interest can then be used to test hindcasted climate data, assess and rank the quality of the climate models, and eventually weight their contribution for impact assessment, following the suggestion of Eyring et al. (2019).
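The recommended split-sample test can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the workflow used in this study. The period boundaries are those given in the abstract (1960–1989 and 1990–2014). The function names (`seasonal_change`, `rank_models`), the four-season array layout, the computation of RMSE across seasonal means, and all synthetic flows are assumptions made for illustration.

```python
import numpy as np

REFERENCE = (1960, 1989)   # reference period used in the study
EVALUATION = (1990, 2014)  # evaluation period used in the study

def seasonal_change(years, seasonal_flow, ref=REFERENCE, evl=EVALUATION):
    """Mean seasonal flow in the evaluation period minus the reference period.

    years: array of shape (n_years,)
    seasonal_flow: array of shape (n_years, n_seasons)
    """
    years = np.asarray(years)
    flow = np.asarray(seasonal_flow, dtype=float)
    in_ref = (years >= ref[0]) & (years <= ref[1])
    in_evl = (years >= evl[0]) & (years <= evl[1])
    return flow[in_evl].mean(axis=0) - flow[in_ref].mean(axis=0)

def rank_models(years, observed, simulated):
    """Rank simulations by the RMSE of their seasonal change vs. the observed change.

    simulated: dict mapping a model name to an array (n_years, n_seasons).
    Returns a list of (name, rmse) tuples, best first.
    """
    obs_change = seasonal_change(years, observed)
    scores = {}
    for name, sim in simulated.items():
        diff = seasonal_change(years, sim) - obs_change
        scores[name] = float(np.sqrt(np.mean(diff ** 2)))
    return sorted(scores.items(), key=lambda item: item[1])

# Synthetic example: a step change in seasonal flow (m^3/s) starting in 1990.
years = np.arange(1960, 2015)
step = (years >= 1990)[:, None]                    # True in the evaluation period
base = np.array([1500.0, 2200.0, 2400.0, 1600.0])  # hypothetical seasonal means
observed = base + step * np.array([120.0, 40.0, -80.0, -30.0])
simulated = {
    "model_a": base + step * np.array([100.0, 60.0, -60.0, -40.0]),
    "model_b": base + step * np.array([300.0, 200.0, 150.0, -200.0]),
}

ranking = rank_models(years, observed, simulated)
for name, score in ranking:
    print(f"{name}: RMSE of seasonal change = {score:.1f} m^3/s")
```

The resulting ranking, or weights derived from the RMSE values, could then feed a weighting or sub-selection step. How scores are translated into weights is a separate design choice that this sketch leaves open.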

Regionalization of methodology beyond case study
The spatiotemporal complexity of the Upper Danube basin allows rigorous testing of the climate models' suitability to force a hydrological model. In this river system, there is a gradient in the hydrological response, with both temperature and precipitation strongly affecting streamflow. Consequently, assessing the performance of the climate models is both important and possible. However, we refrain from general recommendations on which methods to use for ensemble sub-selection or which climate models perform best overall, because our results rest on a single case study and only one hydrological model. It thus remains unclear if and how the ranking of sub-selection methods would differ if another hydrological model (or set of models) were used or a different study area were chosen. Nevertheless, by selecting a complex basin and carrying out a rigorous hydrological model evaluation, we expect to have minimized these impacts as much as possible. Our approach is simple and suitable to be tested beyond the study region and in multi-basin experiments, but it requires multiple suitable basins with high-quality, long-term datasets and hydrological models to ensure that the climate change signal can be distinguished from observation errors, hydrological model uncertainty, and natural variability. Such data could be, for instance, based on Blöschl et al. (2017), who found that climate change has caused hydrological change across Europe in thousands of basins, or on multi-basin collections of long-term hydrometeorological data for the USA (Newman et al. 2015; Addor et al. 2017).
In addition to the choice of climate model selection method tested here, the presented evaluation method can be used to test inherent uncertainties, i.e., those driven by the choice of the emission scenario (Addor et al. 2014; Samaniego et al. 2017), initial conditions, downscaling (Wagner et al. 2015) and bias-adjustment methods (Teutschbein and Seibert 2013), hydrological model structure and its parameterization, and the indicator and region of interest (Clark et al. 2016).

Conclusions
This study focuses on the analysis of eight methods for sub-selection of an ensemble of 16 bias-adjusted EURO-CORDEX climate models used in hydrological impact assessments. We test the methods in the Upper Danube basin, where climate change has already affected the streamflow response. To reduce the uncertainty introduced by the hydrological model selection and its parameterization, the model was rigorously evaluated. In particular, we found that:
(1) The range of changes projected by the different sub-selection methods varies as much as the actual change signal; hence, choosing the best-performing sub-selection method matters.
(2) Of the eight ensemble sub-selection methods tested, those maintaining/maximizing diversity and information content outperformed those that relied on the reproduction of historical climate or streamflow.
(3) None of the climate models used here was able to correctly reproduce the spatial heterogeneity of historical precipitation and temperature change in the Upper Danube.
(4) Changes in precipitation and temperature impact streamflow through complex hydrological processes. These can hardly be disentangled by looking at precipitation and temperature alone but can be reproduced by a thoroughly evaluated hydrological model that adequately represents the hydrological processes.
(5) For river systems in which climate change has already impacted the hydrological response and sufficient data are available, it is valuable to use this information about historical climate change impact to test climate model performance. For such regions, the proposed methods could provide a possible option to classify and assess the expected boundaries of climate change impact studies for the near and far future.
Finally, future research should focus on testing the proposed methods using multiple hydrological models in multiple basins located along a strong hydro-climatic gradient (e.g., arid regions, monsoon-influenced regions, snow-dominated or temperate climates) to yield further conclusions relevant to sub-selection methods, climate/hydrological model performance, and uncertainty propagation.
Funding
Open Access funding enabled and organized by Projekt DEAL. This study was funded through the "GLANCE" project (Global change effects on river ecosystems; 01LN1320A) supported by the German Federal Ministry of Education and Research (BMBF). This work was also partially funded by the project AQUACLEW, which is part of ERA4CS, an ERA-NET initiated by JPI Climate, and funded by FORMAS (SE), DLR (DE), BMWFW (AT), IFD (DK), MINECO (ES), and ANR (FR) with co-funding by the European Commission (grant agreement 690462).
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.