1 Introduction

Problems related to freshwater are becoming increasingly important globally, so model-based hydrological projections for the future are of considerable societal relevance. There are still recognized weaknesses and uncertainties in climate modeling and in large-scale hydrological modeling, leading to large spreads of climate impact projections. Therefore, improving the credibility of projections and reducing their uncertainty are urgently needed. However, the progress achieved over the last three decades has been substantial, even if the reduction of projection uncertainty has been limited. Nevertheless, the gradual movement from the category of “unknown unknowns” to “known unknowns” is definitely advantageous.

In order to grasp the extent of this progress, it is instructive to revisit the pioneering paper by Russell and Miller (1990), which showed that general circulation models (GCMs) can be used directly to roughly calculate runoff for the major rivers of the world. This was very important because, in order to build trust in projections for the future, it is necessary to demonstrate that the models can reproduce the historical reference situation. However, the comparison of simulated and observed mean annual runoff for 33 large rivers of the world presented by Russell and Miller (1990) showed very large differences. For example, the modeled mean annual runoff for the Orange was 123 km³/year, compared with the observed value of 11 km³/year; hence, the difference was more than 11-fold. For the Amazon, it was 2332 vs 6300 km³/year. The smallest differences between modeled and observed values, below 10%, were reported for the Amur and the St Lawrence rivers. Large errors were already present in the modeled precipitation. For instance, the modeled precipitation for the Yellow River basin was 1407 km³/year, vs the mean observed value of 547 km³/year. The modeling process then amplified the precipitation error, so that the river-discharge error was much greater in relative terms. Nevertheless, it was a start that demonstrated that progress was needed in various aspects of the modeling process, including the development of basin-wide parameters in the models (Miller et al. 1994). These early papers on the direct use of GCMs were regarded as a benchmark against which later developments could be compared.

Today, GCMs are not used directly to simulate river runoff but serve as drivers of global-, continental-, and catchment-scale hydrological models, which simulate the terrestrial hydrological cycle, including river discharge. However, there is a scale mismatch between the large-scale climate models and the catchment-scale hydrological models driven by them, which needs to be resolved. Water is managed at the catchment scale, and adaptation to changing conditions takes place at the regional or local scale, while global climate models operate on coarse spatial grids of 2–3°; climate model outputs are therefore usually downscaled and bias-corrected (Krysanova et al. 2016). The use of ensembles of climate models for projecting climate impacts, i.e., performing multiple hydrological model simulations driven by the outputs of several climate model runs in order to represent the range of possible futures under the climate scenario of concern, started about 15 years ago (e.g., Milly et al. 2005; Nohara et al. 2006) and is now an established approach.
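To make the idea of bias correction concrete, the sketch below applies monthly linear scaling to a precipitation series in Python. It is purely illustrative: the studies in this SI rely on established, already bias-corrected climate datasets (see Section 2.2) rather than such a toy routine, and all names in the sketch are hypothetical.

```python
import numpy as np

def linear_scaling(p_obs_hist, p_sim_hist, p_sim_scen, months_hist, months_scen):
    """Monthly linear-scaling bias correction of precipitation: scenario values
    are rescaled so that the simulated monthly climatology of the reference
    period matches the observed one (illustrative only)."""
    p_corr = np.array(p_sim_scen, dtype=float)
    months_hist = np.asarray(months_hist)
    months_scen = np.asarray(months_scen)
    for m in range(1, 13):
        obs_mean = np.mean(np.asarray(p_obs_hist, dtype=float)[months_hist == m])
        sim_mean = np.mean(np.asarray(p_sim_hist, dtype=float)[months_hist == m])
        # scale factor maps the simulated monthly climatology onto the observed one
        factor = obs_mean / sim_mean if sim_mean > 0 else 1.0
        p_corr[months_scen == m] *= factor
    return p_corr
```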

However, until almost a decade ago, studies reported in the scientific literature mostly used a single hydrological model to assess climate change impacts on different aspects of water resources. In the meantime, progress in hardware and software has made it possible to perform and repeat, within a short time, complex calculations that use large quantities of data and numerous equations, so that the computational barriers have largely disappeared. The use of multi-model ensembles of hydrological models (HMs) for projecting impacts, along with ensembles of climate scenarios, started in the Water Model Intercomparison Project (WaterMIP; see Haddeland et al. 2011; Hagemann et al. 2012), was continued in the Inter-Sectoral Impact Model Intercomparison Project (ISIMIP; e.g., Dankers et al. 2014; Prudhomme et al. 2013), and is now quite common. The critical barriers today are related to data availability and to understanding of the processes implemented in models, unlike in the past, when the computational burden used to hamper progress (Krysanova et al. 2016).

In addition to the ensemble approach, another important issue is the consideration (or lack of consideration) of hydrological model performance in reproducing observed data. Two approaches can be distinguished in recent climate change impact studies regarding the performance of HMs in the historical period: (1) using an unweighted multi-model ensemble and disregarding model performance (recommended in Christensen et al. 2010; Gudmundsson et al. 2012), and (2) applying models or ensembles of models after their evaluation and taking their performance into account (advised in Prudhomme et al. 2011; Roudier et al. 2016; Krysanova and Hattermann 2017). Traditionally, the first approach is mostly used in global- and continental-scale studies, and the second one is widely applied at the regional and catchment scales.

Several authors (Coron et al. 2012; Krysanova et al. 2016; Donnelly et al. 2016) noted that calibration and validation of a hydrological model are important before applying it for impact assessment, in order to improve its performance and reduce the uncertainty of impacts. The importance of model performance is closely connected with the necessity of model evaluation, which reveals how well the models perform against observed river discharge and other variables in the historical reference period. Model evaluation can be done for calibrated and validated models applied at the regional or catchment scales (see Gelfan et al. 2020), and also for uncalibrated models, which are usually applied at the global scale (see the overview of papers in Krysanova et al. 2020). For a comparison of the performances of regional- and global-scale hydrological models, see Hattermann et al. (2017). The term “evaluation” can have different meanings in relation to hydrological models: sometimes it is understood as equivalent to “calibration and validation,” and sometimes as a step applied independently of calibration (after it or without it).

In their paper on hydrological model performance, Krysanova et al. (2018) discussed both approaches and, based on an analysis of the pros and cons presented in the referenced papers and on examples from recent impact studies, confirmed the hypothesis that “a good performance of hydrological models increases confidence of projected climate change impacts, and decreases uncertainty of projections related to hydrological models.” In addition, they suggested new five-step guidelines for the evaluation of catchment- and global-scale hydrological models in the historical period, as well as criteria for rejecting a poorly performing outlier model from a multi-model ensemble.

This Special Issue (SI) was initiated to test the suggested model evaluation guidelines and to analyze their effects on climate impact results. The main objectives of this SI are as follows: (a) to test the five-step comprehensive model calibration/validation procedure (Krysanova et al. 2018) for the regional-scale hydrological models, (b) to evaluate the performance of the global-scale hydrological models, and (c) to reveal whether the calibration/validation methods and model evaluation results influence climate impacts in terms of the magnitude of the change signal and the uncertainty ranges. This was done in several papers using regional-scale models by comparing the impacts and projection spreads based on a conventional, simplified calibration/validation (only for discharge at the basin outlet) with those based on a comprehensive model evaluation. If the effect is notable, then, since the comprehensive evaluation includes special robustness tests for future climate, the model version based on it should be considered more suitable for impact assessment. We expect that a comprehensive model evaluation, in comparison with the commonly used simplified approach, can lead to more credible climate impact results with a lower uncertainty of projections related to HMs. This hypothesis was tested in the papers of this SI.

2 Overview of studies, case study areas, and data

Twelve thematic papers are included in this Special Issue. Eight of them are focused on the main research question posed in the title: whether and how impact model evaluation influences the results of assessments of climate impacts on water. Seven of those eight papers investigate the influence of model calibration/validation methods on simulated impacts in terms of long-term mean annual and mean monthly changes in river discharge, as well as the effects on projection spreads related to hydrological models. In most cases, a conventional (i.e., simple) calibration/validation approach is compared against an enhanced (or comprehensive) approach based on the five-step model evaluation suggested in Krysanova et al. (2018).

One paper is devoted to a systematic evaluation of global water models in the Arctic region, though impact assessment is not included (Gädeke et al. 2020). The three remaining papers focus on other specific research questions: testing a new calibration method for a large region in Africa (Chawanda et al. 2020), selecting climate ensemble members based on simulated streamflow (Kiesel et al. 2020), and reviewing sources of uncertainty in climate impact projections (Dankers and Kundzewicz 2020); all three are indirectly related to the main topic of this SI.

An overview of the 11 research papers (i.e., all except the review paper), listing their case study areas, models applied, calibration/validation approaches, and main foci, is given in Table 1.

Table 1 An overview of modeling studies included in the Special Issue (abbreviations: U. upper, HM hydrological model, LSM land surface model)

2.1 Case study areas

All case study areas addressed in the papers of this Special Issue are presented on the maps in Figs. 1 and 2. Figure 1 shows the ten river basins located in Europe, Asia, and North and South America, as well as the Southern African region (including the two large drainage basins of the rivers Orange and Limpopo), where various regional-scale models were calibrated/validated and applied for climate impact studies. The main characteristics of these basins and the region are presented in Table 2.

Fig. 1

Case study areas with application of regional-scale models: the drainage basins of the rivers Rhine, Upper Danube, Upper Mississippi, Pajeú (a sub-basin of São Francisco), Mackenzie, Lena, Upper Indus, Godavari (until Tekra), Upper Yellow, and Upper Yangtze and the Southern African region

Fig. 2

Case study areas with application of continental- and global-scale models: pan-European domain and 58 large river basins distributed among eight hydrobelts. Names are added only for river basins larger than 470,000 km²

Table 2 Characteristics of study areas where the regional-scale models were applied. T temperature, P precipitation, Q discharge, NML northern mid-latitude, NST northern subtropical, SST southern subtropical, SDR southern dry, SML southern mid-latitude hydrobelts

Figure 2 presents the pan-European domain (about 79% of the European continent, excluding some areas in the eastern part that drain to the Caspian Sea but including Turkey and a small portion of the Middle East), where the continental-scale model E-HYPE (Hundecha et al. 2020) was applied, and the 58 large river basins on six continents for which the global-scale models were evaluated. These 58 basins are distributed among eight hydrobelts, as defined by Meybeck et al. (2013) (see Fig. 1 in Krysanova et al. 2020). Compared with the Köppen classification of climate zones, the classification of land areas into hydrobelts considers watershed boundaries and other geo-hydrological factors and therefore lends itself well to applications in hydrological modeling. The areas of all 58 basins are larger than 50,000 km², matching the coarse output resolution of the global models, which is 30′ (0.5° × 0.5° latitude-longitude). The drainage areas of 14 of the 58 basins are smaller than 100,000 km², and 10 basins have drainage areas larger than 1 million km². The characteristics of these basins can be found in Table S1 in Krysanova et al. (2020) and Table 1 in Gädeke et al. (2020).

The case study areas where the regional-scale models were applied (Fig. 1) belong to six hydrobelts in total, with six basins located in the northern mid-latitude belt (Table 2). The two largest basins for regional-scale applications are the Lena and the Mackenzie (Fig. 1); the area of the Southern African region also exceeds one million km². The smallest is the Pajeú catchment, a sub-basin of the São Francisco basin. The mean and maximum elevations are lowest in the Upper Mississippi basin. In contrast, the Upper Yellow River basin is mountainous, with elevations above 2673 m a.s.l.

Climatic conditions also differ considerably among the basins where the regional-scale models were applied (Table 2). The average annual temperature is above 25 °C in the Godavari and Pajeú catchments and below 0 °C in three catchments, those of the Mackenzie, Lena, and Upper Yellow rivers. The average annual precipitation ranges from less than 500 mm (in three basins and in Southern Africa) to more than 1000 mm (in three basins). The long-term average runoff ranges from about 10 mm a⁻¹ in Southern Africa to 589 mm a⁻¹ in the Upper Danube, and the runoff coefficient ranges from 0.02 (Southern Africa) to 0.53 (Upper Danube). Average runoff in the Upper Indus is larger than average precipitation owing to the significant contribution of glacier and snow melt, about 70% (Ismail et al. 2020), and therefore a runoff coefficient cannot be meaningfully defined for this basin. The set of regional case study basins thus captures a wide variety of climatic and hydrological conditions.
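For clarity, the runoff coefficient referred to above and in Table 2 is simply the ratio of long-term mean runoff to precipitation. A minimal Python sketch, using rounded illustrative numbers rather than the exact Table 2 entries, is:

```python
def runoff_coefficient(mean_annual_runoff_mm, mean_annual_precip_mm):
    """Return the runoff coefficient Q/P, or None where the ratio is not
    meaningful because runoff exceeds precipitation (e.g., due to a large
    glacier-melt contribution, as in the Upper Indus)."""
    ratio = mean_annual_runoff_mm / mean_annual_precip_mm
    return ratio if ratio <= 1.0 else None

# Rounded, illustrative values (not the exact Table 2 entries):
print(runoff_coefficient(10, 500))    # Southern Africa-like case  -> 0.02
print(runoff_coefficient(589, 1100))  # Upper Danube-like case     -> ~0.54
print(runoff_coefficient(600, 450))   # Upper Indus-like case      -> None
```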

2.2 Climate data and scenarios

The following climate data were used in the research reported in the papers of this Special Issue. For the evaluation of the regional-scale models in the historical period, three reanalysis-based datasets were used: EWEMBI (Lange 2018) for the basins of the Upper Indus, Lena, Mackenzie, and Upper Yangtze and for the Southern African region (Ismail et al., Gelfan et al., Wen et al., and Chawanda et al., this issue); WATCH (Weedon et al. 2011) for the basins of the Rhine, Upper Mississippi, Upper Yellow, and Pajeú (Huang et al. 2020; Koch et al. 2020); and WFDEI (Weedon et al. 2014) for the Godavari (Mishra et al. 2020).

In turn, the global hydrological models (gHMs) were evaluated using model runs driven by the WFDEI and GSWP3 (Kim et al. 2014) data for 57 river basins worldwide (Krysanova et al. 2020) and by the WATCH, WFDEI, GSWP3, and Princeton (Sheffield et al. 2006) data for six Arctic basins (Gädeke et al. 2020). The continental-scale model E-HYPE used the EFAS-Meteo dataset (Ntegeka et al. 2013) with a resolution of 5 km (Hundecha et al. 2020).

Most regional-scale climate impact studies and the impact assessment for 12 large basins with gHMs were performed using four GCMs from ISIMIP2b: GFDL-ESM2M, HadGEM2-ES, IPSL-CM5A-LR, and MIROC5. Only three studies (Huang et al., Koch et al., and Gelfan et al., this issue) applied five GCMs from ISIMIP2a: HadGEM2-ES, IPSL-CM5A-LR, MIROC-ESM-CHEM, GFDL-ESM2M, and NorESM1-M. The continental-scale impact assessment (Hundecha et al. 2020) was driven by five climate scenarios from EURO-CORDEX. More details, including on the hydrological and geospatial data, can be found in the cited papers.

3 Models and methods

3.1 Models applied

Global hydrological models are designed for the continental and global scales and are usually applied at a coarse resolution of 0.5° without calibration. Regional- or catchment-scale hydrological models have a much finer resolution, because they are intended to simulate catchment characteristics using local input data and calibration against observations. Both types of models represent the major components of the hydrological cycle, but the level of detail in the process descriptions is usually higher in the regional models.

In total, nine regional hydrological models (COSERO, ECOMAG, HBV-D, SRM-G, SWAT, SWAT+, SWIM, VIC, and VIC-Glacier), one continental-scale model (E-HYPE), and ten global-scale models, comprising four gHMs (H08, MPI-HM, PCR-GLOBWB, and WaterGAP2), five land surface models (LSMs: DBH, JULES-W1, MATSIRO, ORCHIDEE, and SWAP), and one dynamic global vegetation model (LPJmL), were applied in the studies reported in this SI. The LSMs and LPJmL include the full hydrological cycle with water routing and in that sense can be treated as gHMs as well. The references to all these models and the SI papers in which they were applied are presented in Table 3. Three of the regional models (SWAT, SWIM, and VIC) were applied for 5–6 river basins.

Table 3 Hydrological models applied in this Special Issue

Five catchment-scale hydrological models, ECOMAG, HBV, SWAT, SWIM, and VIC, are described in Table 2 of Krysanova and Hattermann (2017). SWAT+ is a restructured version of SWAT, and VIC-Glacier is an extended version of VIC that includes glacier processes. The semi-distributed model SRM-G was designed to simulate runoff in snowmelt-dominated regions. COSERO is a conceptual hydrological model that includes glacier mass balance and reservoirs. SWAP is a global-scale LSM applied and calibrated at the catchment scale for two large Arctic basins.

The semi-distributed, process-based hydrological model HYPE simulates components of the water cycle and water quality at the catchment scale. It was set up for the pan-European domain (Donnelly et al. 2016) and is referred to as E-HYPE.

Table 2 in Gädeke et al. (2020) presents the nine global water models evaluated for the 57 river basins or the six Arctic basins (Fig. 2). Only one of these nine global models, WaterGAP2, was calibrated (Müller Schmied et al. 2014). More details can be found in the papers cited in Table 3.

3.2 Approaches for model calibration/validation

At first glance, calibration and validation of hydrological models seem to be a well-established procedure: a differential split-sample test (DSS, first proposed by Klemeš 1986) applied for multiple gauges and two or three variables. However, this procedure is very rarely applied rigorously in climate impact studies, especially for large river basins, where in most cases a simple split-sample test only for discharge at the catchment outlet is used, or the models are applied without any calibration, as is usually done for gHMs (Dankers et al. 2014; Hattermann et al. 2017).

Moreover, the traditional DSS test may be insufficient for checking whether hydrological models are ready for impact studies. Refsgaard et al. (2013) therefore suggested a framework for additionally testing models using proxies of future climate conditions, which can be constructed from data in the historical period. Another option is to test models under contrasting historical climate conditions (e.g., Coron et al. 2012; Gelfan and Millionshchikova 2018). Thirel et al. (2015) suggested special protocols for testing models under changing climate conditions, and Beven and Smith (2015) recommended evaluating observational data quality and taking it into account during calibration/validation.

Summarizing the previous recommendations in the literature and drawing on their own experience, Krysanova et al. (2018) suggested five steps for a comprehensive calibration/validation of catchment-scale models intended for impact studies:

  1. Evaluate the quality of observational data and take it into account during the model calibration/validation;

  2. Apply a differential split-sample test for calibration/validation to optimize the model simultaneously for periods with different climates, or check against a proxy climate;

  3. Validate model performance at multiple sites and for multiple variables to ensure internal consistency of the simulated processes;

  4. Validate whether the model can reproduce the hydrological indicator(s) of interest to be used for impact assessment;

  5. Validate whether any observed trends (or lack of trends) in discharge are adequately reproduced by the model.

The evaluation of data quality can be useful for the interpretation of calibration results; for example, some weaker results may be explained by poor data quality. The periods with contrasting climates for the DSS test could be, for example, sub-periods with (i) warmer and drier years and (ii) colder and wetter years than the average climate (see more details in Gelfan et al. and Huang et al. 2020). Global datasets on evapotranspiration, snow cover, etc. can be used for step 3. The validation in steps 4 and 5 can be performed for the entire historical period with data. Step 4 can be omitted if the indicators of interest are already covered in steps 2 or 3.
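As an illustration of how contrasting sub-periods for the DSS test might be selected, the following Python sketch ranks years by a simple warm-and-dry versus cold-and-wet score; this scoring rule is one possible choice, not the exact procedure used in the SI papers.

```python
import numpy as np

def contrasting_subperiods(years, t_annual, p_annual, n=10):
    """Select two climatically contrasting groups of years: (i) warmer and
    drier and (ii) colder and wetter than the average climate, based on
    standardized anomalies of mean annual temperature and precipitation."""
    years = np.asarray(years)
    t = np.asarray(t_annual, dtype=float)
    p = np.asarray(p_annual, dtype=float)
    # high score = warm and dry year, low score = cold and wet year
    score = (t - t.mean()) / t.std() - (p - p.mean()) / p.std()
    order = np.argsort(score)
    cold_wet_years = np.sort(years[order[:n]])
    warm_dry_years = np.sort(years[order[-n:]])
    return warm_dry_years, cold_wet_years
```

The model would then be calibrated on one group of years and validated on the other, and vice versa.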

For the global-scale models, Krysanova et al. (2018) did not propose calibration but suggested a simplified five-step model evaluation procedure, in which step 2 was replaced by testing model performance in the historical period and steps 3–5 were also relaxed. In addition, they suggested applying spatially dependent model performance criteria, assuming that gHMs cannot produce equally plausible results everywhere in the world.

3.3 Approaches applied in SI for model evaluation and analysis of projections

The applied approaches can be divided into three categories: regional-scale studies, a continental-scale assessment, and multi-basin studies with gHMs. They are briefly described as follows:

  A) Regional-scale studies

    • A simple calibration/validation approach (only for discharge at the basin outlet) and a comprehensive five-step model evaluation method were applied at the catchment scale for seven basins in five papers (Huang et al. 2020; Wen et al. 2020; Ismail et al. 2020; Mishra et al. 2020; Koch et al. 2020). In the first three papers, the models were also evaluated for periods with contrasting climates. After that, the models with the two different parametrizations were applied for climate impact assessment, and the projections were analyzed to determine whether the two parametrizations influence the signals of change in terms of the long-term mean annual and mean monthly flows and (in some cases) extremes. Differences below 5% were considered negligible, and differences above 5% (10%) were considered notable (moderate).

    • Three versions of two models were analyzed by Gelfan et al. (2020) for the Lena and Mackenzie basins: non-calibrated versions A with a priori parameters, versions B calibrated against daily streamflow at the basin outlets, and versions C calibrated against daily streamflow at multiple gauges. For that, a slightly modified form of the comprehensive evaluation procedure of Krysanova et al. (2018) was applied. The robustness of the models was evaluated for climatically contrasting periods, and the effects on future projections were compared.

    • Climate model sub-selection methods were assessed for the Upper Danube basin, based on streamflow simulated with COSERO, which was calibrated/validated using the five-step enhanced method and contrasting climate periods (Kiesel et al. 2020).

    • A new method of Hydrological Mass Balance Calibration, based on global datasets of climate, discharge, evapotranspiration, reservoirs, and irrigation, was applied for the Southern African region, and its influence on SWAT+ performance and climate projections was analyzed (Chawanda et al. 2020).

  B) A continental-scale study

    • Three model calibration approaches were applied for the pan-European domain using E-HYPE: calibration for major river basins (37 gauges, minimum size 5000 km², model version BM), regionalization through calibration at smaller tributary catchments (57 gauges, minimum size 1000 km², model version M00), and building an ensemble of ten model versions from M00 through parameter sampling (<MXX>). All model versions were applied for projecting climate change impacts, and the differences were analyzed for various indicators on an annual and seasonal basis (Hundecha et al. 2020).

  C) Multi-basin studies with the global-scale models

    • The performance of six global water models in the historical period was evaluated for 57 large river basins on six continents, considering the monthly and long-term mean monthly dynamics of discharge at the outlets, based on four common metrics summarized into an aggregated index (AI, ranging from 0 to 1, the higher the better; see the sketch after this list) (Krysanova et al. 2020). Next, a comparison of the projected impacts in terms of magnitude of change and spreads was performed for 12 selected river basins using (i) all models, applying the ensemble mean approach (EM), and (ii) only the satisfactorily performing models, applying weighting coefficients (WCO) estimated from the evaluation results.

    • A systematic performance evaluation of nine global water models was carried out for six major pan-Arctic watersheds by Gädeke et al. (2020), considering different hydrological indicators: monthly and long-term mean monthly discharge (at multiple gauges), high- and low-flow extremes (at the outlets), and snow water equivalent. For that, a similar aggregated performance index (API, in %, from 0 to 100), based on commonly used criteria, was applied.
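Since the exact metric sets and aggregation rules are documented in Krysanova et al. (2020) and Gädeke et al. (2020) rather than reproduced here, the following Python fragment is only a hypothetical illustration of how several metrics, each rescaled to the range 0–1, could be combined into an aggregated index:

```python
import numpy as np

def aggregated_index(metric_scores):
    """Combine per-metric scores, each already rescaled to [0, 1] with 1 being
    best, into one aggregated index in [0, 1]. A plain average is used here
    purely for illustration; the actual metric set and aggregation rule of the
    SI papers may differ."""
    scores = np.clip(np.asarray(metric_scores, dtype=float), 0.0, 1.0)
    return float(scores.mean())

# Example: four rescaled metric scores for one model in one basin
ai = aggregated_index([0.8, 0.6, 0.5, 0.7])  # -> 0.65; values above 0.5 are
                                             # treated as satisfactory or good
                                             # in Section 4.1.2
```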

4 Main findings

Here, we summarize the results of the comprehensive evaluation of the regional-scale models and of the evaluation of the continental- and global-scale models, as well as the effects of the model evaluation methods and results on future projections and uncertainty ranges.

4.1 Model evaluation

4.1.1 Enhanced evaluation of the regional models

It was possible to calibrate and validate the different catchment-scale models for the various basins using the enhanced five-step method, with satisfactory to good results in most cases (see Table 4 for a qualitative assessment of the results). The evaluation of the models ECOMAG and SWAP, calibrated at several gauges for the Lena and Mackenzie, was also successful. Only in a few cases were the results weak or poor: for the Yellow and Rhine modeled by SWAT, and for the headwater gauge Zhimenda of the Yangtze modeled by VIC and HBV-D.

Table 4 Performance of the regional-scale models in the case study basins for river discharge in the validation period based on the comprehensive evaluation method: qualitative assessment based on KGE, NSE, and PBIAS criteria
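For readers less familiar with the criteria named in Table 4, a minimal Python sketch of their standard formulations is given below; the SI papers may use slight variants (e.g., log-transformed flows for low-flow evaluation).

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is perfect; values <= 0 mean the model is
    no better than the mean of the observations."""
    obs, sim = np.asarray(obs, dtype=float), np.asarray(sim, dtype=float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def pbias(obs, sim):
    """Percent bias; with this sign convention, negative values indicate
    underestimation of total flow (as reported in Section 4.1.2)."""
    obs, sim = np.asarray(obs, dtype=float), np.asarray(sim, dtype=float)
    return 100.0 * np.sum(sim - obs) / np.sum(obs)

def kge(obs, sim):
    """Kling-Gupta efficiency: 1 is perfect."""
    obs, sim = np.asarray(obs, dtype=float), np.asarray(sim, dtype=float)
    r = np.corrcoef(obs, sim)[0, 1]          # linear correlation
    alpha = sim.std() / obs.std()            # variability ratio
    beta = sim.mean() / obs.mean()           # bias ratio
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
```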

4.1.2 Evaluation of the continental- and global-scale models

The performance of the benchmark model version BM, calibrated for the major river basins in the pan-European domain, was slightly worse than that of the other versions (M00 and <MXX>) calibrated for smaller catchments, in terms of both NSE and PBIAS (Hundecha et al. 2020). The median NSE values at the calibration stations are 0.39 and 0.59 for BM and M00, respectively. In terms of PBIAS, both versions underestimate the mean flow, with median values of −17% and −7% for BM and M00, respectively. The model performance in terms of NSE is similar in the validation period, with median NSE values of 0.43 and 0.58 for BM and M00, respectively.

The evaluation results of the six global models for the 57 river basins (Fig. 2) varied between models and basins (Krysanova et al. 2020). The performance averaged over the six gHMs was satisfactory or good (average AI > 0.5) in eight of the 57 basins, whereas 42 basins showed poor performance with an average AI ≤ 0.4. Table 5 gives an overview of the average performance of the six gHMs in terms of the aggregated index for the 57 basins. WaterGAP2 was the best-performing model (average AI of 0.67), followed by MATSIRO and PCR-GLOBWB with average AIs of 0.28 and 0.26, respectively. The remaining three models showed quite poor performance, with significant overestimation of discharge and of the amplitude of the seasonal dynamics, and with median NSE values for monthly discharge well below zero.

Table 5 An overview of performance of six global hydrological models in terms of the average aggregated index (AI) for 57 large river basins on six continents

The evaluation results of the nine gHMs over the six Arctic watersheds (Gädeke et al. 2020) were similarly weak: the average aggregated performance index API exceeded 50% for only one of the six basins (Kolyma) for the monthly and seasonal discharge, and for one basin (Ob) for the high and low flows. WaterGAP2 had the highest API (72%) averaged over all basins for the monthly/seasonal discharge, and MATSIRO had an average API > 50% for both the monthly/seasonal discharge and the extremes. An average API > 50% was reached in only two further cases: MPI-HM for the monthly/seasonal discharge and LPJmL for high flows. The remaining 21 of the 27 cases demonstrated weak or poor performance.

4.2 Influence of model evaluation methods/results on impacts and uncertainties at different scales

The influence of the model calibration/validation methods (simple and comprehensive) on the projected impacts and uncertainties was investigated in several papers by comparing the impacts and projection spreads. The effect of the model evaluation results on impacts and uncertainties was also analyzed for several differently calibrated model versions (including non-calibrated models in one study) in two papers: for the Lena and Mackenzie basins and for the pan-European domain. In addition, the influence of the model evaluation results on the projections simulated by gHMs was analyzed by applying and comparing the EM and WCO methods (see Section 3.3).

4.2.1 Regional-scale studies

The three river basins investigated in Huang et al. (2020) showed different sensitivities of the projections to the two model parametrizations. The comparison of results showed moderate to strong influences on the ensemble medians and means of discharge for the Upper Mississippi (differences up to 23%), minor to moderate effects for the Upper Yellow (differences up to 16%), and smaller effects for the Rhine (maximum 7%). For the Mississippi, the two calibration methods even led to contradictory signals of change (positive and negative) in terms of the mean/median. The shares of uncertainty related to HMs decreased for three hydrological indicators in all basins after the enhanced calibration, except for the high and low flows in the Rhine. However, when SWAT was excluded from the ensemble for this basin because of its poor performance, the shares of the HM uncertainty decreased in all cases. Thus, even a single poorly performing model can substantially increase the HM share of uncertainty and the total uncertainty of the projections.

In the Godavari basin, the influence of the calibration/validation methods on the projected mean annual discharge and high flows was minor for three gauge stations, including the outlet, and notable for one gauge, Bamini (43% of the total area): a difference of about 10% was found for the projected changes in mean annual flow, and 14% for high flows (Mishra et al. 2020). For high flow frequency, however, a considerable influence was noticed: differences of up to 35–40% at three gauges, including the outlet.

For the Upper Yangtze basin, the impacts based on the simple and comprehensive calibration methods were compared for three hydrological models (Wen et al. 2020). The simulated increases in mean annual discharge at the end of the twenty-first century, relative to the reference period, were approximately twice as large with the simple calibration as with the comprehensive method, with mean annual differences of 7–8% under RCPs 2.6 and 8.5. The same tendency was found for high flow. For low flow, the two methods produced changes in different directions under RCPs 2.6 and 4.5, with differences of up to 15%.

The study for the Upper Indus basin with two HMs showed notable differences (8–10%) between the two methods in impacts at the annual scale for RCPs 2.6 and 8.5 in the mid-century and far-future periods (Ismail et al. 2020). The median changes based on the two methods differed in sign in all periods under RCP2.6 and in the near future under RCP8.5. At the monthly scale, the largest differences were found in March (−17%) and October–November (18–19%) in the far future under RCP8.5. The uncertainty contribution from HMs based on the enhanced method was larger in the near future, but in the far future it became negligible and smaller than that based on the conventional method.

The differences in impacts and uncertainties were analyzed for the Lena and Mackenzie river basins (Gelfan et al. 2020), based on simulations with three versions of two models (see Section 3.3). For the Lena, both models simulated an increase in mean annual discharge of 43% on average at the end of this century with the non-calibrated versions A, whereas the calibrated versions B and C gave increases of 24% and 19%, respectively. The corresponding uncertainty spreads were 125%, 63%, and 37% for A, B, and C, respectively. Similar differences were found for the Mackenzie. The A projections differed quite significantly from the B and C projections in both basins, by 10–22%, whereas the projections based on B and C diverged by only 5–6%.

For the Pajeú catchment in a semi-arid area in Brazil, differences between the projected changes in the long-term mean annual/monthly discharges, averaged over all GCMs based on two differently calibrated model versions, were analyzed for the near future 2021–2050 (Koch et al. 2020). The differences were rather low under RCP8.5 and slightly higher (up to 8%) under RCP2.6. The analysis of projections based on two model versions for the Serrinha II Reservoir revealed notable differences between the projected changes in maximum mean monthly volume, up to 9–10% under both RCPs, and in mean reservoir discharge, 10% under RCP2.6.

4.2.2 Continental-scale study

In the study for the pan-European domain, the impacts simulated by the benchmark model BM differed distinctly from those of the model versions M00 and <MXX> for different indicators, whereas the latter two were similar to each other (Hundecha et al. 2020). The median changes projected for mean annual discharge by BM over the whole domain were close to zero but biased towards negative changes, whereas M00 projected a moderate increase (median value of 8%) and a wetter pattern. The projected changes in soil moisture had similarly shifted distributions, with a major portion of the distribution being negative for BM, and with the 95th percentile change being positive and about 10% higher for M00. Regarding seasonal changes, all model projections showed a strong increase in discharge in winter: the median increases were 30% for BM and 39% for M00. In summer, the median increase in aridity projected by BM was nearly twice that of M00 (23% vs 12%). The absolute differences between the projections of the M00 and BM versions ranged between 0 and 55% for mean annual discharge.

4.2.3 Global-scale models: comparison of two approaches

The climate change impacts simulated by the gHMs were compared for 12 of the 18 selected river basins using the EM and WCO approaches (see Section 3.3) (Krysanova et al. 2020), where the second approach was based on the model evaluation results. The following results were obtained: (a) impact assessment with WCOs was not possible in six basins (~33%) due to the poor performance of all models, (b) the comparison of impacts showed small or negligible differences in four basins (~22%), and (c) the differences in mean monthly discharge, mean annual discharge, or both were moderate to large in eight basins (~44%). A comparison of projection spreads was done for the 12 basins, considering the 25th to 75th percentiles and the full uncertainty ranges. The spreads were of similar size in four basins; they decreased slightly or moderately (by 15–50%) with the WCO method in five basins, and decreased significantly (by 51–82%) in three basins.
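The logic of the EM/WCO comparison can be sketched in a few lines of Python; the AI-proportional weights and the 0.5 acceptance threshold used below are illustrative assumptions, not the exact scheme of Krysanova et al. (2020).

```python
import numpy as np

def ensemble_change_signals(changes, ai_scores, ai_threshold=0.5):
    """Compare the unweighted ensemble mean (EM) of projected changes with a
    performance-weighted estimate (WCO) that keeps only satisfactorily
    performing models. `changes` holds one projected change per gHM (e.g., %
    change in mean annual discharge); `ai_scores` holds the corresponding
    aggregated-index values."""
    changes = np.asarray(changes, dtype=float)
    ai_scores = np.asarray(ai_scores, dtype=float)

    em = float(changes.mean())                   # EM: all models, equal weight

    keep = ai_scores > ai_threshold              # drop poorly performing models
    if not keep.any():
        return em, None                          # WCO not applicable in this basin
    w = ai_scores[keep] / ai_scores[keep].sum()  # weights from evaluation results
    wco = float(np.sum(w * changes[keep]))
    return em, wco

# Hypothetical basin: six gHMs with projected changes (%) and AI scores
em, wco = ensemble_change_signals([25, 40, -5, 15, 60, 10],
                                  [0.67, 0.55, 0.20, 0.30, 0.15, 0.60])
```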

5 Summary and conclusions

5.1 Model evaluation results

The regional-scale models were successfully calibrated and validated for various basins using the comprehensive five-step method, with satisfactory to good results in most cases. The performance of the benchmark model version of E-HYPE, calibrated for 37 major river basins of the pan-European domain, was only slightly worse than that of the model version calibrated for 57 tributary catchments.

The evaluation of the six gHMs for 57 river basins showed satisfactory to good results for WaterGAP2 (acceptable index values in 75% of the basins), weaker results for MATSIRO and PCR-GLOBWB (23–26%), and rather poor results for the other three models (acceptable index values in only 11–16% of the basins). The performance averaged over the models was good or satisfactory in only eight of the 57 river basins (14%). Similar results were obtained in the paper evaluating global water models in the Arctic basins, where WaterGAP2 and MATSIRO showed better results than the other models. Overall, the majority of the global models had considerable difficulties in realistically representing the observed hydrological processes in many basins.

5.2 Influence of model evaluation methods and results on impacts and uncertainties

The influence of the model evaluation methods on the projected impacts and uncertainties was investigated in a number of catchments by comparing the impacts and projection spreads. In most cases, notable to moderate differences in the projected impacts were found, and in some cases the differences were stronger. In some basins, the application of models with two different parametrizations even led to changes in opposite directions in the future periods. The studies that analyzed projection spreads concluded that, after comprehensive calibration/validation, models tend to produce smaller spreads related to HMs.

The influence of the results of the comprehensive evaluation of HMs on impacts and uncertainties was analyzed by applying two models in two large river basins, using three differently parametrized versions of each model. The simulated impacts and uncertainty spreads differed essentially between the model versions that successfully passed the evaluation test and the versions that failed it. This allowed the authors to conclude (Gelfan et al. 2020) that a successful comprehensive evaluation of model versions increases confidence in their projections in comparison with the projections of model versions that did not pass the test.

The influence of the model evaluation results on impacts and uncertainties was also analyzed for the gHMs by applying the traditional ensemble mean method and the approach with weighting coefficients. In most cases, the application of the WCO approach, which is based on model evaluation and uses only models with satisfactory or good performance, resulted in different projections with reduced spreads compared with the projections obtained using the EM approach, which is based on all models regardless of their performance. However, the impacts and projection spreads were quite similar for the basins in which most gHMs showed acceptable or good performance.

5.3 Updated guidelines for calibration/validation

Based on the experience gained from the practical application of the five evaluation steps in seven papers of this SI, the steps have been further developed and can be slightly reformulated as follows:

  1. Evaluate the quality of observational data and take it into account during the model evaluation for the interpretation of results;

  2. Apply a differential split-sample test (or a modification of it) for discharge at multiple gauges to calibrate/validate the model simultaneously for periods with contrasting climates;

  3. Validate model performance for 1–2 additional variables (e.g., evapotranspiration, snow) to ensure internal consistency of the simulated processes; if not successful, return to step 2;

  4. Validate whether the model can reproduce the hydrological indicator(s) of interest for impact assessment (e.g., extremes); if not successful, return to step 2;

  5. Check whether the observed trends (or lack of trends) are reproduced by the model.

The main modifications are the strengthening of step 2 and the addition of the possibility of iteration if validation at step 3 or 4 is not successful.
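As a minimal illustration of step 5 (not a prescription from the SI papers), the linear trend slopes of observed and simulated annual discharge can be compared; in practice, a formal trend test such as Mann-Kendall and a significance check would complement this.

```python
import numpy as np

def trend_slopes(years, q_obs, q_sim):
    """Crude illustration of step 5: return the linear trend slopes of observed
    and simulated mean annual discharge (units of discharge per year), so that
    the modeller can judge whether the sign and rough magnitude of the observed
    trend (or its absence) are reproduced."""
    years = np.asarray(years, dtype=float)
    slope_obs = np.polyfit(years, np.asarray(q_obs, dtype=float), 1)[0]
    slope_sim = np.polyfit(years, np.asarray(q_sim, dtype=float), 1)[0]
    return slope_obs, slope_sim
```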

5.4 Grappling with uncertainties

Dankers and Kundzewicz (2020) reviewed the sources of uncertainty in recent projections of climate change impacts on water resources. Since, in addition to GCM uncertainty, the structural uncertainty of impact models can be a significant component of the overall uncertainty, studies based on a single hydrological model may well be overconfident and may not adequately sample the uncertainty range, even if they use multiple driving climate models. It may be possible to reduce the spread in multi-model ensemble results by down-weighting or eliminating models that are unable to mimic observed components of the water cycle in a particular catchment. Since the uncertainty in applications of large-scale models is greater at the smaller scale of a river basin, this poses a challenge for local adaptation decisions, because greater uncertainty may require costlier protective measures. However, the paper also demonstrated that large uncertainties at the local scale do not preclude more robust projections at the global scale (see also Hattermann et al. 2018).

5.5 Conclusions

The comprehensive five-step evaluation of the catchment-scale hydrological models was compared with the commonly used simplified calibration/validation approach in terms of the simulated impacts. It was shown that (a) in most cases, the impact results for annual and monthly means differ notably between the simulations based on the two approaches, and (b) the uncertainties of projections related to hydrological models are usually reduced after the enhanced model evaluation. As the models are more robust after a successful enhanced evaluation, and as the projections based on them differ from those simulated by models after the simple calibration/validation, we conclude that the models after the successful enhanced evaluation are more reliable and the impacts based on them are more trustworthy.

The evaluation of the global hydrological models and their application for impact assessment using the EM and WCO approaches showed that using only models with satisfactory or good performance on historical data and weighting them based on the model evaluation results is a more reliable approach for impact assessment than the ensemble mean approach. The results obtained allow us to conclude that, in most cases, the WCO method provides impact results with higher credibility and reduced spreads in comparison with the EM approach. Regarding further gHM applications for climate impact assessment, we recommend the following: (a) model evaluation should always be done in advance of impact assessment with gHMs; (b) improvement of gHM performance, including through calibration, is necessary in order to include more suitable models in ensembles for projecting impacts; and (c) the inclusion of region-specific processes (e.g., permafrost in the Arctic) in gHMs is also necessary to make impact results more trustworthy.

The value of this SI for science and for the stakeholder community lies in the following:

  (i) As the comprehensive model evaluation includes special tests of model robustness under changed climate conditions, it improves the credibility of the simulated impacts for stakeholders, and the reduced spreads of projections related to hydrological models make them more distinct.

  (ii) The methodology of comprehensive evaluation of hydrological models, developed earlier and thoroughly tested in the papers of this Special Issue, could be useful for a broader scientific community doing research in sectors beyond hydrology and water resources, where climate change impact assessment is relevant.

  (iii) This Special Issue contributes to the improvement of the credibility and the reduction of the uncertainty of climate impact projections that are needed for the development of adaptation strategies to climate change.