1 Introduction

Ongoing climate change will affect virtually all sectors of our society in multiple, complex, ways. Assessment of these expected impacts has been an integral part of the Assessment Reports of the Intergovernmental Panel on Climate Change (IPCC) since the beginning (IPCC 1990). Increasingly, these impact projections have been based on impact model (IM) simulations. Ranging from simple statistical models to sophisticated process-based models that can be used across multiple sectors, IMs are typically driven by the meteorological output of global or regional climate models and effectively translate the projection of a future climate into a projection of an impact.

Of course, running an additional impact model will often, although not necessarily always, add to the overall uncertainty that is inherent to climate change projections (Zscheischler et al. 2018; Kundzewicz et al. 2018). Over the past decade or so, several impact model intercomparison projects (MIPs) have been initiated in an effort to explore some of this impact model uncertainty, and enable coordinated assessments of climate impacts within and across sectors. Examples include the Integrated Project Water and Global Change, WATCH (Harding et al. 2011), the Agricultural Model Intercomparison and Improvement Project (AgMIP), and the Inter-Sectoral Impact Model Intercomparison Project, ISIMIP (Warszawski et al. 2014). The analysis of the outcomes of these MIPs has raised new questions about how to interpret the results of these multi-model experiments. As we will see, the focus has increasingly been on model evaluation, with the implicit or explicit assumption that better IM performance over a past period, as measured by the comparison of simulation and observation, yields greater trust in its projections for the future.

Naturally, model evaluation is a key step in the development and application of any model of the environment. However, the assumption that a model can somehow be “validated” by comparing its predictions with observations has also been challenged (see e.g., Oreskes et al. 1994; Parker 2013). The reasons include the complexity of natural systems, our incomplete understanding of them, and the poorly defined and uncertain observation data we use. Within the context of climate change, a specific problem is that future climate conditions can be very different from the historical climate. For example, a simple statistical model might accurately reproduce historical streamflow, but it would be unwise to apply the same model to a very different future climate while claiming it is trustworthy because it compares well with past observations.

In this paper we review how multi-model ensembles are used to produce physical climate impact projections, focusing on global-scale hydrological models in particular. How water resources and streamflow characteristics will change in the future is arguably one of the most pertinent questions in assessing the impacts of climate change.

2 Types and sources of uncertainty

In this section we will briefly review the sources of uncertainty in climate impact assessments. A prime source of uncertainty—literally a lack of certainty, or precise knowledge—stems from the use of models. Since we cannot examine the behaviour of a catchment under future conditions in a laboratory, climate impact projections are by necessity model-based, and models are, inevitably, subject to considerable uncertainty (Oreskes et al. 1994). It is common to subdivide uncertainty within the modelling process into (1) uncertainty about the model structure or, in other words, about how to represent the physics of the system; (2) uncertainty about the input data and model parameter values, which extends to the data used for model calibration and evaluation; and (3) the residual unpredictability of events for given models and parameters.

The first two sources can be taken together as “epistemic” uncertainty (Beven 2016), after the Greek word for “knowledge”, which arises from the fact that models are an abstraction of reality. Some of the key processes are still not well understood or represented. Limits to computing power and data availability (e.g., on soil properties) mean that some processes need to be represented in a simplified way, or are missing altogether; for example, groundwater is missing in many models or is parameterized very simplistically. Often it is not clear which representation, and which parameter values, would be most suitable for the application at hand.

The third source of uncertainty is sometimes called “aleatory” uncertainty (from the Latin word for dice) and arises from natural variability or randomness that essentially cannot be reduced. Although the distinction between epistemic and aleatory uncertainty is not always easy to make in practice, it is still a useful framework to keep in mind. In particular, the existence of epistemic uncertainties implies that model errors may not always follow a simple statistical distribution (Beven 2013).

For climate impact assessments, we also need to consider the entire modelling chain, from socio-economic emission scenarios to climate models, including climate model downscaling and bias correction, to impact models, impact assessments, and adaptation decisions. Each component in this “cascade” will have its own associated uncertainties (Beven et al. 2018). Kundzewicz et al. (2018) comprehensively discuss the sources of uncertainty in the emission – climate change – impact chain. Some of these uncertainties might be reducible, that is, by adding new information to the process the range of possible outcomes could be constrained (see Fig. 1).

Fig. 1 A general framework for reducing uncertainty in the assessment of climate change impacts on water resources (modified from Kundzewicz et al. 2018)

Climate projections are uncertain first of all because we are uncertain about future levels of greenhouse gas emissions and concentrations. Projections of long-term climate change are therefore based on a set of assumptions, or scenarios, about how these factors will develop, such as the Representative Concentration Pathways (RCPs) (Moss et al. 2010). Strictly speaking, the RCPs should not be interpreted as forecasts and no likelihood or preference is attached to individual scenarios (van Vuuren et al. 2011). Instead, multiple plausible futures need to be considered (see Maier et al. 2016). Climate projections, and by extension projections of impacts, are therefore not predictions in the classical sense, but rather scenario studies.

In the more immediate future, climate modelling uncertainties dominate over emission uncertainties, because near-term climate is strongly conditioned by past emissions (committed warming) (see, e.g., Hawkins and Sutton 2009, 2011). At decadal timescales, climate predictions are also subject to initial condition uncertainty, where small errors in the initial state of the model can grow into marked differences in the development of the climate system (Suckling 2018). This source of uncertainty is particularly relevant in weather forecasting and seasonal to decadal climate prediction, but it also affects longer-term climate projections, especially when looking at climate variability and extremes.

Climate impact studies often select a subset of the available set of global climate models (GCMs) to provide the meteorological input for impact models. Especially at smaller scales, GCMs can exhibit significant biases, and the application of downscaling techniques (either empirical-statistical or dynamic) and/or bias correction techniques is often required. Although the aim is often to reduce biases by bringing climate model output closer to observations, these techniques are, in effect, models in their own right, based on assumptions that may or may not hold under future climate change. Some have therefore argued that these techniques hide rather than reduce the uncertainty (Ehret et al. 2012).

For impact modellers it may seem natural to assume that uncertainty from climate models dwarfs uncertainty due to impact models, but a number of studies have suggested that impact model uncertainty can also be significant (e.g., Haddeland et al. 2011; Vetter et al. 2017; Beck et al. 2017). However, few studies have made a thorough end-to-end assessment of the uncertainties involved, as was done for the entire flood risk chain by Metin et al. (2018).

Several recent papers discuss the issue of “deep uncertainty” in environmental modelling and risk assessment (Spiegelhalter and Riesch 2011; Beven et al. 2014, 2018). This issue resembles the concept of “vague uncertainties” of Budescu and Wallsten (1987) and similar concepts dating back to the 1920s (Knight 1921), and extends the notion of epistemic uncertainty in the model itself to include problems such as unknown inadequacies in the modelling process, and possible disagreements about the framing of the problem. An implication of deep uncertainty is that we may never be able to fully describe or “quantify” the uncertainty inherent to future climate change and its impacts.

3 Evaluation of multi-model ensembles

These days, multi-model ensembles have become the primary route to explore the uncertainty in projections of climate change and its impacts. The impact MIPs that have been initiated over the last decade (Harding et al. 2011; Warszawski et al. 2014) followed the example of similar initiatives in the climate modelling community, and many of the questions that surround the interpretation of climate model ensembles (see Parker 2013) also apply to ensembles of impact models. Can we use them to infer probabilities of future climate impacts? Are robust findings especially trustworthy? As decisions on climate-change mitigation and adaptation are potentially influenced by the outcomes of these ensemble studies, answering questions such as these is not purely an academic exercise (Parker 2013).

A common finding is that different models, whether hydrological models or land surface schemes, lead to different results, in other words, impact model uncertainty is a significant component in the overall uncertainty (e.g., Haddeland et al. 2011). Although this result mirrors what has been found in the climate modelling community, it may still come as a surprise to some, considering that hydrological models are less complex than climate models.

3.1 Sources of bias

Disentangling what is causing these differences between models has proved a greater challenge. Key uncertain processes appear to be evapotranspiration (ET) and snow accumulation and melt. Haddeland et al. (2011) noted that Land Surface Models (LSMs) generally simulated lower snow accumulation and melt than Global Hydrological Models (GHMs). While GHMs typically use a conceptual degree-day approach to simulate snow accumulation and melt, LSMs normally include a more complex energy balance scheme that also simulates snow sublimation, which explains some of the differences. Similarly, Beck et al. (2017) found that in regions dominated by snow GHMs performed better than LSMs, which they ascribed to more data-demanding snow routines or misrepresentation of frozen soil and snowmelt processes by the LSMs.
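
To make the contrast concrete, the conceptual degree-day approach can be sketched in a few lines of Python. The formulation and parameter values below are generic illustrations rather than the scheme of any particular GHM; an energy balance scheme would, in addition, track radiation and turbulent fluxes and simulate sublimation.

```python
import numpy as np

def degree_day_snow(precip, temp, ddf=3.0, t_snow=0.0, t_melt=0.0):
    """Minimal conceptual degree-day snow routine (illustrative parameters).

    precip : daily precipitation (mm/day)
    temp   : daily mean air temperature (degC)
    ddf    : degree-day factor (mm/degC/day)
    t_snow : below this temperature, precipitation accumulates as snow
    t_melt : above this temperature, the snowpack melts
    Note: unlike an energy balance scheme, sublimation is not represented.
    """
    swe = 0.0                                          # snow water equivalent (mm)
    melt_series, swe_series = [], []
    for p, t in zip(precip, temp):
        if t <= t_snow:
            swe += p                                   # accumulate snowfall
        melt = min(swe, max(0.0, ddf * (t - t_melt)))  # melt limited by storage
        swe -= melt
        melt_series.append(melt)
        swe_series.append(swe)
    return np.array(melt_series), np.array(swe_series)
```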

Haddeland et al. (2011) noted that models calculating evapotranspiration based only on temperature yielded different results than those that included radiation and humidity, and that differences tended to be smaller in wet climates than in dry climates. As a consequence, while the largest absolute differences in simulated runoff were found in the tropics, the largest relative differences occurred in arid areas, with models generally overpredicting runoff in arid and semiarid basins. Guo et al. (2017) found that the scheme used to convert potential to actual evapotranspiration can also have a major impact on the results, yielding a more than sevenfold difference in estimated runoff sensitivity.

An overestimation of runoff in dry basins was also found by Zaherpour et al. (2018). Most of the models overestimated the mean annual runoff and all indicators of upper and lower extreme runoff, in particular the low flow indicators. Capturing the seasonal dynamics of streamflow also proved difficult, with models struggling to get the timing right particularly in northern basins, while in southern areas the magnitude of the seasonal cycle was often more problematic (Zaherpour et al. 2018).

Few studies have mentioned the representation of soil moisture and groundwater dynamics as a source of model biases. Both are highly uncertain components of the overall water balance even from an observational perspective, and even the more physically based models usually simulate groundwater in a very simplified manner unlikely to resemble the actual processes. Improving groundwater processes in models, for example by assimilating satellite data, could improve hydrologic simulations (Lo et al. 2010; Koirala et al. 2014).

A major obstacle to understanding which schemes for evapotranspiration, snow, or soil moisture perform better under what conditions, is a lack of suitable observation datasets to evaluate these processes separately. Most studies evaluate the model performance with observations of river discharge, and infer deficiencies in the model. However, biases in the simulated flow may be caused by a number of factors, including biases in the meteorological input data (see e.g., Haddeland et al. 2012; Müller Schmied et al. 2016) or a lack of understanding of the soil properties, and even in simple models these causes may interact in complex ways. Since discharge is an integrated measure of processes over the entire basin, some of these biases may counter each other, leading to plausible results. Good performance in simulating discharge at the catchment outlet therefore does not guarantee that all processes in the basin have been represented realistically.

Few studies have attempted to evaluate large-scale hydrological models more broadly (see overviews in Zaherpour et al. (2018) and in Krysanova et al., in this SI). Zhang et al. (2016) used observations of evapotranspiration from around the world, as well as observations of streamflow, to evaluate two different models. They found that the ET simulated by the models compared better with the observations than runoff, with runoff biases typically, but not always, being the opposite of biases in ET. An important caveat, though, is the limited number of years of ET data that was available, often not located in the same catchment as the streamflow observations.

So while there is potential for different types of observations to be used in model evaluation, limitations in the spatial and/or temporal coverage, measurement uncertainties, and potentially even conceptual differences between the variables in the model and the processes and properties that can be observed in the real world remain an issue. To overcome some of these problems, Beck et al. (2015) produced a global dataset of observation-based estimates of hydrological streamflow characteristics. Although still based on empirical models, such estimates may provide a useful benchmark to evaluate the performance of GHMs. Nevertheless, we need to keep in mind that all observational data are uncertain (Beven et al. 2019) and it is therefore essential that model evaluation is undertaken within an uncertainty analysis framework (Lane et al. 2019).

Different types of observations, and in particular satellite observations of quantities like snow cover, land surface temperature, and leaf area index, could also be used in multi-objective model calibration and parameterization (Zhang et al. 2016). Studies have shown that multi-objective calibration against multiple data sources can improve the model performance in simulating processes such as snow accumulation and melt, as well as streamflow (Crow et al. 2003; Udnæs et al. 2007; Parajka and Blöschl 2008; Zhang et al. 2009). However, such an approach is not usually adopted in GHM development and application, and in fact, GHMs are not usually calibrated even to discharge at the catchment outlet (see Gaedeke et al. and Krysanova et al. in this SI).
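
A multi-objective calibration of this kind can, for example, combine goodness-of-fit measures on several observed variables into a single objective (or trace out a Pareto front). The sketch below uses an illustrative weighted combination of streamflow and snow-cover skill; the weights, the metric, and the `run_model` placeholder are assumptions, not taken from the studies cited above.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is perfect, 0 matches the observed mean."""
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - np.mean(obs)) ** 2)

def multi_objective(params, run_model, obs_q, obs_snow, w_q=0.7, w_snow=0.3):
    """Composite calibration objective (to be minimized): a weighted sum of
    streamflow and snow-cover skill. `run_model(params)` is a hypothetical
    placeholder that returns simulated discharge and snow cover; the weights
    are illustrative and would in practice be a modelling choice.
    """
    sim_q, sim_snow = run_model(params)
    return -(w_q * nse(obs_q, sim_q) + w_snow * nse(obs_snow, sim_snow))
```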

3.2 Influence of model type and calibration

When compared with observed river discharge, hydrological models sometimes show smaller biases than LSMs (e.g., Beck et al. 2017). To some extent this is expected, as hydrological models broadly solve only the water balance, while LSMs aim to close both the water and the energy budget of the land surface. Haddeland et al. (2011) noted that both the mean and median runoff fractions for the LSMs were lower than those of the GHMs, although the range was wider.

To understand these differences, it is worth keeping the aims in mind with which these models have been developed. Many hydrological models were developed with a focus on predictive skill, and thus tend to be very parsimonious and conceptual. Typically, they are highly abstracted and contain a small number of parameters that can easily be calibrated with observations, and therefore tend to outperform more physically based models when using traditional evaluation methods.

In contrast, LSMs have been built based on an understanding of the main processes and to explore the interactions between processes. Although LSMs represent these processes on a physical basis, they often cannot outperform GHMs, as including more and more complex processes also implies larger uncertainties, especially when only limited data are available to constrain those processes in the models. For instance, LSMs use sophisticated energy balance approaches to model ET, but only limited observations exist to evaluate these approaches, let alone calibrate the parameters involved.

Nevertheless, GHMs do not always outperform LSMs: Zaherpour et al. (2018) and Krysanova et al. (this SI) include at least one LSM with smaller biases than several GHMs. Beck et al. (2017) found that the LSMs performed similarly to (uncalibrated) GHMs in rainfall-dominated regions, while in snow-dominated regions the GHMs performed consistently better. Similarly Zhang et al. (2016) found that both LSMs and GHMs can simulate monthly and interannual variability and trends in streamflow reasonably well, even if they cannot adequately reproduce the long-term volumes. They concluded that both types of model can be used for comparative regional and global water balance assessments and projections of future trajectories.

Many studies have found that models that have been calibrated (usually with streamflow data) perform better when compared with river discharge observations (e.g., Beck et al. 2017). Krysanova et al. (2018) noted several problems related to the use of uncalibrated GHMs, including poor performance in many basins and a high spread in climate impact projections, sometimes leading to conflicting results.

Although observations of runoff are not available everywhere around the globe, some GHMs have successfully been calibrated. For example, the WaterGAP model was tuned to long-term average discharge at over a thousand gauging stations (Müller Schmied et al. 2014). Methods for regionalizing model parameters exist but may need to be improved or applied more consistently (Beck et al. 2017). Calibration is easier for catchment-scale models, although even these are typically calibrated at the outlet point only, and good performance there does not guarantee unbiased simulations throughout the catchment. Since calibration may correct for biases in the input data as well as in the model, the better performance may also be restricted to a particular meteorological dataset. However, the sensitivity of a particular model parameterisation to changing input datasets is not commonly assessed.

Hattermann et al. (2017) compared hydrological projections from nine global and nine regional hydrological models with an emphasis on model validation, looking at sensitivity of annual discharge to climate variability and of seasonal dynamics to climate change. The mostly uncalibrated GHMs showed a considerable bias in the long-term average monthly discharge, although they did in many cases reproduce the intra-annual variability well. In contrast, the regional models, tuned to the specific catchments, were better able to reproduce streamflow conditions in the reference period.

Perhaps surprisingly, Hattermann et al. (2017) found that the sensitivity of both types of models (evaluated for their respective ensembles) was quite similar in most basins. They concluded that the GHMs can be useful tools when looking at large-scale impacts of climate variability and change. For local applications, the regional-scale models should be preferred.

3.3 Stationarity of model parameters

A key concern in the application of calibrated models should be the stability, or stationarity, of model parameters. For example, Merz et al. (2011) found that the optimal values of calibrated parameters changed considerably with time. Assuming time-invariant parameters led to significant biases in their simulations, with errors increasing with the time lag between the simulation and calibration periods. A similar result was found by Li et al. (2012), who also noted that some model parameters were significantly more sensitive to the choice of calibration period than others. The use of calibrated models may thus result in better performance against historical observations, and hence higher trust in the model, but it does not necessarily imply reduced uncertainty in the future projections. The assumption of parameter stationarity may introduce additional uncertainty into the simulated response to climate change that is not normally explored (for example, through sensitivity analyses).

Several approaches have been proposed to address the issue of non-stationary parameters and the related problems of miscalibration and overcalibration (see also Andréassian et al. 2012). Li et al. (2012) used a Monte Carlo approach to explore the uncertainty and possible equifinality in hydrological model parameters. They also recommended calibrating a model on wetter periods of the historical record if it is being used to simulate wet climate scenarios, and vice versa for drier scenarios. Similarly, Coron et al. (2012) recommended testing model robustness and proposed a generalized split-sample test to provide insights into a model’s transposability over time under various conditions. Krysanova et al. (2018) suggested evaluating models using a proxy climate test.
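
As an illustration of the generalized split-sample idea, the sketch below calibrates on the wetter half of a record and evaluates on the drier half, and vice versa. The `calibrate` and `score` functions are hypothetical placeholders for a real calibration routine and performance metric (e.g., NSE); ranking years by annual precipitation is just one way to define climatologically contrasting periods.

```python
import numpy as np

def differential_split_sample(years, annual_precip, calibrate, score):
    """Minimal differential split-sample test in the spirit of Coron et al.
    (2012): calibrate on the wetter half of the record, evaluate on the
    drier half, then swap roles.

    years         : numpy array of calendar years
    annual_precip : numpy array of annual precipitation totals per year
    calibrate(years) -> params and score(params, years) -> skill are
    placeholders for a real model-calibration workflow.
    """
    order = np.argsort(annual_precip)
    dry = years[order[: len(years) // 2]]    # climatologically drier years
    wet = years[order[len(years) // 2:]]     # climatologically wetter years
    results = {}
    for label, (cal, val) in {"wet->dry": (wet, dry),
                              "dry->wet": (dry, wet)}.items():
        params = calibrate(cal)              # fit parameters under one climate
        results[label] = score(params, val)  # test transferability under the other
    return results
```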

Westra et al. (2014) proposed a strategy for diagnosing and interpreting hydrological nonstationarity, consisting of investigating potential systematic errors in the calibration data, exploring time-varying model parameters, and trialling alternative model structures. They suggested that time-varying parameters could be a diagnostic for model misspecification: in other words, deficiencies in model structure are likely to express themselves as differences in the estimated parameters when calibrated to climatologically different periods. Wallner and Haberlandt (2015) also investigated the impact of nonstationarity on model performance for different flow indices and time scales, and showed that non-stationary parameters can improve the performance with an acceptable growth in parameter uncertainty. Like Li et al. (2012), they also found that some model parameters are highly correlated with some climate indices.

For other parameters, for example those relating to the groundwater stores, the assumption of time invariance may hold better, but without further exploration this remains an assumption and should be recorded as such.

Singh et al. (2011) proposed a trading-space-for-time framework that exploits the similarity between prediction under change and prediction in ungauged basins. They noted that the trading-space-for-time approach resulted in a stronger watershed response to climate change for both high and low flow conditions, compared with simulations based on historically calibrated parameters.

However, Stephens et al. (2020) warn against the use of historical periods as proxies for future climate conditions, as levels of carbon dioxide were lower than those expected in the future. Long-term changes in the ecohydrological functioning of a catchment need to be considered, as relatively brief periods in the past that were wetter or drier than average are unlikely to provide good guidance on what will happen under persistent changes in the future. They conclude that many studies likely underestimate the potential for nonstationarity in hydrologic assessments, especially in the case of drier future conditions.

3.4 Parameter uncertainty

In this context, it is noteworthy that relatively few studies have examined the effects of uncertainty in model parameters on climate impact projections, even though techniques to estimate this uncertainty have been around for more than two decades (e.g., Beven and Binley 1992). Similar efforts in climate modelling are now well-established through the application of “perturbed-parameter” ensembles (Murphy et al. 2007; Frame et al. 2009), where a single GCM is run multiple times with different values for some of the key parameters. Due to computational limits, a formal sampling of the full parameter space is out of reach for state-of-the-art, complex earth system models. In practice, the key parameters and parameter values are chosen from ranges considered plausible on the basis of expert judgement. Statistical methods have also been used to estimate the set of projections that would be produced if more comprehensive sampling of parameter uncertainty in the model could be performed (see, e.g., Sexton et al. 2012).
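
In a hydrological context, a perturbed-parameter ensemble could be set up along the following lines. The parameter names, ranges, and the `run_model` function below are hypothetical placeholders, standing in for a real impact model and expert-judged plausible ranges.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Plausible ranges from expert judgement; the names and values are purely
# illustrative, not those of any specific hydrological model.
PARAM_RANGES = {
    "ddf": (1.0, 8.0),               # degree-day factor (mm/degC/day)
    "soil_capacity": (50.0, 400.0),  # soil moisture storage capacity (mm)
    "baseflow_k": (0.01, 0.2),       # linear groundwater reservoir coefficient (1/day)
}

def sample_parameter_sets(n):
    """Draw n parameter sets, sampling uniformly within each plausible range."""
    return [{name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}
            for _ in range(n)]

# Hypothetical usage: run the (placeholder) impact model for every sampled set;
# the spread of `ensemble` then characterizes parameter uncertainty for this
# single model structure.
# ensemble = [run_model(forcing, **params) for params in sample_parameter_sets(100)]
```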

A common finding from these studies is that the uncertainty range in perturbed-parameter ensembles overlaps with that of multi-model ensembles, and Beck et al. (2017) speculate that the same may also be true for hydrological models. When properly designed, such multi-parameterization ensembles may allow a more probabilistic analysis of the results, including the adoption of probabilistic verification techniques that have been widely used in ensemble weather prediction and hydrological forecasting (see, e.g., Franz and Hogue 2011) and that could also be used in the evaluation of climate impact models.
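
As one example of such a verification technique, the Brier score for a threshold-exceedance event can be computed directly from an ensemble, as in the minimal sketch below (a standard definition, shown here for illustration only).

```python
import numpy as np

def brier_score(ensemble, obs, threshold):
    """Brier score for the event 'value exceeds threshold' (0 is perfect).

    ensemble : array of shape (members, time), simulated values
    obs      : array of shape (time,), observed values
    """
    p_forecast = np.mean(ensemble > threshold, axis=0)  # ensemble-based probability
    outcome = (obs > threshold).astype(float)           # 1 if the event occurred
    return np.mean((p_forecast - outcome) ** 2)
```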

3.5 Summarizing multi-model ensemble results

A number of recent studies have focused on summarizing the results across the ensemble of multiple models, rather than analysing the differences between them. In their analysis of an ensemble of six global-scale hydrological models, Zaherpour et al. (2018) found that, contrary to expectations, the ensemble mean failed to perform better than any individual model. Similarly, Beck et al. (2017), evaluating 10 state-of-the-art macro-scale hydrological models, found that the multi-model ensemble mean generally did not perform better than the best-performing model or models in the ensemble. These findings differ somewhat from studies of multi-model ensembles in weather and climate modelling, where the ensemble mean is often found to outperform any individual model (e.g., Tebaldi and Knutti 2007; Sanderson and Knutti 2012). More in line with these other fields, Beck et al. (2017) noted that the inclusion of less-accurate models did not severely degrade the overall performance of the ensemble.

The ensemble mean is a straightforward and widely used method of summarizing the performance of an ensemble of hydrological models. However, the results of Zaherpour et al. (2018) and Beck et al. (2017) suggest that users should not assume a priori that the ensemble mean produces the most trustworthy projections. Zaherpour et al. (2018) recommended weighting individual models based on their performance in the evaluation period. Similar and more advanced techniques for model weighting and process-based observational constraints are already being used in the climate modelling community (see e.g., Giorgi and Mearns 2003; Gillett 2015; Sanderson et al. 2017; Eyring et al. 2019).
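
A simple skill-based weighting scheme, as one possible reading of that recommendation, might look like the following sketch, which weights ensemble members by their inverse RMSE over an evaluation period. The choice of error metric is an assumption here, and more sophisticated schemes also penalize model interdependence.

```python
import numpy as np

def inverse_error_weights(obs, sims):
    """Weight each ensemble member by the inverse of its RMSE against
    observations over the evaluation period; weights sum to one.

    obs  : array of shape (time,)
    sims : array of shape (models, time)
    """
    rmse = np.sqrt(np.mean((sims - obs) ** 2, axis=1))
    weights = 1.0 / rmse
    return weights / weights.sum()

def weighted_projection(obs_hist, sims_hist, sims_future):
    """Apply historical-skill weights to the future projections."""
    w = inverse_error_weights(obs_hist, sims_hist)
    return np.tensordot(w, sims_future, axes=1)  # weighted ensemble mean
```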

However, when analysing the results of multi-model ensembles, even when using simple statistics such as the ensemble mean, we need to ask ourselves whether the usual statistical assumptions actually hold. Many of the studies discussed here are “ensembles of opportunity” that were never designed for such a statistical analysis at the outset, and a key concern is to what extent these models can be considered independent. For the Coupled Model Intercomparison Project CMIP5 ensemble of GCMs, Knutti et al. (2013) established that many GCMs were not only strongly tied to their predecessors, but had also exchanged ideas and code with other models, implying that the CMIP5 models were neither independent of each other nor independent of the earlier generation. They argued that this interdependence of models complicates the interpretation of multi-model ensembles but largely goes unnoticed. The same may also apply to the ensembles of GHMs and LSMs discussed here, yet the degree of interdependence between these models has never been thoroughly examined.

Keeping in mind that multi-model ensembles are a way to explore structural uncertainty in model formulation, one should ask whether these ensembles of opportunity are indeed sampling the relevant space of possible alternative model structures, if that space could even be specified (Parker 2013). The point of multi-model ensemble studies is not to produce only an ensemble mean that may or may not compare better than individual models with observations in a particular area. Instead we need to look at the full range of responses, better understand why some of the differences occur, and better understand what this tells us about the uncertainty in the projections of climate change impacts.

3.6 Incorporating human factors

River basins around the world are increasingly being modified by human activities, such as building reservoirs and extracting water for irrigation. To enable applications in water resources management, many large-scale hydrological models have now included these anthropogenic factors. This demonstrably enhances model simulation capabilities and enables a more realistic comparison with observations (Zaherpour et al. 2018; Veldkamp et al. 2018).

Veldkamp et al. (2018) compared the results of five state-of-the-art GHMs with observations to examine the role of human impact parameterization (HIP) in streamflow simulation. They found that including human activities significantly improves model performance, a result that was robust across both managed and near-natural catchments and across the GHMs: the bias in the long-term mean monthly discharge decreased and the modelled hydrological variability ratio improved. Including HIP also improved the simulation of hydrological extremes, although, while it generally improved the absolute magnitude of simulated high flows, its impact on low flows was mixed.

Liu et al. (2017) noted that parameterizing anthropogenic water uses is likely to introduce additional uncertainty in GHMs. Using four GHMs, they conducted the first quantitative investigation of between-model uncertainty resulting from the inclusion of human impact parameterizations. The differences between the simulations with and without HIP were found to be significantly related to the fraction of irrigated area in the basins. Liu et al. (2017) also discussed differences in the parameterizations of irrigation, reservoir regulation, and water withdrawals, pointing towards potential directions for improvement in future GHM development. Further discussion on including human interventions in hydrological models can be found in Nazemi and Wheater (2015), Pokhrel et al. (2016), and Wada et al. (2017).

4 Uncertainty and scale

The uncertainty in climate impact projections is intrinsically linked to the scale of the analysis. To illustrate this point, we revisit here the results of Dankers et al. (2014), who provided a first assessment of changes in flood hazard at the global scale based on a relatively large ensemble of climate and impact model simulations from the first (fast-track) phase of ISIMIP (Warszawski et al. 2014). The ISIMIP fast-track experiments were aimed at providing a rapid assessment of projections of climate impacts, whereas later phases have included a greater focus on model evaluation. In total, nine models provided simulations of daily river discharge on a global 0.5-degree grid to the ISIMIP archive. Each IM was driven by bias-corrected simulations of five GCMs (see Hempel et al. 2013 for details) for up to four scenarios of atmospheric greenhouse gas concentrations (Moss et al. 2010). As an indicator of present-day flood hazard, Dankers et al. (2014) estimated the 30-year return level of river flow (Q30) at each grid cell for the 30-year period 1971–2000.

Projections of flood hazard and extreme events in general are typically subject to large uncertainty arising not only from climate and hydrological modelling uncertainties, but also from uncertainties associated with estimating the frequency (or probability) of hydrological extremes from relatively short time series. The uncertainty related to estimating extremes is in essence a sampling uncertainty and is a function of the length of the time series being used. Extreme river flows such as the Q30 directly relate to the hazard of a flood event happening along a given stretch of a river, but not to the hazard of flooding of a specific area, which would require additional inundation modelling. Note that the Q30 is not a very extreme discharge level: while the probability of exceedance (Pe) in any given year is 1/30 (0.03), in any given 10-year period it amounts to almost a third (0.29). However, from 30 years of data it can be estimated more robustly than other indicators that are sometimes used, such as the 100-year return level.
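
The numbers quoted above follow directly from the exceedance arithmetic, and the return level itself is typically obtained by fitting an extreme value distribution to annual maxima. The sketch below, using synthetic data and scipy’s generalized extreme value distribution, illustrates both steps; it is a minimal example, not the exact procedure of Dankers et al. (2014).

```python
import numpy as np
from scipy import stats

# Exceedance arithmetic for the 30-year return level (Q30)
pe_year = 1.0 / 30.0                     # ~0.033: exceedance probability in any year
pe_decade = 1.0 - (1.0 - pe_year) ** 10  # ~0.288: exceedance in any 10-year period

# Estimating Q30 from a 30-year annual-maximum series (synthetic data here)
ann_max = stats.genextreme.rvs(c=-0.1, loc=100.0, scale=20.0, size=30,
                               random_state=0)
shape, loc, scale = stats.genextreme.fit(ann_max)       # fit a GEV distribution
q30 = stats.genextreme.isf(pe_year, shape, loc, scale)  # level exceeded with p=1/30
print(f"Pe over 10 years: {pe_decade:.2f}, estimated Q30: {q30:.1f}")
```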

Dankers et al. (2014) noted that in individual river basins the uncertainty in the projections of changes in flood hazard can be large, and often even the direction of change (i.e., an increase or decrease) is not clear. Figure 2 summarizes the changes in Q30 at the outlet of 12 major river basins across the world by the end of this century under two RCPs. Here, changes in Q30 were calculated by estimating the 30-year return level separately for the period 2070–2099, and the uncertainty associated with estimating an extreme return level after fitting an extreme value distribution (Coles 2001) may have influenced the results to some extent.

Fig. 2 Relative change (%) in the 30-year return level of river flow (Q30) at the outlet of 12 major river basins as simulated by nine impact models (IMs), each driven by five general circulation models (GCMs) for two different RCPs. Changes in Q30 were calculated by estimating the 30-year return level separately for the historical (1971–2000) and scenario (2070–2099) periods (see Dankers et al. 2014 for details). The distribution of changes in Q30 across the nine IMs is shown by boxplots for each driving GCM, indicated by numbers on the horizontal axis: 1 = HadGEM2-ES; 2 = IPSL-CM5A-LR; 3 = MIROC-ESM-CHEM; 4 = NorESM1-M; 5 = GFDL-ESM2M. Outliers that fall outside the range of the vertical axis are indicated with x. Note the deviating scale for the Murray–Darling and Nile Rivers

In some river basins (e.g., the Lena) there is a robust signal for an increase in extreme river flow levels across all model combinations, particularly under RCP8.5. But often the signal is much less clear, and in many cases the IMs do not agree on the direction of change even though they have been driven by the same climate forcing (e.g., the Mississippi). Similarly, different driving GCMs sometimes yield conflicting results on the sign of change in Q30 (e.g., the Yangtze). This highlights once again that, in addition to GCM uncertainty, IM uncertainty arising from differences between the impact models can be a significant component of the overall uncertainty.

These results are complementary to those obtained by, for example, Hirabayashi et al. (2013), who similarly found a low consistency in the direction of change in 100-year discharge in many rivers across a larger ensemble of (uncorrected) GCM simulations driving a single flood inundation model. Likewise, Rojas et al. (2012) found large discrepancies in the magnitude of change in flood hazard at the scale of individual river basins in Europe in an ensemble of 12 climate simulations driving a single hydrological model. More recently, Do et al. (2020) studied, at the global scale, historical and future changes in annual maxima of 7-day streamflow, using a comprehensive streamflow archive and six GHMs. The models showed a low to moderate capacity to simulate spatial patterns of historical trends, highlighting the role of model structural uncertainty.

In many cases, these global IMs were not tuned to local-scale conditions, and we should ask whether their results can be used to understand climate impacts in a single basin. It may be better to present their results at the global or perhaps regional (sub-continental) scale. To obtain a global aggregate picture, we can calculate a global exceedance rate (E), summarizing how often in a given year the historical Q30 is exceeded globally:

$$ E_s = \frac{1}{N} \sum_{i=1}^{N} \sum_{d=1}^{D} \left[ Q_{i,d} > Q30_i \right] $$

where \( E_s \) is the global exceedance rate for a given model simulation, \( N \) is the number of land grid cells, \( D \) the number of days in a year, and \( Q_{i,d} \) is the simulated river discharge at cell \( i \) on day \( d \); the bracketed term is an indicator that equals 1 when the condition is met and 0 otherwise.
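
A minimal implementation of this equation, assuming a daily discharge array of shape (days, cells) and a per-cell vector of historical Q30 values, could look as follows.

```python
import numpy as np

def global_exceedance_rate(q_daily, q30):
    """Global exceedance rate E for one year of one GCM-IM simulation.

    q_daily : array of shape (D, N), daily discharge at N land grid cells
    q30     : array of shape (N,), historical 30-year return level per cell
    The comparison implements the indicator bracket in the equation above;
    the double sum counts cell-days above Q30, normalized by N.
    """
    exceed = q_daily > q30           # boolean array, broadcast over grid cells
    return exceed.sum() / q_daily.shape[1]
```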

In essence, E is a measure of the frequency of occurrence of high-flow events (not necessarily flood events) worldwide. It has the advantage that changes in this frequency can be calculated without the need to fit a new extreme value distribution to a future time period, as was done in, for example, Fig. 2.

Since the Pe of Q30 in any grid cell is 0.03 in any given year, we can expect the Q30 to be exceeded at roughly 3% of the land grid points in any year (E = 0.03). In a stationary climate, E would remain at its expected baseline level, and indeed, in the historical part of the simulations (1971–2000) the average E across the ensemble of 45 GCM-IM combinations is 0.032 ± 0.010 (Fig. 3).

Fig. 3 Change in the Q30 global exceedance rate E across the ensemble of 45 GCM-IM combinations under the scenarios RCP2.6 (left panel) and RCP8.5 (right panel). The dark shaded area shows the interquartile range in E across the ensemble, the light shaded area the total range. The dashed horizontal line shows the expected baseline E in the historical part of the simulations (1971–2000)

After the first decade of the twenty-first century, however, the simulations suggest a rapid increase in the global exceedance frequency. This increase is robust across all GCM/IM combinations, albeit stronger in some GCM simulations than others (Fig. 3). Under the high-end greenhouse gas scenario RCP8.5, E is on average 0.152 ± 0.045 in the last two decades of the century, suggesting that globally Q30 levels will be exceeded almost five times more often than in the historical period. The aggressive mitigation scenario RCP2.6 avoids most of the strong increase in E after mid-century, but E is still 0.075 ± 0.024 or more than double the historical rate by the end of the present century.

In both RCPs, the simulations driven by the GCM NorESM1-M generally show the smallest increases in E and MIROC-ESM-CHEM the largest. But there are differences between the IMs, too, with the MATSIRO simulations resulting in the smallest increases in global exceedance frequency on average, and PCR-GLOBWB and (in RCP8.5) JULES the largest.

Analysis of variance (ANOVA) of the simulated E in 2080–2099 shows that both GCM and IM have a significant effect (p < 0.001) on the overall variability in the results, with a significant (p < 0.001) interaction between the two factors. In RCP8.5, the partial effect size (\( \eta_p^2 \)) for the factor GCM is of similar magnitude to that for the IMs (0.61 vs 0.58, respectively), while in RCP2.6 the GCMs contribute more to the total variation (\( \eta_p^2 = 0.41 \)) than the IMs (\( \eta_p^2 = 0.31 \)). This highlights that even at the global scale IM uncertainty has a significant impact on projections of changes in Q30 exceedance frequency.
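
For reference, a two-way ANOVA with partial effect sizes of this kind can be computed along the following lines, assuming a long-format table with one value of E per GCM-IM combination per year; the column names are illustrative, not those of the original analysis.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

def partial_eta_squared(df: pd.DataFrame) -> dict:
    """Two-way ANOVA of E on GCM and IM (with interaction), returning the
    partial effect size SS_effect / (SS_effect + SS_residual) per term.

    `df` is long-format with columns: gcm, im, E (one row per year and
    GCM-IM combination, so that the interaction term has residual df).
    """
    model = ols("E ~ C(gcm) * C(im)", data=df).fit()
    table = sm.stats.anova_lm(model, typ=2)   # type-II sums of squares
    ss_res = table.loc["Residual", "sum_sq"]
    return {term: ss / (ss + ss_res)
            for term, ss in table["sum_sq"].drop("Residual").items()}
```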

So while in individual basins the GCM-IM combinations give very different results, and sometimes do not even agree on the direction of flood hazard change, at the global scale the signal for an increase in the global exceedance rate E is remarkably robust across model combinations. This finding is analogous to Fischer et al. (2013), who found that spatially aggregated projections of precipitation extremes can be highly robust even if they are very uncertain at the local scale. In other words, we can state with more confidence that the RCP8.5 scenario will lead to more frequent flood events globally than we can say where exactly these events will occur.

5 Conclusions

A number of conclusions can be drawn from the previous discussion, as well as from the example of changes in flood hazard at the global scale. First of all, it is clear that in addition to GCM uncertainty, IM (structural) uncertainty can be a significant component of the overall uncertainty in the projections of climate-change impacts on water resources. This implies that studies based on a single hydrological model may well be overconfident and may not adequately sample the uncertainty range, even if they use multiple driving climate models and/or realizations to account for the uncertainty in the projected climate change.

Following Beven (2013), a multi-model approach can be an effective way to explore some of the epistemic uncertainty in impact modelling. However, the mere fact of using a multi-model ensemble does not mean that all sources of uncertainty in the projections have been fully represented or quantified.

We also need to ask ourselves whether we can analyse multi-model ensembles statistically as if they represented aleatory uncertainty, with simple, known error distributions. This is especially true for “ensembles of opportunity” that were not designed for such a statistical analysis at the outset. We need to understand better to what extent the models included in some of the MIPs share the same modelling approaches and process descriptions, and where they differ.

Several studies (e.g., Haddeland et al. 2011; Beck et al. 2017) have found that hydrological models, especially when calibrated at basin scale, tend to outperform the more complex LSMs in reproducing hydrological variability, when compared with observations (usually limited to records of river discharge only). However, in the context of climate change a key process, and a good example of some of the “deeper” uncertainties involved, is the response of the vegetation to the changing climate (Davie et al. 2013; Stephens et al. 2020). At least in a qualitative sense it is well established that higher concentrations of atmospheric CO2 will affect the water-use efficiency of plants, yet unlike the LSMs, most hydrological models ignore this process altogether. If the actual sensitivity to elevated CO2 concentrations is high, studies that use only hydrological model simulations on the basis of their seemingly better performance in the past risk underestimating the true uncertainty in the projected impacts.

We have seen that uncertainty is larger at the smaller scale of individual river basins. This poses a challenge to local adaptation decisions, as greater uncertainty may, for example, require greater protective measures in order to keep the flood hazard at the same level (cf. Hunter 2012). The implication is that global-scale modelling projections will not necessarily provide the best guidance for local-scale decisions. A different approach, more tuned to local conditions, may well be required in order to reduce uncertainty at the local scale. For example, it may be possible to reduce the spread in multi-model ensemble results by down-weighting or eliminating models that are clearly unable to reproduce important aspects of the water cycle in a particular catchment, while in global-scale applications it is unlikely that any one model will be “good” or “bad” everywhere around the globe. However, this desire to narrow the model spread needs to be carefully balanced against the need to sample the full uncertainty range, including the extremes, in order to avoid overconfident projections that may result in wrong adaptation decisions (Knutti 2010).

In the face of large uncertainty that is unlikely to be fully sampled by a limited set of hydrological or impact models, a more productive approach could be to focus on the information that a model or set of models can provide to enable (quasi-) ‘optimal’ decisions (Gupta et al. 2012; Nearing and Gupta 2015). One way to deal with large uncertainty is to evaluate the sensitivity of the decision against a range of possible climate outcomes, thus highlighting critical vulnerabilities that may warrant further attention (cf. Prudhomme et al. 2010). Kundzewicz et al. (2017) noted that it is rather naïve to expect that reliable (in a statistical sense) quantitative projections of future flood hazard may become available. Hence, in order to reduce flood risk, one should focus attention on identification of current and future risks and vulnerability hotspots and improve the situation in areas where such hotspots occur (Kundzewicz et al. 2017).

Perhaps a comprehensive evaluation of all the uncertainties involved in the cascade of climate and impact models, and an honest appraisal of the “deep” uncertainties associated with the modelling process, may feel overwhelming. Yet that is no reason to ignore these uncertainties (Pappenberger and Beven 2006). Since every analysis is conditional on the assumptions about the sources of epistemic uncertainty, Beven et al. (2018) recommend recording these assumptions and evaluating their impact on the uncertainty estimate.

When operational meteorologists produce their weather forecasts, they tend to use the output of numerical weather prediction models as a guide and interpret the model simulations in the light of known and unknown limitations of the models, and their own expert insight into the evolving weather situation. On occasion, they will deviate from the model guidance and produce their own assessment of the expected weather. In a similar way, impact modellers (being disciplinary specialists) need to interpret, or help their users interpret, the output of their simulations. This may extend to being able to understand the driving climate models, and the reasons for some of the differences observed in these climate models. Impact models can be very useful tools to test our hypotheses on expected future climate impacts, but ultimately they are based on our limited knowledge and judgement and subject to the assumptions that were made during their development, and hence they need to be used with caution (Spiegelhalter and Riesch 2011).

6 Recommendations

A number of recommendations can be derived from the previous discussion, both for model development and for model use. First of all, we feel that the community will benefit from developing a common language and clear framework for the treatment of uncertainties in climate impact assessments. For model development, there is a need to explore parameter and structural uncertainty in a more consistent manner, akin to the perturbed parameter ensembles used in climate modelling and the approach used by Lane et al. (2019) for river flow and flood prediction in Great Britain, as opposed to the current “ensembles of opportunity.” At the same time, MIPs could be exploited more fully to better understand the mechanisms and processes that lead to different responses in the models and could explain part of the uncertainty in climate impact projections.

Given the scale of human interference in the hydrological cycle, it is imperative that human impacts are included in large-scale models to improve the realism of the models. However, this requires an ongoing effort in data collection and further development of these schemes. Finally, there are processes that could be highly uncertain in the current generation of models, in particular the water-use efficiency of the vegetation under higher CO2 concentrations, and the representation of groundwater dynamics. More effort is needed to investigate the sensitivity of climate impact projections to these processes, and to improve the realism in the models.

With regard to the use of impact models, a strong message that comes through is that climate impact projections should not be based on a single model, or indeed the ensemble mean. Users need to be aware that the true uncertainty is likely to be larger than what has been sampled by current multi-model ensembles. In ensemble weather forecasting, where, unlike in climate impact studies, model predictions can routinely be compared with the actual outcomes, a common finding is that the ensembles are often “underspread”: in other words, they fail to capture the full range of outcomes, especially at longer lead times. In this area of application, the aim is often to increase the model spread rather than reduce it. In a similar way, climate impact studies should aim to capture the full uncertainty range, enabling users to seek robust decisions that perform well across a wide range of possible future climate impact scenarios (Kalra et al. 2014).