1 Introduction

Nowadays, uncertainty in future climate projections is a main concern since these are widely used to assess the impacts of climate change on the environment and to plan adaptation strategies worldwide (Kundzewicz et al. 2008).

Uncertainty in climate model outputs arises from the challenge of representing physical processes, e.g., clouds and turbulence, and from poor understanding of other processes that can be represented only in simplified form, such as the effects upon climate of growing trees and/or the presence of buildings (Curry and Webster 2011). These processes are not resolved in global climate models, yet they affect the simulation of large-scale temperature and precipitation. On top of such complexity, multiple subsystems are involved, and uncertainty is associated with the models’ structure, e.g., parameters, equations, and initial and boundary conditions (Palmer et al. 2005). The Intergovernmental Panel on Climate Change (IPCC) set several CO2 concentration pathways as external boundary conditions for (future) climate modeling. The latest Assessment Report (AR6) states that human activities have been the main cause of present global warming. However, there is still skepticism about the reliability of climate projections (Busch and Judick 2021). Global circulation models (GCMs) provide climate projections on a coarse grid, generally not suitable to represent climatic variability at a local scale. The consequence is a lack of confidence in future projections (Räisänen 2007) and possible over/underestimation of projected precipitation and the related risks (Moreno-Chamarro et al. 2021). The range of uncertainty increases when assessing hydrological projections, since a complex modeling chain needs to be applied, from GCM scenarios to downscaling (most often necessary to bridge the large gap in scale between climate models and hydrological ones) and then to hydrological modeling (Casale et al. 2021). Every step introduces uncertainty, leading to an overall decrease in the dependability of future projections (Camici et al. 2017).
Although there is no common agreement upon validation methods for climate models (Guillemot 2010), one method to measure/constrain uncertainty is cross-comparison against other simulations and/or against observations, and indeed, the use of past observational data is essential in evaluating possible future scenarios (Koutsoyiannis et al. 2009). Measuring such uncertainty plays a paramount role in water resource management, not only when applying structural measures but also for economic and social strategies (Knutti 2008). Hydrologists are thus required to assess accurately the impact of climate projections on the hydrological cycle (Kang and Ramírez 2007). Moreover, when assessing the impact of climate change on a regional scale, it is essential to evaluate the ability of GCMs to describe local meteorological variability; otherwise, poorly representative climate projections may be obtained in the study area (Salathe et al. 2007; Eyring et al. 2016).

The two-fold purpose of this study was to validate control data series from a set of GCMs and then to evaluate the accuracy of the subsequent climate projections at the local scale, including their use in assessing hydrological discharges at the catchment scale, with a focus upon a case study area in the Lombardy region of Northern Italy. The chosen region is a sensitive target for climate change because (i) it is largely covered by mountains, whose (cryosphere-dependent) hydrological regimes are already suffering from global warming (Beniston et al. 2011; Fuso et al. 2021), and (ii) it hosts two large regulated lakes, crucial for downstream irrigation and hydropower production (Anghileri et al. 2011). Therefore, reliable hydrological projections are essential from an environmental, social, and economic point of view. We used here (i) 10 GCMs and 4 scenarios of projected socioeconomic global changes (shared socioeconomic pathways (SSPs)) of the CMIP6 of IPCC (with a forward projection phase starting in 2015), (ii) a downscaling model to make the climate projections suitable at the local scale, and (iii) the semi-distributed, physically based hydrological model Poli-Hydro to model hydrology under the obtained climate scenarios. To assess the robustness of the hydrological scenarios, a back-cast analysis was carried out, both in terms of climatic drivers, i.e., precipitation and temperature, and of hydrological discharges, by comparing projections against observed data for 20 years, 2002–2021. First, we performed the back-cast against the control data series (2002–2014) of the 10 GCMs after downscaling, by comparing the related statistics of temperature, precipitation, and discharges against the observed ones. After this test, the most representative GCMs for the area were identified.
Then, we assessed the goodness of future projections (here, 2015–2021) by evaluating the confidence interval of the SSPs and verifying whether the observed data are well contained within the range of uncertainty (i.e., confidence limits) of the projected variables.

The paper is structured as follows. In the “Study area” section, the study area is presented. The methodology is reported in the “Methodology” section. This includes the (choice of) global circulation models, the downscaling procedure, and the hydrological modeling, followed by the back-cast analysis and the projections. The results are shown in the “Results” section. Discussion and conclusions are in the “Discussion” and “Conclusion” sections, respectively.

2 Study area

2.1 Case study

The study area is a part of the Lombardy region, nested within the Ticino-Adda catchment (Fig. 1). The area hosts several mountains and glaciers, with the highest elevation of 4020 m, and is thus largely snow/ice fed and rich in both surface water and groundwater (Casale et al. 2021). Two major lakes are located inside the basin, i.e., Lake Maggiore, with a volume of 37 km3, and Lake Como, with a volume of 23.4 km3. The first is fed almost equally by rivers from the Piemonte region and Switzerland, whereas the catchment of the second lies 90% in Lombardy and 10% in Switzerland.

Fig. 1 Automatic weather stations (AWSs) available in the study area. Hydrological sub-basins identified by the hydrometers located at Fuentes, Milano Feltre, Brembate di Sopra, and Castellanza

The climate in the Lake Como catchment is mainly cold, with hot summers at low altitudes and temperature decreasing with altitude. Overall, in the high-altitude catchments, temperature varies between −4 °C in winter and +15 °C in summer. Total precipitation is around 1300 mm/year on average, with peaks in May and November. Downstream in the lowlands, the climate is temperate, with the highest temperature up to +23 °C in summer and peaks of precipitation in autumn, with a total of 1800 mm/year on average.

The lowland area is inside the Po valley, the most productive agricultural area of Europe (Bocchiola et al. 2013), intensively exploited for agriculture and hydropower production. In recent years, due to climate change, summer droughts have required more water to be released from the lake, conflicting with the interests of hydropower producers, who need water during winter (Denaro et al. 2018). Future projections depict a potential worsening of these issues, given that the increase in temperature and the decrease in precipitation will necessarily call for a renewal of water resource management strategies (Fuso et al. 2021).

2.2 Data

Two observational datasets were employed in this study, i.e., (i) daily series of precipitation and temperature (P, T) collected from automatic weather stations (AWSs) of ARPA Lombardia for a baseline period (BP) 2002–2021 and (ii) stream flows from four hydrometric stations (daily water level, converted to discharge). In Fig. 1, the 45 AWSs and the catchment areas contributing to the four river sections (Adda-Fuentes, Lambro-Milano Via Feltre, Brembo-Brembate di Sopra, and Olona-Castellanza) are shown. The meteorological data available here were used both for GCM validation and to feed the hydrological model Poli-Hydro. The hydrometric stations were selected in order to investigate basins with different dimensions and features, as shown in Table 1.

Table 1 Catchments’ features, mean discharge, observed during 2002–2021, and bias obs/mod for each sub-basin averaged during the years 2002–2021

3 Methodology

In this section, the modeling chain is presented. First, the GCMs used are listed, each with 4 SSPs of the CMIP6. Given the coarse resolution of the GCMs, a downscaling procedure was necessary. Then, the precipitation and temperature series from the GCMs, properly downscaled, were used to feed the hydrological model, previously tuned for the baseline period. To increase the reliability of the climate and hydrological projections and to identify the most representative GCMs within the chosen set, a back-cast analysis of GCM control data was performed. Finally, the goodness of the SSP projections was assessed by testing whether the observational data fell within the range of the projected variables.

3.1 Global circulation models

In this study, 10 GCMs were employed, whose scenarios are used as part of the Coupled Model Intercomparison Project (CMIP6) experiment of the IPCC. We used the following models: CNRM-CM6-1 (Voldoire et al. 2019), IPSL-CM6A-LR (Lurton et al. 2020), NorESM2-LM (Seland et al. 2020), INM-CM5-0 (Volodin et al. 2018), EC-EARTH3 (Döscher et al. 2022), CESM2 (Lauritzen et al. 2018), ECHAM6.3 (Mauritsen et al. 2019), CMCC-CM2 (Cherchi et al. 2018), UKESM1 (Sellar et al. 2019), and MIROC6 (Tatebe et al. 2019). The forcing scenarios are a combination of shared socioeconomic pathways (SSPs) and representative concentration pathways (RCPs, expressed in W/m2) used for the AR5 (Taylor et al. 2012). The SSPs represent several possible evolutions of society, considering investments in education, health, and energy development. SSP 1 and SSP 5 project a positive development of society, but while the latter envisages an economy based on fossil fuel, the former conceives a sustainable economy. The SSP 2 scenario depicts a continuation of the historical trend (business-as-usual), while SSP 3 and SSP 4 foresee a negative (climate-wise) development of societal dynamics worldwide. Four SSP scenarios were used in this study, based on the RCP 2.6, 4.5, and 8.5 scenarios, namely, SSP1 2.6, SSP2 4.5, SSP5 8.5, and an intermediate SSP3 7.0 scenario (O’Neill et al. 2016).

3.2 Downscaling procedure

The GCM outputs come with a ∼100 km cell resolution, much coarser than the resolution of a typical hydrological model (such as Poli-Hydro). For this reason, stochastic downscaling is necessary to make the output of the GCM usable at the resolution of the local data or at the resolution of the model (here, 1 km × 1 km).

Downscaling of precipitation was pursued through the stochastic space random cascade method (e.g., Groppelli et al. 2011a), applied here in time (e.g., Bocchiola and Rosso 2006) to reproduce the observed precipitation variability (e.g., Groppelli et al. 2011b). The control run (CR) period 2002–2014 was defined to compare the historical series of GCMs against observed precipitation data. If one defines $R_d^{GAO}$ as the observed average daily precipitation on a given day d at a given AWS, and $R_d^{GCM}$ as the daily precipitation simulated by the GCM (in the cell including the AWS site), the goal of downscaling can be defined as matching (statistically, in terms of mean and variance) the corrected value $R_d^{GCM,corr}$ to the value of $R_d^{GAO}$.

$${R_d}^{GCM, corr}={R_d}^{GCM}\ {Bias}_{GAO}\ {B}_0\ {W}_0$$
(1)
$${Bias}_{GAO}=E\left[{R}_{GAO}\right]/E\left[{R}_{GCM}\right]$$
(2)
$$P\left({B}_0=1/{p}_0\right)={p}_0$$
(3)
$$P\left({B}_0=0\right)=1-{p}_0$$
(4)
$${W}_0=\exp \left({w}_0-{\sigma_{w0}}^2/2\right)$$
(5)

With

$$E\left[{B}_0\right]={p}_0\cdot 1/{p}_0+0\cdot \left(1-{p}_0\right)=1$$
$$E\left[{W}_0\right]=1$$
$${w}_0\sim N\left(0,{\sigma_{w0}}^2\right)$$
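As a rough numerical illustration of Eqs. (1)–(5), the sketch below draws the two generators and applies them to a single GCM daily value. It is a minimal sketch, not the authors' implementation: it assumes that the bias factor multiplies the GCM value (so that the corrected mean matches the observed mean, given that both generators have unit mean), and the function name and all parameter values are hypothetical.

```python
import math
import random


def downscale_precipitation(r_gcm, bias, p0, sigma_w, rng):
    """Return one stochastic realisation of corrected daily precipitation.

    Assumption: the bias factor multiplies the GCM value, so the expected
    corrected value is bias * r_gcm (B0 and W0 both have unit mean).
    """
    # B0: intermittency generator, equal to 1/p0 with probability p0, else 0.
    b0 = 1.0 / p0 if rng.random() < p0 else 0.0
    # W0: exp(w0 - sigma^2 / 2) with w0 ~ N(0, sigma^2), hence unit mean.
    w0 = math.exp(rng.gauss(0.0, sigma_w) - sigma_w ** 2 / 2.0)
    return r_gcm * bias * b0 * w0


rng = random.Random(42)
# Hypothetical values, for illustration only (not calibrated parameters).
r_gcm, bias, p0, sigma_w = 10.0, 1.2, 0.6, 0.5
samples = [downscale_precipitation(r_gcm, bias, p0, sigma_w, rng)
           for _ in range(200_000)]
mean_corr = sum(samples) / len(samples)
dry_fraction = sum(s == 0.0 for s in samples) / len(samples)
print(f"mean corrected value: {mean_corr:.2f} mm (target {r_gcm * bias:.2f} mm)")
print(f"dry fraction: {dry_fraction:.2f} (expected {1 - p0:.2f})")
```

Over many realisations, the sample mean approaches the bias-corrected GCM value and the share of dry days approaches 1 − p0, which is the mean-preserving behavior that the unit-mean conditions on the two generators are meant to guarantee.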

$Bias_{GAO}$, $p_0$, and $\sigma_{w0}^2$ are model parameters to be tuned based on observational data. $Bias_{GAO}$ is, in practice, a calculated bias, forcing the $R_{GCM}$ average to coincide with the $R_{GAO}$ average. $B_0$ is a β model generator that accounts for intermittency: it is null with probability $1-p_0$, i.e., the probability that $R_{GAO}$ is null, conditioned on a positive value of $R_{GCM}$. $W_0$ is a strictly positive (lognormal) random factor with unit mean that adds variability to precipitation. Temperature is downscaled by applying a monthly average temperature bias (ΔT) (Groppelli et al. 2011b), i.e., an offset calculated at the monthly scale between GCM and observed temperature.

$${T_{d,i}}^{GCM, corr}={T_{d,i}}^{GCM}-\left({T_i}^{GCM}-{T_i}^{obs}\right)$$
(6)

Here, $T_{d,i}^{GCM}$ is the temperature of day d in month i, given by the GCM. $T_i^{GCM}$ and $T_i^{obs}$ are the average temperatures in month i, given by the GCM and the station, respectively. $T_{d,i}^{GCM,corr}$ is then the corrected temperature of day d, based on the mean temperature shift between the given GCM and the observed temperature.
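The monthly offset correction can be sketched as follows. This is a minimal illustration under the assumption that the offset is the difference between the GCM and observed monthly means, subtracted from each GCM daily value; the function name and the temperature values are hypothetical.

```python
from statistics import mean


def correct_temperature(t_gcm_day, t_gcm_month_mean, t_obs_month_mean):
    """Shift one GCM daily temperature by the monthly GCM-minus-observed offset."""
    delta_t = t_gcm_month_mean - t_obs_month_mean  # monthly offset (degC)
    return t_gcm_day - delta_t


# Hypothetical daily temperatures (degC) for one month of the control period.
t_gcm = [2.0, 3.5, 1.0, 4.5]
t_obs = [0.5, 1.0, -0.5, 2.0]
t_gcm_mean, t_obs_mean = mean(t_gcm), mean(t_obs)

t_corr = [correct_temperature(t, t_gcm_mean, t_obs_mean) for t in t_gcm]
print(t_corr, mean(t_corr), t_obs_mean)
```

After the shift, the corrected monthly mean coincides with the observed one, while the day-to-day fluctuations of the GCM series are left unchanged.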

After this procedure, a total of 40 precipitation and temperature scenarios (4 SSPs × 10 GCMs) were used to feed Poli-Hydro for the assessment of hydrological scenarios.

3.3 Hydrological model

We used here the physically based, semi-distributed hydrological model Poli-Hydro (Soncini et al. 2017; Aili et al. 2019; Fuso et al. 2021). The model simulates in each grid cell the main physical processes, including (i) snow and ice melt using a mixed degree-day approach based on average daily temperature and short-wave radiation (Aili et al. 2019) and (ii) potential evapotranspiration using Hargreaves’ formula, and subsequently actual evapotranspiration, evaluated based on soil water content, assessed via water budget. The results of model tuning for the catchments involved, in terms of the Bias obs/mod, are reported in Table 1 for the period BP 2002–2021.
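For readers unfamiliar with Hargreaves’ formula, the sketch below evaluates the commonly cited Hargreaves–Samani form of potential evapotranspiration. It is an assumption that this form is the relevant one; the exact formulation and inputs adopted within Poli-Hydro are not given here, and the function name and input values are hypothetical.

```python
import math


def hargreaves_pet(ra_mm_day, t_mean, t_max, t_min):
    """Potential evapotranspiration (mm/day), Hargreaves-Samani form.

    ra_mm_day: extraterrestrial radiation expressed as equivalent
               evaporation (mm/day); temperatures in degC.
    """
    t_range = max(t_max - t_min, 0.0)  # guard against inverted inputs
    pet = 0.0023 * ra_mm_day * (t_mean + 17.8) * math.sqrt(t_range)
    return max(pet, 0.0)  # PET cannot be negative


# Hypothetical summer day in the lowlands.
pet = hargreaves_pet(ra_mm_day=10.0, t_mean=20.0, t_max=25.0, t_min=15.0)
print(f"PET = {pet:.2f} mm/day")
```

The formula needs only temperature and extraterrestrial radiation, which is why it is convenient when, as here, the available drivers are precipitation and temperature series.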

The Poli-Hydro model was tested within a large array of areas and climate conditions worldwide, from mountainous, ice/snow-fed catchments in Italy (e.g., Soncini et al. 2017), in South America (Bocchiola et al. 2018), the greater Himalayas (Soncini et al. 2015, 2016; Casale et al. 2020), and the Caucasus (Baldasso et al. 2019) to semi-arid areas of Europe (e.g., Capolongo et al. 2019) and central Asia (Akbari et al. 2018) and to tropical areas in the Caribbean (Bozza et al. 2016), Africa (Bombelli et al. 2021), and Indonesia (Stucchi and Bocchiola 2023). The Poli-Hydro model performed well within all such climatic, topographic, and hydrological setups.

Notice further that the model has specifically been used for hydrological simulation in the area of interest in former contributions, with good outcomes, especially in Adda river and Como lake catchment (Casale et al. 2021), and also in Serio river (not considered here, bordering east of Brembo river in the DEM in Fig. 1; Fuso et al. 2023).

This is a substantially homogeneous area in terms of climate (temperate dry, with hot/warm summer Csa/b, to polar cold and glacial ET, EF at the highest altitudes; Peel et al. 2007), topography (steep, high-altitude mountainous catchments up to ca. 4000 m a.s.l., with large snow/ice feeding), and hydrology (bimodal regime with spring/fall flood season and dry summers; e.g., Bocchiola 2014).

Accordingly, and based on the results here presented for model calibration, we can assume that the Poli-Hydro model is flexible (i.e., in terms of parameters tuning) and portable (between catchments) enough that it can be confidently used here for our purpose of GCM–RCM evaluation and ranking (e.g., Dakhlaoui and Djebbi 2021).

3.4 Back-cast analysis

All GCMs provide climate projections with a coarse (spatial) resolution, much lower than that of the hydrological model. To make the projections usable for the case study, a downscaling process was performed. The projection period, i.e., the period over which the models project future climate variables, starts in 2015 for CMIP6. Nevertheless, a 20-year window (BP, 2002–2021) was considered here, and thus, the 13 years before the start of the future projections were taken as a historical validation period.

First, GCM validation was carried out via an ex-post analysis of the control data series. This analysis involves comparing the precipitation and temperature obtained from the GCMs (after downscaling) against the observed values for each year (2002–2014). The analysis was carried out both in terms of climatic drivers, i.e., precipitation and temperature, and of hydrological discharges, to test the accuracy of the GCMs, of the downscaling procedure, and of the hydrological modeling.

3.4.1 Climate validation

To assess the suitability of each GCM to represent the local weather, some tests were carried out. A Student’s t-test (hypothesis H0, same mean value) was performed for the observed (Pobs, Tobs) and predicted values of the target variables (PGCM, TGCM), and furthermore, a Fisher’s F-test (hypothesis H0, same variance) was carried out for the variance of precipitation. Temperature variance was not tested because it was observed that (i) once bias corrected, GCMs generally depict sufficiently well periodic changes in temperature, and (ii) high-frequency (i.e., daily) changes of temperature do not generally affect hydrological behavior, which is more affected by seasonal patterns. The GCM data of cumulated annual precipitation and mean annual temperature were assessed against the observed data at the scale of each catchment. To do so, we proceeded as follows. Within each basin of interest, we calculated for every day d the spatially averaged values of precipitation, $P_{av,d}$, and of temperature, $T_{av,d}$. The averaging was carried out by weighting the values at each station (including the corrected values of the GCM precipitation $R_d^{GCM,corr}$ and temperature $T_{d,i}^{GCM,corr}$, downscaled as reported above with reference to the AWS sites) according to the corresponding Thiessen polygon. The Thiessen method is often adopted as a rapid interpolation method, especially for hydrological purposes, and the Poli-Hydro model also operates in this mode. Concerning temperature, to account for the altitudinal thermal shift, we applied a proper lapse rate correction to the AWS temperature within each polygon (Soncini et al. 2017). At the AWSs, snowfall was identified using snow gauges, and new snow was accounted for in the precipitation assessment (with a fresh snow density of 125 kg m−3; e.g., Bocchiola and Rosso 2007). From the spatially averaged daily values so obtained in each catchment, we could calculate several statistics to be tested.
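The basin averaging described above can be sketched as an area-weighted mean over Thiessen polygons, with a lapse rate shift applied to station temperature first. This is only an illustrative sketch: the lapse rate of −6.5 °C/km is the standard-atmosphere value, taken here as an assumption (the value used in the study is not stated), and station values, polygon areas, and altitudes are hypothetical.

```python
def lapse_rate_adjust(t_station, z_station, z_target, lapse_rate=-0.0065):
    """Shift a station temperature to a target altitude.

    lapse_rate is in degC per metre; -0.0065 is an assumed standard value.
    """
    return t_station + lapse_rate * (z_target - z_station)


def thiessen_average(values, areas):
    """Area-weighted mean; `areas` are the Thiessen polygon areas."""
    return sum(v * a for v, a in zip(values, areas)) / sum(areas)


# Hypothetical basin with three stations.
areas = [50.0, 30.0, 20.0]          # polygon areas (km2)
precip = [12.0, 4.0, 0.0]           # daily precipitation (mm)
temps = [10.0, 6.0, 2.0]            # daily mean temperature (degC)
z_station = [300.0, 900.0, 1800.0]  # station altitude (m)
z_polygon = [500.0, 1100.0, 2000.0] # mean polygon altitude (m)

p_av = thiessen_average(precip, areas)
t_adj = [lapse_rate_adjust(t, zs, zp)
         for t, zs, zp in zip(temps, z_station, z_polygon)]
t_av = thiessen_average(t_adj, areas)
print(f"P_av = {p_av:.1f} mm, T_av = {t_av:.1f} degC")
```

The same weights are applied to observed and downscaled GCM series, so that both enter the tests as basin-scale daily values.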

The tests were performed on two different types of samples for each variable, i.e., (i) a sample of cumulated daily precipitation Sa(Pobs), Sa(PGCM) and mean daily temperature Sa(Tobs), Sa(TGCM) for each year (2002–2014) and (ii) a sample of cumulated daily precipitation Ss(Pobs), Ss(PGCM) and mean daily temperature Ss(Tobs), Ss(TGCM) over the four seasons, considering the entire historical period. The same tests were also carried out considering monthly instead of daily aggregation, thus reducing the sample size. We therefore also considered (iii) monthly cumulated precipitation Sca(PGCM) and mean monthly temperature Sas(TGCM) for each year and (iv) cumulated monthly precipitation Scs(PGCM) and mean monthly temperature Scs(TGCM) over the four seasons, considering the entire historical period. Concerning precipitation, we did not include a statistical analysis of zero rainfall days (dry spells). Zero rainfall days may reflect the internal variability of the phenomenon in time and may influence drought/flood dynamics and, in general, flow seasonality/regime. However, we decided to test here the precipitation variability because we assume that this affects stream flow variability more directly. Dry spells are therein somewhat reflected by low values of precipitation.

We carried out the tests for the downscaled control series of each GCM, for a total of 140 t-tests for P and T and 140 F-tests for P (10 GCMs × 14 years) and 40 t-tests for P and T and 40 F-tests for P (10 GCMs × 4 seasons) for each basin. These tests were performed considering both daily (i, ii) and monthly (iii, iv) aggregation. Yearly and seasonal p-values were obtained, measuring the goodness of fit of the statistics of the simulated variables (first order, and also second order for rainfall) against the observed ones.
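The two test statistics can be sketched with the Python standard library as follows. This is a minimal sketch under stated assumptions: the pooled-variance form of Student’s t is used, the two-sided p-value of the t-test relies on the normal approximation (reasonable only for large samples such as a year of daily values), and the F statistic is returned without a p-value, which would require the F distribution from a statistics library. The synthetic series are hypothetical and are not precipitation data.

```python
import math
import random
from statistics import NormalDist, mean, variance


def student_t(a, b):
    """Pooled-variance Student t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1.0 / na + 1.0 / nb))
    return t, na + nb - 2


def fisher_f(a, b):
    """Variance ratio F and its (numerator, denominator) degrees of freedom."""
    return variance(a) / variance(b), (len(a) - 1, len(b) - 1)


def p_value_large_sample(t):
    """Two-sided p-value, normal approximation to Student's t (large samples)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


rng = random.Random(1)
# Hypothetical one-year daily series: same length, different mean and spread.
obs = [rng.gauss(3.0, 2.0) for _ in range(365)]
gcm = [rng.gauss(5.0, 1.0) for _ in range(365)]

t_stat, dof = student_t(gcm, obs)
f_stat, f_dof = fisher_f(obs, gcm)
p_t = p_value_large_sample(t_stat)
print(f"t = {t_stat:.2f} (dof {dof}), p = {p_t:.3g}; F = {f_stat:.2f} (dof {f_dof})")
```

In the convention adopted in this study, a test is counted as passed when the p-value is at least 0.05, i.e., when H0 (same mean, or same variance) is not rejected.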

3.4.2 Hydrological validation

To validate the GCMs downstream of the modeling chain, a Student’s t-test was carried out to compare the modeled discharges from GCMs (QGCM) against the modeled discharges from AWS data (Qmod). The comparison was made against modeled, rather than observed, discharges because the performance of the hydrological model had already been assessed and deemed acceptable (“Hydrological model” section), and to keep the hydrological model’s own noise out of the comparison. Hydrological variations from GCM projections are generally assessed by focusing on the difference between mean discharges over 10 years, e.g., at mid-century, and mean discharges over 10 years of the CR (Aili et al. 2019; Fuso et al. 2021; Bombelli et al. 2021; Casale et al. 2021). For this reason, we decided to test the hydrological performance of the GCMs considering 4 moving windows of 10 years, i.e., 4 t-tests were performed for each GCM (2002–2011, 2003–2012, 2004–2013, 2005–2014), each on a sample of about 3650 daily values. As for the climate variables, we also carried out the tests on the mean monthly discharges, thus reducing the sample size to 120 values.
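The four moving windows and the corresponding sample sizes can be enumerated with a few lines of code. The helper names below are hypothetical; the calendar arithmetic only illustrates why the daily sample size is about 3650 (the exact count depends on the leap years included in each window).

```python
from datetime import date


def moving_windows(first_year, last_year, length):
    """All consecutive windows of `length` years within [first_year, last_year]."""
    return [(y, y + length - 1)
            for y in range(first_year, last_year - length + 2)]


def n_days(start_year, end_year):
    """Number of calendar days from 1 Jan of start_year to 31 Dec of end_year."""
    return (date(end_year + 1, 1, 1) - date(start_year, 1, 1)).days


windows = moving_windows(2002, 2014, 10)
daily_sizes = [n_days(start, end) for start, end in windows]
monthly_sizes = [(end - start + 1) * 12 for start, end in windows]

for (start, end), nd, nm in zip(windows, daily_sizes, monthly_sizes):
    print(f"{start}-{end}: {nd} daily values, {nm} monthly values")
```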

3.5 Projections of future climate and hydrology

After the GCM validation, we averaged the outputs of the 10 GCMs for each of the four SSPs, and we assessed the goodness of climate and hydrological projections by verifying whether the observed data fall within the range of the projected variables during the years 2015–2021. The range of each SSP is represented by the 95% confidence interval. For precipitation and temperature, the properly downscaled GCM projections are compared against the observed AWS data, while for discharges, we compared the Poli-Hydro discharges modeled using AWS data as input against those modeled using the downscaled climate projections. In doing so, one may (i) assess changes in future vs. control-period stream flows by offsetting the hydrological model’s error, only considering the effects of the climate, and (ii), specifically here, assess the dependability of outputs from each GCM, independently of the hydrological model’s interference. Another reason for not using observed stream flows is the potential presence of regulation. Hydrological models generally do not include regulation, and the obtained streamflow series are solely driven by climate variability. Accordingly, when assessing the credibility of future projected streamflow series, differences against the observations may actually depend upon the presence of (even slight) river regulation, which has little to do with the dependability of the tested GCMs. Such issues can be bypassed by using simulated series, both in the control run and in the future, so that one observes only the changes/differences given by the climate input.
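One possible reading of this check is sketched below: for a given SSP and year, the 10 GCM values are summarized by their mean and a 95% confidence interval, and the observed value is tested for inclusion. It is an assumption that the interval is the Student-t confidence interval of the ensemble mean; the study may define the range differently. All numerical values are hypothetical.

```python
import math
from statistics import mean, stdev

# Two-sided 95% Student t quantile for 9 degrees of freedom (10 GCMs).
T_975_DOF9 = 2.262


def ensemble_ci(values, t_quantile=T_975_DOF9):
    """95% confidence interval of the ensemble mean (assumed definition)."""
    half_width = t_quantile * stdev(values) / math.sqrt(len(values))
    centre = mean(values)
    return centre - half_width, centre + half_width


def within(observed, interval):
    """True if the observed value lies inside the interval (bounds included)."""
    lower, upper = interval
    return lower <= observed <= upper


# Hypothetical mean annual temperatures (degC) from 10 GCMs for one SSP and year.
gcm_values = [9.8, 10.4, 10.1, 9.6, 10.9, 10.2, 9.9, 10.5, 10.0, 10.6]
ci = ensemble_ci(gcm_values)
print(f"95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]")
print("observed 10.3 within CI:", within(10.3, ci))
print("observed 11.0 within CI:", within(11.0, ci))
```

The hard-coded quantile applies only to an ensemble of 10 members; a different ensemble size would require the corresponding t quantile.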

4 Results

4.1 Back-cast analysis

4.1.1 Climate models validation

Figures 2 and 3 represent the mean annual temperature and the cumulated annual precipitation observed and modeled for each GCM and over each basin during 2002–2014.

Fig. 2 Mean annual temperature over each basin from 2002 to 2014 for each GCM. The points represent the mean value over each GCM (colored), and the black line represents the mean observed temperature from AWS

Fig. 3 Cumulated annual precipitation over each basin from 2002 to 2014 for each GCM. The points represent the mean value over each GCM (colored), and the black line represents the mean observed value from AWS

The mean observed annual temperature in the Adda catchment during 2007–2009 is generally underestimated by all GCMs, and none of the models successfully reproduces the lowest temperature, observed in 2010 across all basins. Notice, however, that GCMs, working at the global scale, are expected to represent acceptably the general (mean) climate conditions over large scales while being less accurate in depicting more variable (and local) behavior. Despite such flaws, the historical trend of the GCMs reproduces the observed values acceptably well (Fig. 2).

The observed precipitation displays a similar fluctuating pattern in all basins, with different absolute values, the Adda catchment being the least rainy one (Fig. 3). The historical trend of GCMs is smoother than that of the observed precipitation, and both overestimation and underestimation can be observed, the former during 2004–2007 for the Adda basin and during 2003–2007 for the other catchments, the latter after 2012 for Adda and Brembo and after 2013 for Lambro and Olona.

To aggregate these results, for each model, we counted the number of years (out of 14) when the p-value was significant (i.e., p-value ≥0.05, so that H0 is not rejected) (Table 2). We report here the number of years considering both the daily and monthly values. The results show that using daily values may result in lower p-values and thus in a smaller number of years in which the mean and variance of the climate variables from GCMs can be considered consistent with the observed values.

Table 2 Number of years in which the p-value of the t-test (for T and P) and F-test (for P) is significant for each GCM

When considering a monthly aggregation, all GCMs present acceptable performance in reproducing the observed climate patterns in all catchments. On the other hand, when the temporal resolution is increased to the daily scale, the models show different results. For temperature overall, the best models seem to be MPI-ESM, MIROC6, and IPSL. The mean value of precipitation is well represented overall by MPI-ESM and CNRM, whereas in terms of variability, the results are less satisfactory.

Beyond average annual values, one wants to assess the seasonal patterns of climate variables since their seasonal behavior directly impacts hydrological response. In Table 3, the mean seasonal error between the temperature from the GCMs and the observed values is reported for each basin. For each GCM and basin, we performed 4 Student’s t-tests, one for each season over the whole historical period 2002–2014, considering both the daily and the monthly aggregation. For the mean daily temperature, the results of the t-tests were significant for all the GCMs in all the basins, except for the Adda catchment, where all the GCMs performed poorly in spring and summer (Table 3). This is possibly caused by many missing (no-data) temperature values in the AWSs of the Adda basin in the spring and summer months, which were not filled. However, when considering the mean monthly temperature, this issue is masked, since monthly aggregation seemingly hides the daily no-data records.

Table 3 Mean error on the mean seasonal temperature from GCMs Ss(TGCM) against the mean seasonal observed value Ss(Tobs) over each basin for the years 2002–2014 for each GCM

For precipitation, Student’s t-tests and F-tests were carried out for each season for the years 2002–2014, and the mean error for the cumulated seasonal precipitation from GCMs against the cumulated seasonal observed values is reported in Table 4. The GCMs depict well the mean seasonal precipitation, but they fail to represent its variability as quantified by variance, where the noise is the highest. As shown in Table 4, t-tests are significant for all the basins, considering both the daily and the monthly aggregation. F-tests, on the contrary, are not significant for the daily samples but improve at the monthly scale. However, in spring and autumn, the GCMs still poorly represent the variance of precipitation. This occurs mostly in the rainiest months, when the intrinsic variability of the precipitation pattern is higher, making its representation more challenging.

Table 4 Mean error on the cumulated seasonal precipitation from GCMs Ss(PGCM) against the cumulated seasonal observed value Ss(Pobs) for the years 2002–2014 for each GCM

Overall, the best GCMs in representing the seasonal variability of our target climate variables are CESM2, CMCC, UKESM1, and IPSL.

4.1.2 Hydrological validation

The uncertainty associated with hydrological modeling was also assessed, and the benchmarking of hydrological discharges is reported in Fig. 4. We compared the projected discharges from each GCM against the discharges modeled by Poli-Hydro using the observed AWS weather data.

Fig. 4 Mean annual discharge over each basin from 2002 to 2014 for each GCM. The points represent the mean value over each GCM (colored), and the red line represents the mean modeled values from AWS

Since the main driver of the hydrological model (and of the hydrological cycle) is precipitation, the annual discharge trend is correlated with precipitation patterns. Thus, GCM discharges overestimated AWS discharges during 2003/2004–2007 and underestimated them during the most recent years of the historical period.

In Table 5, the mean errors in hydrological modeling using the GCMs for the 4 ten-year windows, i.e., 2002–2011, 2003–2012, 2004–2013, and 2005–2014, are reported. The lowest errors are found in the Adda catchment, while the highest are in the Brembo basin. Overall, the errors are quite small in all windows, except for the most recent one, since in 2014, precipitation was higher than the average value, resulting in very high modeled discharges when using AWS data (Figs. 4 and 5). The significance of the t-tests is also reported in Table 5, where the results are similar to those previously shown for the climate variables. By reducing the sample size, i.e., moving from daily to monthly aggregation, the p-values increase.

Table 5 Mean error on the yearly discharge from GCMs (QGCM) against its modeled value from AWS (Qmod) for each moving window and GCM. Student’s t-test is used, and the significant values are in bold. Results of t-test for discharge for each GCM. Daily (D) and monthly (M) aggregation is considered. Significant tests are marked with Y (α = 5%)

Fig. 5 Mean annual discharges. Representation of p-value vs. mean error, calculated for the t-tests carried out with a daily resolution for each combination of GCM and year. The results for each basin are highlighted with different colors

However, it is interesting that small mean errors, in the order of 2–3%, are flagged as large by the tests when using daily samples, likely an effect of the large sample size (ca. 3652 values). To extend the analysis, we combined the results of all t-tests of the basins by representing the relation between p-values and mean errors (Fig. 5). The graph shows that the acceptance limit adopted here (p-value ≥0.05) does not visibly identify a threshold in the mean error, and even p-values in the order of 2% correspond to relatively small errors. The choice of p-value <0.05 (for test failure) is quite customary, and it was taken here as a reference. However, Fig. 5 seems to demonstrate that considerably higher noise is only seen for p-values below 0.01 or so. Accordingly, one may hypothesize that more discriminating power (i.e., to highlight projections displaying large inaccuracy) may be obtained by setting the failure threshold at p-value ≤0.01. Such a topic may deserve further verification.
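The effect of sample size on the p-value can be illustrated with a simple calculation. The sketch below is not the analysis carried out in the study: it assumes two samples of equal size with the same coefficient of variation, independent values (daily discharges are in fact autocorrelated), and the normal approximation to the t distribution. The coefficient of variation of 0.5 and the 3% relative error are hypothetical.

```python
import math
from statistics import NormalDist


def p_value_for_relative_error(rel_error, cv, n):
    """Two-sided p-value for two samples of size n whose means differ by rel_error.

    Assumptions: equal coefficient of variation `cv` in both samples,
    independent values, normal approximation to Student's t.
    """
    # Standard error of the difference of two means, relative to the mean.
    rel_standard_error = cv * math.sqrt(2.0 / n)
    z = rel_error / rel_standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


p_monthly = p_value_for_relative_error(rel_error=0.03, cv=0.5, n=120)
p_daily = p_value_for_relative_error(rel_error=0.03, cv=0.5, n=3652)
print(f"monthly sample (n = 120):  p = {p_monthly:.3f}")
print(f"daily sample   (n = 3652): p = {p_daily:.3f}")
```

Under these assumptions, the same 3% mean error passes the test at the 5% level with 120 values and fails it with 3652 values, which is consistent with the behavior described above. In practice, monthly aggregation also changes the variability of the series, so the sketch isolates only the sample-size effect.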

4.2 Projected scenarios

We used the GCMs previously tested to project climate variables forward during 2015–2021, to further assess the goodness of the different SSP scenario projections. We calculated the average value and confidence limits for the 10 GCMs for each SSP, and we evaluated whether the observed data of climatic variables and the modeled data of hydrological variables were within the confidence intervals of the GCM projections. We represented the mean observed values and the mean value averaged over the 10 GCMs for each SSP, with their 95% confidence interval.

Supplementary Figs. A3 and A4 represent the average annual temperature and the cumulated annual precipitation over each of the four basins, calculated as the mean value for each SSP, averaged over the GCMs, vs. the mean values assessed from ARPA weather stations. The results show that the mean observed temperature in all basins is well represented by the average value of the GCMs for each SSP scenario. The observed precipitation presents a similar fluctuating pattern in all the basins with different absolute values, with the Adda catchment being the least rainy. Similarly to temperature, the observed precipitation falls within the confidence interval of each SSP, but with slight differences between catchments and years due to the erratic behavior of precipitation.

The uncertainty associated with the hydrological projections is reported in Supplementary Fig. A5. We compared the projected discharges modeled for each GCM with the discharges modeled with Poli-Hydro using the observed AWS weather data. Since the main driver of the hydrological model is clearly precipitation (Casale et al. 2021), the annual discharge trend is somewhat correlated with the precipitation pattern in each basin, and, as expected, the lowest, least snowy watersheds display a closer correlation.

5 Discussion

In this study, we provided a validation of the ability of GCMs to capture local climate and hydrological variability. Although several studies have attempted to evaluate the goodness of GCMs in representing climate change at a regional scale (Ruane and McDermid 2017; Prein et al. 2019), there are still concerns about how to classify whether a model is adequate or not (Knutti and Knutti 2010; Palmer et al. 2022). We used here an objective statistical approach to validate the control data series of GCMs (Groppelli et al. 2011a). The results of the statistical tests are in line with previous studies, in which a large number of values may result in considerably low p-values and, therefore, in test failure even when errors are small. Indeed, by reducing the sample size, i.e., with monthly aggregation, most of the tests gave acceptable results. This seems legitimate, given the small noise that was generally found here. For climate variables, this holds more for the average value, whereas the variance of precipitation is harder to reproduce.

Compared to other recent studies (Eyring et al. 2019; Brunner et al. 2019), we tried to take a step forward from the analysis of models’ climatic outputs to also validate hydrological projections, widely used in future water resource management (Ravazzani et al. 2016; Gianinetto et al. 2017; Bombelli et al. 2019; Stucchi et al. 2019; Aili et al. 2019).

Our results show that the goodness of GCMs in representing the local climate variability is not always reflected in a good representation of the hydrological variables of the catchment area (using the same hydrological model as here). The analysis of climate patterns alone may give poor results, i.e., a poor capacity to represent the water cycle of an area. When assessing hydrological projections, a seasonal validation of climate patterns is crucial since changes in the seasonality of precipitation have a paramount effect on stream flow discharge (Christensen and Lettenmaier 2007). Our results, both in terms of seasonality of climate patterns and hydrological projections, showed that the GCM outputs are consistent with the observed data.

One may discuss whether the forcing scenarios adopted here in the control period (2002–2014) were coherent with the actual environmental conditions observed therein, which may (partly) explain the observed differences in Figs. 2 and 3. The GCM projections are, in practice, based on assumptions of GHG concentration (normally a bulk value valid overall on the earth’s surface), evolving in time according to the hypothesized SSP scenarios. Considering that Figs. 2 and 3 refer to the recent past, the GCMs most likely assumed levels of GHGs fairly equivalent to those now (e.g., for CO2, ca. 400 ppm, and so on for other gases). Accordingly, one may assume that the reference (environmental) conditions set out within the different SSPs would in practice coincide with these values in the control period (diverging, however, more and more toward the end of the century).

Therefore, such an aspect should not largely affect the results of the simulation, apart from the models’ inherent noise.

To broaden our knowledge of the uncertainty range, more hydrological models could be employed (Krysanova et al. 2020). However, a recent study in the same area demonstrated that uncertainty in flow projection is likely to be affected more by climate projections and internal variability than by the choice of the hydrological model (Casale et al. 2021). This result is coherent with other research that identifies the choice of GCM, rather than the choice of hydrological model, as the largest source of uncertainty in the short term (Aryal et al. 2019).

The GCMs provide global projections; thus, to determine their validity in representing local climate, a downscaling procedure is most often needed. We used here a statistical downscaling method widely used in this area with acceptable results (Aili et al. 2019; Fuso et al. 2021; Bombelli et al. 2021). The methodology proposed here is space-independent, but the main findings are strictly tied to the study area, underlining the importance of a local analysis in the evaluation of climate change effects. In the future, more downscaling procedures may be tested.

6 Conclusion

The study presented here demonstrates that the climate projections provided by models within the CMIP6 of the IPCC are (i) acceptably consistent, spatially and temporally, with observed data collected from AWSs in the study area of the Lombardy region and (ii) amenable to provide acceptable hydrological conjectures in the area.

Our comparison was carried out over a period of 20 years, of which 13 years (2002–2014) formed the control period, where the reliability of the GCMs was assessed against historical observations already available in the simulation phase, and 7 years (2015–2021) formed the projection period, where an ex-post validation was pursued. In both cases, the reliability of the GCMs (all subjected to an equivalent downscaling process) was evaluated. Subsequently, to introduce an additional measure of uncertainty and to validate the results from a hydrological point of view, stream flows were also compared. The analyses aimed to assess the general consistency of the GCM climate and hydrological scenarios on a wide spatial scale. Clearly, these scenarios represent projections valid in a purely statistical sense, which are useful to provide an array of potential conditions for planning and management at a regional/basin scale. Moreover, the 7-year projection horizon considered here is particularly short. Therefore, a statistical validation such as the one investigated here can be repeated in the years to come as new observed data become available. Given, however, the evidently very difficult (if at all possible) task of projecting the future of climate and hydrology acceptably beyond short (e.g., seasonal) time scales, we provide here at least an indication that GCM outputs, however noisy, can be considered acceptably accurate in a broad average sense and used for (however uncertain) water policy-making, which has rarely (if ever) been demonstrated hitherto.

One may wonder whether the outputs of CMIP6 models, as a new generation of GCMs, will bring improvements in model adequacy when using stochastic downscaling. Such a question would call for a specific comparison between CMIP5 (and earlier) and CMIP6 models. This was not the scope of our work, and indeed, we focused on CMIP6, given that, in the future, most projection studies will likely be based on such models.

In the personal experience of the authors, who have worked in the recent past in the field of hydrological modeling and projections using outputs from CMIP4-5-6, the GCM outputs from CMIP6 seemingly provide an improved depiction of climate and especially of the precipitation cycle (likely the most critical, and yet important, variable when dealing with hydrological conjectures), in terms of amount and timing. However, such a qualitative perception is not backed up by specific testing.

Notice that stochastic downscaling involves a certain (noticeable) degree of (mathematical) manipulation of the GCM data, especially in order to provide a proper (time/space) variability of the precipitation process. Accordingly, after such modifications, the accuracy/credibility of the initial (GCM) data may be less relevant to the final result. Such a topic seems worth some investigation in the future.