1 Introduction

Accurately characterising future climate is of crucial importance for medium and long-term water resource planning and management within the context of climate change (IPCC 2022). While General Circulation Models (GCMs) have emerged as powerful tools for climate prediction (Semenov and Stratonovitch 2010), they still exhibit certain limitations when it comes to representing regional climates affected by small-scale processes (Torma et al. 2015). Addressing this need, Regional Climate Models (RCMs) have been developed, based on dynamic downscaling of GCM models to provide high-resolution data. Notably, the CORDEX (Giorgi et al. 2009) initiative has successfully brought together RCM projects from around the world, boasting more than 70 simulations for the European region (Jacob et al. 2014, http://www.euro-cordex.net/).

Despite the clear advantages of RCMs over GCMs in capturing the primary features of regional climate (Kotlarski et al. 2014; Ciarlo et al. 2021), inherent uncertainties persist, extending beyond the scope of downscaling. These uncertainties encompass structural disparities in both GCM and RCM models (Knutti et al. 2008), the downscaling technique itself (Zhu et al. 2019), model parametrizations in reference to physical processes (Chen et al. 2011), and initial conditions, among other factors (Knutti et al. 2008; Dey et al. 2022). Furthermore, in studies conducted at the catchment scale, such as those examining the impacts of climate change on water resources, a scale mismatch remains, at times leading to unresolved climatic dynamics beyond the capabilities of RCM resolutions (Crawford et al. 2019). Consequently, these uncertainties can result in significant discrepancies in climate change projections between different RCMs, even when considering identical emission scenarios (Ruane and McDermid 2017). This, coupled with the scale mismatch that introduces limitations in climate representation, hampers the effective utilisation of this data for catchment-scale planning and water resource management (Venkataraman et al. 2016).

Impact modellers employ a wide array of methods of varying complexity to tackle these uncertainties and errors. These methods range from identifying the best-performing simulations within the study area (Crawford et al. 2019; Xu et al. 2020), through bias correction with observational data (Dobor and Hlásny 2019; Teng et al. 2015; Piani et al. 2010), to the development of Multi-Model Ensembles (MMEs) (Calì Quaglia et al. 2022; Salman et al. 2018). Bias correction methods have been instrumental in rectifying the systematic biases inherent in simulations (Piani et al. 2010). Nevertheless, they often prove less effective in addressing non-stationary biases (White and Toumi 2013; Wang et al. 2018). A promising avenue for addressing the uncertainty of climate models lies in the development of MMEs, which have the potential to reduce uncertainties and enhance confidence in climate projections (Pavan and Doblas-Reyes 2000; Lutz et al. 2016; Sanderson et al. 2015; Keller et al. 2019). MMEs fall into two groups: the Simple Ensemble Mean (SEM) and the Weighted Ensemble Mean (WEM). In the former, all ensemble members are assigned equal weights, whereas in the latter each member is allocated a distinct weight determined by its skill in replicating past climate conditions (Oh and Suh 2017; Ahmed et al. 2020). SEM is simple and commonly used, and it generally performs better than individual members (Lambert and Boer 2001). However, it comes with certain limitations. Many models share parameterizations and components, which can lead to interdependencies between climate simulations (Sanderson et al. 2015). Failing to account for this interdependence may result in misleading model consensus, reduced accuracy, and a flawed estimation of uncertainty (Herger et al. 2018). Moreover, SEM may not be suitable for all applications, as it significantly diminishes the spatial and temporal variability of information when compared to individual members and observational data (Wang et al. 2018).

In contrast, WEM methods have demonstrated their capacity to mitigate the impact of systematic biases within individual members and even enhance the ensemble’s predictive capabilities (Krishnamurti et al. 1999, 2000). The use of Machine Learning algorithms to generate a Multi-Model Ensemble (ML–MME) is an emerging technique in climate simulation (Zhu et al. 2023; Sand et al. 2023). These algorithms have significant potential to enhance the outcomes of climate simulations, especially because of their ability to deal with non-linearity between response variables and predictors (Ahmed et al. 2020; Sachindra et al. 2018; Xu et al. 2020). Krishnamurti et al. (1999) established a precedent with an MME based on multiple regression techniques to improve the 850 hPa meridional wind speed and precipitation simulations of eight general circulation models, obtaining superior results over the ensemble mean. Wang et al. (2018) employed four Machine Learning (ML) techniques to develop MMEs for mean monthly temperature and mean monthly precipitation from 33 CMIP5 GCMs over Australia and reported that Random Forest (RF) and Support Vector Machine (SVM) demonstrated a significant improvement over the ensemble mean. This agrees with the results reported by Sa’adi et al. (2017), who employed a Generalised Linear Model (GLM) to construct their MMEs and obtained better results for the MMEs than for the 20 individual CMIP5 GCMs over Borneo Island, Malaysia. Results along these lines have been reported in studies in Iraq for monthly mean temperature (Salman et al. 2018), in Pakistan for monthly precipitation (Ahmed et al. 2020), and in the Gulf Basin and North America for both (Crawford et al. 2019). Daily scale studies also show favourable results for ML–MME techniques (Jose et al. 2022). Likewise, Dey et al. (2022) obtained significant improvements in the characterisation of these climate variables with data from CMIP6 GCMs.

In our study, we ventured into a novel approach by applying various ML–MME methods to RCMs for the first time. These methods were then further applied to a hydrological model. We subjected them to a comparative analysis against the SEM (Simple Ensemble Mean) approach, focusing on monthly precipitation (pr), the monthly average of daily maximum temperature (tmax), and the monthly average of daily minimum temperature (tmin). Specifically, the ML–MME techniques encompassed Linear Regression (LR), Gradient Boosting (GB), and Random Forest (RF). This investigation is particularly noteworthy as we apply it to a complex topography region, which adds a layer of novelty to our research given the challenges it presents for simulation (Torma et al. 2015; Reder et al. 2020). First, a ranking of the RCMs has been developed based on their skill to characterize the past climate and the optimal number of RCMs to be included in the ML-MMEs has been determined. Once the final ML–MMEs for the three variables have been defined, the monthly series were analysed in detail by comparing them with the climate observations. To illustrate the practical utility of the ML–MMEs in the application of impact studies at the watershed scale, we employed them as input data for the Temez hydrological model, both for historical periods and future climate projections within the study area.

2 Data and Study Area

We considered the EURO-CORDEX ensemble (Jacob et al. 2014, 2020), with a total of 72 RCM simulations (Table 1) at a spatial resolution of \(0.11^\circ \times 0.11^\circ\). These simulations cover a time period of 130 years, from 1980 to 2100 for RCP8.5, and are based on the combination of two models, the RCM and its driving GCM, forming an incomplete matrix of 12 RCMs and 8 GCMs.

The CLIMPY observational dataset (Cuadrat et al. 2020) is used as reference. It has a spatial resolution of 1 km \(\times\) 1 km, is provided on a daily basis and covers the period 1980–2015. It is a reconstruction (Serrano-Notivoli et al. 2017) of the variables based on the information from 1,343 meteorological stations located in Spain, France, and Andorra. This dataset was created in the framework of the transboundary project CLIMPY and has already been validated in different studies (Amblar-Francés et al. 2020; Lemus-Canovas et al. 2019). For a proper comparison between simulations and observations, both must be on the same grid. Thus, CLIMPY has been bilinearly interpolated onto the rectilinear \(0.11^\circ \times 0.11^\circ\) grid of the RCMs.
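As a minimal sketch of the regridding step, the function below evaluates a bilinear interpolation of a gridded field at a single target point. The function name `bilinear` and the toy 2 × 2 field are illustrative assumptions; the actual regridding tool used for CLIMPY is not specified here.

```python
def bilinear(field, xs, ys, x, y):
    """Bilinearly interpolate field[j][i] (row j at ys[j], column i at xs[i]) at (x, y).

    xs and ys are assumed to be strictly ascending coordinate lists.
    """
    if not (xs[0] <= x <= xs[-1] and ys[0] <= y <= ys[-1]):
        raise ValueError("target point lies outside the source grid")
    # Index of the lower-left corner of the source cell containing (x, y).
    i = max(k for k in range(len(xs) - 1) if xs[k] <= x)
    j = max(k for k in range(len(ys) - 1) if ys[k] <= y)
    # Fractional position of the target inside that cell (0 to 1 in each direction).
    tx = (x - xs[i]) / (xs[i + 1] - xs[i])
    ty = (y - ys[j]) / (ys[j + 1] - ys[j])
    # Interpolate along x on the lower and upper cell edges, then along y.
    lower = (1 - tx) * field[j][i] + tx * field[j][i + 1]
    upper = (1 - tx) * field[j + 1][i] + tx * field[j + 1][i + 1]
    return (1 - ty) * lower + ty * upper


# Toy example: a 2 x 2 source field evaluated at the cell centre.
toy_field = [[0.0, 1.0], [2.0, 3.0]]
centre_value = bilinear(toy_field, [0.0, 1.0], [0.0, 1.0], 0.5, 0.5)
```

Applying the function to every centre of the \(0.11^\circ\) grid would yield the regridded field for one time step.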

The Esca River basin is located in the western Pyrenees, northeastern Spain, and covers an area of 425 \(\hbox {km}^2\), which corresponds to four grid-cells of the climate datasets. Characterised by a large altitudinal gradient, the elevation of the highest point of the basin is 2,100 m, while its lowest point is 595 m above sea level. These orographic characteristics make the climate dynamics of this type of basin remarkably difficult to simulate (Kotlarski et al. 2014; Smiatek et al. 2016). Such basins are therefore particularly problematic areas for accurately predicting the future climate and its related impacts on hydrology (Fatichi et al. 2016). It is important to make efforts to overcome these difficulties, particularly in cases such as the Esca river basin, since it is a key tributary feeding the Yesa reservoir, the primary reservoir in the western Pyrenees. Data on streamflows of the Esca river were available from the website of the Spanish Centre for Hydrographic Studies (CEDEX) (https://ceh.cedex.es/anuarioaforos/default.asp), where data are updated to September 2017.

The selection of the variables pr, tmax and tmin is motivated by two primary considerations. Firstly, they are available within the CLIMPY database (Cuadrat et al. 2020). Secondly, these variables are pivotal for characterizing the climate system, as emphasized in prior studies (Meehl et al. 2000; Perkins et al. 2007; Careto et al. 2022a, b), and play a crucial role in influencing diverse hydrological (Piani et al. 2010), biological, and industrial systems (Colombo et al. 1999; Coppola et al. 2021).

3 Methodology

This study follows a specific methodology, which progresses through several phases: (1) a ranking of the RCMs was developed according to the performance of the three analyzed variables (tmax, tmin, and pr) on a seasonal scale (Sect. 3.1), (2) the SEM and the ML–MMEs were constructed (Sect. 3.2), (3) the optimal number of RCMs to form the MMEs was chosen (Sect. 3.3), and (4) the definitive MMEs were evaluated (Sect. 3.4). Then, (5) to assess the impact of the climate variable MMEs on flow characterization, we utilized these MMEs as input data for the Temez hydrological model (Sect. 3.5). Finally, (6) as an illustrative example of the application of ML–MME results to climate change impact assessment, the definitive ML–MME algorithms were applied to the climate projections of the RCP8.5 emissions scenario.

Fig. 1 Outline of the stages of the data analysis process followed in the work

The methodology described in steps 1 to 4 follows the outline of data analysis processes (Berthold et al. 2010) presented in Fig. 1. It starts with a feature selection process aimed at eliminating noise-inducing features (RCMs) from the dataset, thus ensuring the development of a stable and reliable prediction model. This involves conducting an RCM ranking followed by the application of a filter-wrapper technique to identify the most suitable features. Once the optimal RCMs are selected, various ML models are generated by optimising their hyperparameters using cross-validation. Subsequently, MMEs of tmax, tmin and pr are generated using the developed ML algorithms. These MMEs were subjected to a statistical performance evaluation.

3.1 Ranking of RCMs

Within intelligent data analysis, one of the first phases is data pre-processing. In this instance, feature selection was applied to create an RCM ranking and to select the RCMs carrying the most relevant information for a reliable predictive model. The procedure followed is a filter-wrapper approach, which consists of two parts. Initially, a ranking was created using a quantitative measure (filter part), and subsequently the most relevant RCMs were selected (wrapper part, Sect. 3.3). The following procedure was applied to rank the RCMs according to their performance against the observational data. The time series of pr, tmax, and tmin were divided into the four seasons representative of the Atlantic climate of the study area, namely winter (DJF), spring (MAM), summer (JJA) and autumn (SON). For each variable and season, the Taylor Skill Score (TSS; Taylor 2001) was calculated (filter index). The TSS provides a quantitative measure of the ability of each RCM to simulate the variables pr, tmax, and tmin. It is based on the correlation and the ratio of the standard deviation of the RCMs against the observations of a given climate variable:

$$\begin{aligned} \text {TSS}=\frac{4{(1+R)}^4}{{(\sigma _f+1/\sigma _f)}^2 {(1+R_0)}^4} \; , \end{aligned}$$
(1)

where \(\sigma _f\) refers to the ratio of the standard deviation of the RCMs versus the observations, R refers to the Pearson correlation coefficient, and \(R_0\) represents the maximum value of the correlation, namely 1. TSS ranges from 0 to 1. A higher value indicates better simulation performance, while a lower value indicates worse performance. Based on the TSS results, twelve rankings were obtained, one per variable and season, which were taken into account to calculate the metric value rating RM (Ahmed et al. 2020):

$$\begin{aligned} \text {RM} =1-\frac{1}{nm}\sum _{i=1}^{m} \text {rank}_i \; , \end{aligned}$$
(2)

where n and m represent the number of RCMs and seasons, respectively, while \(\hbox {rank}_i\) refers to the number of the ranking corresponding to the member at the \(i^{th}\) season. Finally, the RCM members were ordered according to the RM. As a result, we obtained a ranking of the RCM models ordered from best to worst according to their performance in relation to observational data in the studied basin.
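The filter step can be sketched as follows. The function names (`tss`, `rank_members`, `rating_metric`) are illustrative, the population standard deviation is an assumption, and RM is computed by summing each member's seasonal ranks, which is how we read Eq. 2 given the definition of \(\hbox {rank}_i\).

```python
import math


def _std(x):
    """Population standard deviation (an assumption; the sample form would also work)."""
    mean = sum(x) / len(x)
    return math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))


def _pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def tss(sim, obs, r0=1.0):
    """Taylor Skill Score of Eq. 1 for one RCM, variable and season."""
    r = _pearson(sim, obs)
    sigma_f = _std(sim) / _std(obs)  # ratio of simulated to observed spread
    return 4 * (1 + r) ** 4 / ((sigma_f + 1 / sigma_f) ** 2 * (1 + r0) ** 4)


def rank_members(scores):
    """Map member name -> rank, with rank 1 for the highest TSS (ties not handled)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def rating_metric(member_ranks, n_members):
    """RM of Eq. 2 for one member, given its rank in each of the m seasons."""
    m = len(member_ranks)
    return 1 - sum(member_ranks) / (n_members * m)
```

A member that ranks first in every season among n members obtains RM = 1 - 1/n, so values closer to 1 indicate better overall performance.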

3.2 Development of SEM and ML–MME Algorithms

After developing the ranking of the RCMs, the structure and characteristics of the MMEs were designed. In the first place, when formulating ML–MME algorithms, it is crucial to account for the seasonal dynamics inherent in the variables. This consideration enhances the algorithms’ ability to discern patterns of variability. Due to the evident interannual temperature dynamics in our mid-latitude region, we have opted to consider the seasons independently, specifically for tmin and tmax, when constructing the ML–MME algorithms (Morales-García et al. 2023; Ahmed et al. 2020). Conversely, with precipitation, we have pursued an alternative strategy. Given the complex nature of this variable and the alterations observed in the annual cycle over recent decades in European mid-latitudes (Christidis and Stott 2022; Paluš et al. 2005), establishing clear seasonal patterns becomes a more intricate task. Designing ML algorithms solely based on the seasons might prove misguided, potentially hindering the algorithms from accurately capturing the variable’s behaviour.

To address this complexity and the imbalance of the data, we have chosen to categorise monthly precipitation events into two subgroups (Chao et al. 2018): those exceeding the \(80^{th}\) percentile and those below it according to observational data. Through the separation of precipitation into two distinct databases, the range of the variable was reduced, leading to increased accuracy in the results obtained by the ML models. Following this rationale, each ML–MME technique has resulted in four algorithms for tmax and four for tmin, one per season. Additionally, two algorithms have been generated for precipitation: one for events within the 0–80 percentile interval and another for events in the 80–100 percentile interval.
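The percentile-based partition of the precipitation data can be illustrated with the sketch below. The helper names and the linear-interpolation percentile are assumptions, and each sample is taken to be a tuple of RCM values paired with the observed value for the same month.

```python
def percentile(values, q):
    """Percentile q (0 to 100) using linear interpolation between order statistics."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def split_by_percentile(obs, predictors, q=80):
    """Split (predictors, observation) pairs at the q-th percentile of the observations.

    Returns the threshold, the samples at or below it and the samples above it,
    so that a separate model can be trained on each subset.
    """
    threshold = percentile(obs, q)
    low, high = [], []
    for y, x in zip(obs, predictors):
        (high if y > threshold else low).append((x, y))
    return threshold, low, high
```

Each subset then feeds its own regression model, and the predictions are merged back into a single monthly series.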

Different methods were applied to construct the MME on a monthly scale, including, on the one hand, SEM, and on the other hand, three ML techniques: RF, GB and LR. The first MME development technique is the SEM, commonly and widely used for MME calculation (Clark 2017). The remaining three techniques are more elaborate and are based on ML regression models. These three techniques are detailed below:

  • Random Forest (RF). RF is a machine learning technique that combines tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees. The algorithm for inducing a random forest was developed by Breiman (2001). It is a substantial modification of bagging, the ensemble learning method typically used to reduce the variance within a noisy data set: RF builds a large collection of de-correlated trees and then averages them. The RF method combines the idea of bagging with random attribute selection to build a collection of decision trees with controlled variation. The selection of a random subset of attributes is an example of the random subspace method, a way to perform stochastic discrimination (Breiman 2001).

  • Gradient Boosting (GB). GB is a machine learning technique for regression analysis and statistical classification problems based on boosting. Boosting consists of combining the results of several weak classifiers to obtain a robust classifier. The weak classifiers are added with different weights depending on the accuracy of their predictions. After a weak classifier is added, the data changes its weight structure: cases that are misclassified gain weight and those that are correctly classified lose weight. Thus, each new weak classifier focuses more strongly on the cases that were misclassified by the previous ones. The GB technique creates a predictive model based on weak prediction models, usually decision trees. The resulting ensemble of prediction models yields a satisfactory prediction that in some cases outperforms the random forest ensemble (Bentéjac et al. 2021).

  • Linear Regression (LR). LR is a supervised learning algorithm used in machine learning and statistics. In its simplest version, it calculates a line that indicates the trend of a continuous data set. LR can be defined as an approach to model the relationship between a dependent scalar variable and one or more explanatory variables. The LR technique minimises a quadratic error cost function, and the coefficients that achieve the minimum define the optimal line. There are several methods to minimise the cost. The most common is to use a vectorised formulation and the so-called Normal Equation, which gives a direct result (Weisberg 2005).

The hyperparameters of the machine learning techniques were selected by means of a grid search with cross-validation, which sweeps through all candidate parameter combinations and retains the best-performing one.
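To make the contrast between SEM and a regression-based MME concrete, the sketch below computes the equal-weight mean and fits LR weights through the Normal Equation \(X^{T}Xw = X^{T}y\). It is a dependency-free illustration with hypothetical function names, not the implementation used in this study; RF and GB would replace the linear fit with tree ensembles from an ML library.

```python
def simple_ensemble_mean(members):
    """Equal-weight mean of the member series (one list per RCM)."""
    return [sum(values) / len(values) for values in zip(*members)]


def _solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_linear_mme(members, obs):
    """Least-squares weights [intercept, w_1, ..., w_k] from the Normal Equation.

    Fails with a division by zero if member series are exactly collinear.
    """
    rows = [[1.0, *values] for values in zip(*members)]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, obs)) for i in range(p)]
    return _solve(xtx, xty)


def predict_linear_mme(weights, members):
    """Apply fitted weights to member series to obtain the weighted ensemble."""
    return [
        weights[0] + sum(w * v for w, v in zip(weights[1:], values))
        for values in zip(*members)
    ]
```

Unlike SEM, the fitted weights need not be equal or sum to one, and the intercept absorbs any constant bias shared by the members.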

3.3 Selection of RCMs

After the RCM ranking was completed and the MMEs characteristics defined, the process of selecting the optimal number of RCMs to be considered when creating the MMEs for each variable (tmax, tmin and pr) was initiated. This process is the wrapper part of feature selection presented in Fig. 1. The MMEs were developed considering the RM-based rank of RCMs from 1 to 40 (Table 1). Initially, only the outputs of the RCM with a rank of 1 were used to provide inputs to the MME. Subsequently, the outputs of the RCM with a rank of 2 were added to the input set, followed by the incremental introduction of RCMs with overall ranks 3, 4, 5... 40 into the input set, one RCM at a time. This approach, known as the top-ranked approach (Ahmed et al. 2020), started with the best-performing RCM (rank 1) and progressed with subsequent RCMs in ascending order of their RM-based rank.

The performance of the MME outputs, generated with varying numbers of RCMs, has been evaluated on the reconstructed time series. This reconstruction has been carried out by merging the MME results obtained separately for each season (tmax, tmin) or percentile interval (pr), as described in Sect. 3.2, back into a single time series.

The evaluation metric was the Modified Index of Agreement (md, Eq. 3), which was initially proposed by Willmott (1981) and has since been widely applied (Ahmed et al. 2020). It ranges from 0 to 1, with higher values indicating a better fit of the model:

$$\begin{aligned} md=1-\frac{\sum _{i=1}^{n}|x_{\text {obs},i}-x_{\text {sim},i}|^j}{\sum _{i=1}^{n}(|x_{\text {sim},i}- {\bar{x}}_{\text {obs}}|+|x_{\text {obs},i}- {\bar{x}}_{\text {obs}}|)^j} \; , \end{aligned}$$
(3)

where \(x_{\text {sim},i}\) and \(x_{\text {obs},i}\) are the \(i^{th}\) data points of the simulated and observed series of a climate variable, respectively, \({\bar{x}}_{\text {obs}}\) is the mean of the observed series, and j is the exponent applied to the differences. It has been calculated for the four grid cells considered in this study.
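A minimal sketch of the md computation is given below. It takes absolute differences in the numerator, as in the usual modified form of the index, and defaults to j = 1; both choices are assumptions on our part rather than settings reported above.

```python
def modified_index_of_agreement(obs, sim, j=1):
    """Modified index of agreement between observed and simulated series."""
    mean_obs = sum(obs) / len(obs)
    # Accumulated absolute error between the two series.
    numerator = sum(abs(o - s) ** j for o, s in zip(obs, sim))
    # Potential error: distances of both series from the observed mean.
    denominator = sum(
        (abs(s - mean_obs) + abs(o - mean_obs)) ** j for o, s in zip(obs, sim)
    )
    return 1 - numerator / denominator
```

A perfect simulation gives md = 1, whereas a simulation that is constant at the observed mean gives md = 0.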

With this procedure, all RCMs are incorporated into the MMEs one at a time. The cut-off point is then set at the RCM from which the md metric starts to worsen or an overfitting issue is observed. This indicates that, from that RCM onwards, the additional RCMs contribute more noise than useful information.
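The wrapper step can be sketched as a loop over the ranked members. For brevity the ensemble is built here with an equal-weight mean; in the study the builder would be one of the fitted ML models. The function names and the tolerance-based stopping rule are illustrative assumptions.

```python
def md(obs, sim):
    """Modified index of agreement with exponent 1 (absolute differences assumed)."""
    mean_obs = sum(obs) / len(obs)
    num = sum(abs(o - s) for o, s in zip(obs, sim))
    den = sum(abs(s - mean_obs) + abs(o - mean_obs) for o, s in zip(obs, sim))
    return 1 - num / den


def ensemble_mean(members):
    """Stand-in ensemble builder: equal-weight mean of the member series."""
    return [sum(values) / len(values) for values in zip(*members)]


def top_ranked_curve(ranked_members, obs, build=ensemble_mean):
    """md of the ensemble built from the k best members, for k = 1..len(ranked_members)."""
    return [
        md(obs, build(ranked_members[:k]))
        for k in range(1, len(ranked_members) + 1)
    ]


def cutoff(curve, tol=0.0):
    """Number of members to keep: stop before the first member that lowers md."""
    for k in range(1, len(curve)):
        if curve[k] < curve[k - 1] - tol:
            return k
    return len(curve)


# Toy example: the third member is far from the observations and degrades md.
observed = [1.0, 2.0, 3.0, 4.0]
ranked = [[1.0, 2.0, 3.0, 5.0], [1.0, 2.0, 3.0, 3.0], [10.0, 10.0, 10.0, 10.0]]
learning_curve = top_ranked_curve(ranked, observed)
members_kept = cutoff(learning_curve)
```

This automatic rule does not detect overfitting, which still has to be judged by inspecting the learning curve on independent data.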

3.4 Evaluation of SEM and ML–MME Algorithms

Once the selection phase was completed and the definitive MMEs were built, the evaluation was carried out. The data were divided chronologically into training and testing sets, representing 80 % and 20 % of the data, respectively. Therefore, the training phase covered the period of 1980–2006 while the test phase covered the period of 2007–2015. Notably, data from all four grid cells have been used to feed the algorithms. Moreover, the evaluation was carried out with three additional metrics commonly used in the characterisation of time series similarities: the coefficient of determination (\(R^2\)), the root-mean-square error (RMSE), and the root mean square percentage error (RMSEPE).
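The chronological split and the three additional metrics can be sketched as below. Two choices are assumptions on our part: \(R^2\) is taken as the squared Pearson correlation, and RMSEPE is expressed as a fraction of the observed value, which requires non-zero observations.

```python
import math


def chronological_split(series, train_fraction=0.8):
    """Split a time-ordered series into training and test parts without shuffling."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


def r_squared(obs, sim):
    """Coefficient of determination, taken here as the squared Pearson correlation."""
    mo, ms = sum(obs) / len(obs), sum(sim) / len(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_s = sum((s - ms) ** 2 for s in sim)
    return cov ** 2 / (var_o * var_s)


def rmse(obs, sim):
    """Root-mean-square error, in the units of the variable."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))


def rmsepe(obs, sim):
    """Root mean square percentage error as a fraction (observations must be non-zero)."""
    return math.sqrt(sum(((o - s) / o) ** 2 for o, s in zip(obs, sim)) / len(obs))
```

Keeping the split chronological means that the test metrics measure skill on years the models have not seen during training.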

3.5 Application of ML–MME data to Temez Hydrological Model

The Temez model (Témez 1977), extensively applied in Spanish watersheds (Pérez-Sánchez et al. 2019; Escriva-Bou et al. 2017; Chavez-Jimenez et al. 2013; García-Barrón et al. 2015; Jódar et al. 2017; Marcos-Garcia et al. 2017; Senent-Aparicio et al. 2018), falls within the category of aggregated watershed simulation models (Estrela 1992). Operating from the onset of rainfall to the initiation of runoff and subsequent discharge into rivers, the Temez model manages moisture balances across interconnected processes within a hydrological system. Input variables for the Temez model encompass the spatial average monthly precipitation for the entire basin and Potential Evapotranspiration (ETP). In line with the current investigation’s focus on monthly climate data, ETP was determined using the Thornthwaite method (Thornthwaite 1948).
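For reference, the unadjusted Thornthwaite estimate can be sketched as follows. The sketch omits the day-length and month-length correction of the full method, and deriving the monthly mean temperature as the average of tmax and tmin is an assumption of this example rather than a step documented above.

```python
def monthly_mean_temperature(tmax, tmin):
    """Approximate monthly mean temperature as the midpoint of tmax and tmin (assumption)."""
    return [(hi + lo) / 2 for hi, lo in zip(tmax, tmin)]


def thornthwaite_pet(monthly_tmean):
    """Unadjusted Thornthwaite PET (mm/month) for twelve monthly mean temperatures in deg C.

    The standard method further multiplies each value by a factor that depends on
    day length and the number of days in the month; that correction is omitted here.
    """
    if len(monthly_tmean) != 12:
        raise ValueError("expected twelve monthly mean temperatures")
    # Months with negative mean temperature contribute no heat and yield zero PET.
    positive = [max(t, 0.0) for t in monthly_tmean]
    heat_index = sum((t / 5.0) ** 1.514 for t in positive)
    if heat_index == 0:
        return [0.0] * 12
    exponent = (
        6.75e-7 * heat_index ** 3
        - 7.71e-5 * heat_index ** 2
        + 1.792e-2 * heat_index
        + 0.49239
    )
    return [16.0 * (10.0 * t / heat_index) ** exponent for t in positive]
```

The resulting twelve values, together with basin-average monthly precipitation, form the inputs required by the Temez model.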

We assessed the hydrological model’s outcomes based on four widely adopted evaluation criteria in hydrological research (Jimeno-Sáez et al. 2018). These criteria include the Nash–Sutcliffe Efficiency coefficient (NSE), the percent bias (PBIAS), the Pearson correlation coefficient (r), and the Kling–Gupta Efficiency coefficient (KGE).
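The four criteria can be written compactly as in the sketch below. Sign conventions for PBIAS differ between references; here positive values indicate underestimation, which is an assumption of this example, and KGE is given in its original three-component form.

```python
import math


def _mean(x):
    return sum(x) / len(x)


def _std(x):
    m = _mean(x)
    return math.sqrt(sum((v - m) ** 2 for v in x) / len(x))


def pearson_r(obs, sim):
    """Pearson correlation coefficient between observed and simulated flows."""
    mo, ms = _mean(obs), _mean(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim)) / len(obs)
    return cov / (_std(obs) * _std(sim))


def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 is perfect, 0 matches the observed mean."""
    mo = _mean(obs)
    residual = sum((o - s) ** 2 for o, s in zip(obs, sim))
    return 1 - residual / sum((o - mo) ** 2 for o in obs)


def pbias(obs, sim):
    """Percent bias; positive values mean the simulation underestimates (assumed sign)."""
    return 100 * sum(o - s for o, s in zip(obs, sim)) / sum(obs)


def kge(obs, sim):
    """Kling-Gupta Efficiency from correlation, variability ratio and bias ratio."""
    r = pearson_r(obs, sim)
    alpha = _std(sim) / _std(obs)
    beta = _mean(sim) / _mean(obs)
    return 1 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
```

NSE and KGE both reach 1 for a perfect simulation, while PBIAS reaches 0, so the criteria can be read side by side when comparing the streamflow obtained from each MME.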

After the evaluation of the SEM and the three ML–MME techniques, the algorithms were applied to the long-term future climate projections of the RCP8.5 emission scenario, and the results were used as input data for simulating future streamflow.

4 Results and Discussion

4.1 Ranking of RCMs

Table 1 presents the RCM rankings based on TSS across the DJF, MAM, JJA, and SON seasons for the variables tmin, tmax, and pr. Notably, substantial variations emerge among seasons and variables. In certain instances, an RCM that excels in simulating one variable and season finds itself at the lower end of the ranking when compared to other variables and seasons. A case in point is IPSL–RCA4 (Code 33), which stands out as the top performer in simulating precipitation during SON and JJA, as well as maximum temperature in SON. However, it exhibits inefficiencies in comparison to other RCM members when simulating precipitation in DJF and MAM (Kotlarski et al. 2014).

A notable observation is the strong influence of the driving GCM on the ranking position, which is in line with Vautard et al. (2021), who established that some variables are conditioned by large-scale boundary conditions defined by the GCMs. For instance, RCM members driven by the MPI–ESM–LR GCM consistently achieve the highest RM values (Table 1), indicating superior overall performance. This aligns with findings from Brands et al. (2013), underscoring the GCM’s excellent ability to simulate precipitation over European mid-latitudes. A poor RCM performance, however, can also have a significant impact on the simulation, as in the case of models 60 and 48, which occupy poor positions in the ranking despite being driven by MPI–ESM–LR. In the same way, RCMs driven by CNRM–CM5 also rank high, because they are able to adequately characterise the temperatures (McSweeney et al. 2015). Conversely, a GCM with deficiencies in simulating climate conditions adversely affects the ranking of the RCMs that it drives. Such is the case with MOHC–HadGEM2, which exhibits notable biases in the representation of climate variables. Consequently, RCMs driven by MOHC–HadGEM2 attain lower positions across all variables and seasons.

Table 1 Individual ranks of RCMs for each variable and season (DJF, MAM, JJA and SON) based on TSS and overall ranks of RCMs based on RM values according to their ability to simulate CLIMPY monthly precipitation (pr), monthly average of daily maximum (tmax) and minimum temperature (tmin) over the study area over the period 1980-2015

4.2 Selection of the Optimal Number of RCMs

To determine the optimal number of RCMs to include in further analyses, we examined the ML–MME learning curve. All the machine learning techniques described previously have been used to select the number of RCMs. In Fig. 2, the md values, relative to observations, are plotted against the number of RCMs used to construct the SEM and the ML–MMEs. The incorporation order of RCMs follows a top-ranked approach (Ahmed et al. 2020). Notably, the md values increase substantially while fewer than three RCMs are included and then stabilise towards an asymptote for most ML techniques across all variables and periods. An exception is observed with GB, where, beyond a certain number of RCMs (16 for pr, 35 for tmax and 25 for tmin), the md values approach 1. This indicates overfitting (Ying 2019; Dietterich 1995).

Fig. 2 md vs. the number of RCMs for precipitation (pr), maximum temperature (tmax) and minimum temperature (tmin). The shaded area represents the standard deviation of the four grids of the mesh

Upon closer examination of individual variables, precipitation stands out with notable differences between SEM and ML–MME. SEM records md values near 0.4, while ML–MME techniques yield values ranging from 0.6 to 0.8 (excluding the overfitting case of GB). For temperature variables, the initial md is higher, approximately 0.6, indicating that RCMs exhibit a greater capacity to replicate monthly temperature patterns compared to precipitation. This difference arises due to the higher complexity inherent in the dynamics of precipitation, which poses challenges for numerical models to simulate accurately (Perkins et al. 2007; Aghakhani Afshar et al. 2017), specifically affecting RCMs (Vautard et al. 2021; Herrera et al. 2020; Kotlarski et al. 2014). While improvements are observed in temperature variables with ML–MME, the contrast in md values is less pronounced, particularly for minimum temperature.

After reviewing how the results improve with the number of RCMs, and recognising a plateau after the initial progress, we opted to include a total of seven RCMs. This decision is also motivated by the need to avoid overfitting, as observed with GB for the tmin variable, while maintaining a balance between model complexity and predictive performance. The number of models used aligns with the findings of Dey et al. (2022), who, following a pre-selection process, incorporated 5 models into their analysis. Likewise, Ahmed et al. (2020) achieved comparable results in their precipitation analysis, drawing on data generated by 7–10 high-performing models.

Fig. 3 Taylor diagrams of the spatial average of the variables precipitation (pr), maximum temperature (tmax) and minimum temperature (tmin) for the training (1980–2006) and test (2007–2015) periods

4.3 Evaluation of SEM and ML–MMEs

Figures 3, 4, and 5 offer an assessment of the SEM and ML–MME results relative to CLIMPY observations for the variables pr, tmax, and tmin. To enhance result clarity, we focused on evaluating the spatial average of pr, tmax, and tmin within the study area. Notably, in the first column of Fig. 3, the Taylor diagram for precipitation during both the training and test periods indicates substantial enhancements resulting from ML–MME application compared to SEM. Among the ML–MME methods, RF and LR yield comparable outcomes, while GB achieves the most favourable results at the annual scale for both training and test periods.

Fig. 4 Spatially averaged observed precipitation and simulated precipitation time series and evaluation metrics (SEM and ML–MME) for the training (1980–2006) and test (2007–2015) periods

Concerning the spatial average of temperatures, Taylor diagrams do not reveal appreciable improvements. Both SEMs of tmin and tmax already exhibit statistics indicative of a robust representation of monthly temperatures in the study area, attributed to the high-quality simulations of the pre-selected RCMs (Table 1). The exceptional starting point of RCMs’ simulation quality may limit the potential enhancement capacity that ML–MMEs could offer.

For a more detailed analysis of precipitation performance, Fig. 4 presents monthly time series plots of the spatial average results for SEM and ML–MME. The improvement across all ML–MMEs in comparison to SEM is evident. Whereas SEM exhibited a fit close to zero, high RMSE, and md below 0.5 in both periods, all ML–MME techniques demonstrate significantly improved performance, indicating their superior ability to simulate monthly precipitation patterns. Notably, GB achieves the best md results, with values of 0.88 and 0.75 for the training and test periods, respectively. RF, however, is not far behind, with an \(R^2\) in the test period of 0.80, surpassing GB’s 0.75. Despite LR showing higher RMSE values (around 44 mm/month) and a lower capacity to detect precipitation minima and maxima, the ML–MME based on LR markedly improves the representation of the study area’s precipitation compared to SEM. These results are in line with those obtained in several studies (Acharya et al. 2014; Salman et al. 2018; Li et al. 2021). For instance, Dey et al. (2022) developed ML-based MME approaches for CMIP6 in an Indian river basin and found that the RF-based ML–MME outperformed SEM. In the same vein, Jose et al. (2022) identified RF as the most suitable ML model for creating MMEs and simulating past observed climate variables in a tropical river basin in India. In addition to studies conducted at basin scales, ML–MME approaches have also been applied at broader spatial scales. This is the case of Wang et al. (2018), who applied SEM, BMA, RF, and SVM with CMIP5 data over Australia, concluding that RF and SVM could generate better-performing MMEs compared to SEM and BMA.

Figure 5 provides a thorough evaluation of SEM, ML–MMEs, and the seven individual RCMs, both at the annual and seasonal scales during the test period. Notably, when comparing SEM with the ML–MME techniques, a widespread enhancement is observed, particularly in precipitation. For instance, the DJF season, which records the lowest md values (around 0.2) for individual RCMs, sees substantial improvement with ML–MME techniques, elevating md to approximately 0.55 for RF and LR, and surpassing 0.70 for GB. This improvement is consistent across all seasons and holds true for annual data as well. Similarly, \(R^2\) and RMSE exhibit substantial enhancements across the board for precipitation. The coefficient \(R^2\), which occasionally dips to 0 for certain RCMs and seasons, consistently remains above 0.6 for all seasons and ML–MME techniques, reaching annual values of 0.8. The RMSEPE, expressed as a fraction, which exceeds 3 in some individual RCMs, is consistently below 1 for all ML–MME cases. This marked improvement in the characterization of precipitation at both seasonal and annual levels, evidenced by the three metrics analyzed in the study region, represents a qualitative advantage of ML–MMEs over individual RCM members. This enhancement could benefit regional planning, including water and agricultural management, as well as climate risk preparedness, among others.
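As a rough illustration of two of the metrics discussed above, the following sketch computes RMSE and the modified index of agreement (md). The md formula used here is the commonly cited absolute-value variant (exponent 1); it is an assumption that this matches the definition given in Sect. 3.4. The function names and sample values are hypothetical.

```python
import math


def rmse(obs, sim):
    """Root mean square error between observed and simulated series."""
    return math.sqrt(sum((s - o) ** 2 for o, s in zip(obs, sim)) / len(obs))


def modified_index_of_agreement(obs, sim):
    """md = 1 - sum|O - S| / sum(|S - mean(O)| + |O - mean(O)|).

    Assumed form (absolute-value variant, exponent 1); 1 means perfect agreement.
    """
    mean_obs = sum(obs) / len(obs)
    numerator = sum(abs(o - s) for o, s in zip(obs, sim))
    denominator = sum(abs(s - mean_obs) + abs(o - mean_obs) for o, s in zip(obs, sim))
    return 1.0 - numerator / denominator


# Hypothetical monthly precipitation values (mm/month), for illustration only.
observed = [10.0, 50.0, 90.0, 30.0]
wet_biased = [o + 20.0 for o in observed]  # a simulation with a constant wet bias

print(rmse(observed, wet_biased))
print(modified_index_of_agreement(observed, wet_biased))
```

A constant bias leaves the timing of minima and maxima intact, yet it lowers md and raises RMSE, which is why both metrics are useful alongside correlation-type measures.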

For temperatures, while no notable seasonal improvement is evident in r and md, annual values display enhancement for both tmax and tmin. However, the improvement in simulation quality, even at the seasonal scale, is manifested as a decrease in the RMSE values. Individual RCMs exhibit RMSE values ranging from \(2.0~^{\circ }\text {C}\) to \(5.2~^{\circ }\text {C}\) for tmax. Post-application of ML–MME techniques, RMSE is drastically reduced, with values between \(0.8~^{\circ }\text {C}\) and \(3~^{\circ }\text {C}\). A parallel behaviour is observed for tmin. This improvement in temperature representation is of particular interest in an area like the analyzed study region, where the presence of snow and snowmelt processes are key factors directly dependent on temperatures, greatly influencing regional management.

Fig. 5 Heat maps representing the md, \(R^2\), RMSE (tmax, tmin) and RMSEPE (pr) obtained from the comparison of the observations versus the SEM, the ML–MMEs and the individual RCMs for the test (2007–2015) period

In each examined case, MMEs consistently outperform individual members, even when represented by the least effective MME, SEM. This observation is supported by numerous studies that emphasise the MME’s ability to enhance individual member performance and reduce climate output uncertainties. Notable analyses include regions such as India (Gusain et al. 2019), the USA (Srivastava et al. 2020), China (Zhuang et al. 2016), and Europe (Evin et al. 2021). Additionally, our results indicate that ML–MME exhibits superior performance to SEM, particularly for precipitation, as depicted in Figs. 4 and 5. This finding underscores the ML–MME’s relevance at the catchment scale. The enhanced performance of ML–MME over SEM may be attributed to the capacity of ML approaches to address nonlinear, high-dimensional correlations between climate model outputs and observational datasets (Dey et al. 2022). Moreover, as highlighted by Li et al. (2021), ML–MME algorithms may be able to capture detailed information at local scales because high-resolution observations are incorporated in their construction.
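The contrast between SEM and a regression-based MME can be sketched on synthetic data. The example below is a minimal illustration, not the implementation used in this study: it assumes NumPy is available, builds seven hypothetical ensemble members with invented biases and noise, and fits only the linear regression (LR) variant by ordinary least squares. RF and GB would follow the same fit/predict pattern with a different regressor.

```python
import numpy as np


def simple_ensemble_mean(members):
    """SEM: equal-weight average over the member axis (columns)."""
    return members.mean(axis=1)


def fit_linear_mme(members, obs):
    """Least-squares intercept and member weights mapping members to observations."""
    design = np.column_stack([np.ones(len(obs)), members])
    coef, *_ = np.linalg.lstsq(design, obs, rcond=None)
    return coef


def predict_linear_mme(coef, members):
    """Apply the fitted intercept (coef[0]) and weights (coef[1:])."""
    return coef[0] + members @ coef[1:]


def rmse(obs, sim):
    return float(np.sqrt(np.mean((sim - obs) ** 2)))


# Synthetic "observations": a seasonal cycle in mm/month (illustrative only).
rng = np.random.default_rng(0)
n_months, n_members = 120, 7
obs = 80.0 + 40.0 * np.sin(2.0 * np.pi * np.arange(n_months) / 12.0)

# Hypothetical members: each one rescales, biases, and adds noise to the signal.
scale = rng.uniform(0.5, 1.5, n_members)
bias = rng.uniform(10.0, 60.0, n_members)
members = obs[:, None] * scale + bias + rng.normal(0.0, 15.0, (n_months, n_members))

coef = fit_linear_mme(members, obs)
rmse_sem = rmse(obs, simple_ensemble_mean(members))
rmse_lr = rmse(obs, predict_linear_mme(coef, members))
print(rmse_sem, rmse_lr)
```

On the data used for fitting, the LR-based MME cannot have a higher RMSE than SEM, because equal weights with a zero intercept are one of the candidate solutions of the least-squares problem. That guarantee does not extend to unseen data, which is why a separate test period is needed to judge the ensembles.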

In this study, we successfully integrate the EURO-CORDEX RCMs, the climatic simulations with higher spatial resolution for the study area, with the strengths of ML mathematical algorithms. This combination holds promise for reducing uncertainty in basin-scale climate projections. In the following section (Sect. 4.4), we utilise the outputs of the ML–MME algorithms to feed a hydrological model within the Esca River basin.

4.4 Application of SEM and ML–MME Climate Data to Temez Hydrological Model

4.4.1 Temez Model Setup

For the model setup development, the simulation period was divided into two distinct phases: the calibration period, spanning from 1981 to 2000, and the subsequent validation period, covering 2001 to 2014. A warm-up year was introduced to attain a stable state for the Temez model. Calibration focused on adjusting four key parameters: \(H_{\text {max}}\) (maximum soil storage capacity), C (surplus starting coefficient), \(I_{\text {max}}\) (maximum infiltration) and \(\alpha\) (groundwater contribution coefficient). The first two parameters govern soil storage regulation, the third distinguishes surface runoff from groundwater runoff, and the fourth modulates subsurface drainage (Murillo and Navarro 2011). Table 2 presents the metrics described in Sect. 3.4 for the comprehensive assessment of hydrological simulation.
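To make the role of the four calibrated parameters concrete, the following sketch implements one monthly step of a commonly cited formulation of the Temez water balance. The equations are taken from the general literature and are an assumption here; the exact variant, units, and parameter values used in this study are not confirmed by the text, and all numbers below are hypothetical.

```python
import math


def temez_step(p, pet, h_prev, v_prev, h_max, c, i_max, alpha):
    """One monthly step (assumed formulation). Water terms in mm; alpha in 1/month.

    Returns soil moisture, aquifer storage, actual evapotranspiration, total runoff.
    """
    delta = h_max - h_prev + pet          # maximum possible soil uptake plus PET
    p0 = c * (h_max - h_prev)             # rainfall threshold before surplus starts
    if p <= p0:
        surplus = 0.0
    else:
        surplus = (p - p0) ** 2 / (p + delta - 2.0 * p0)

    available = h_prev + p - surplus      # water available to soil and evaporation
    et = min(available, pet)
    h = max(0.0, available - pet)

    # i_max caps infiltration; the remainder of the surplus is surface runoff.
    infiltration = i_max * surplus / (surplus + i_max)
    surface = surplus - infiltration

    # Linear-reservoir aquifer: alpha controls how fast storage drains.
    v = v_prev * math.exp(-alpha) + infiltration * math.exp(-alpha / 2.0)
    groundwater = v_prev + infiltration - v
    return h, v, et, surface + groundwater


def run_temez(precip, pet, h_max, c, i_max, alpha, h0=0.0, v0=0.0):
    """Run the step function over monthly series and return total runoff per month."""
    h, v, flows = h0, v0, []
    for p, e in zip(precip, pet):
        h, v, _, q = temez_step(p, e, h, v, h_max, c, i_max, alpha)
        flows.append(q)
    return flows


# Hypothetical forcing and parameter values, for illustration only.
flows = run_temez([120.0, 30.0, 0.0, 200.0], [20.0, 40.0, 90.0, 60.0],
                  h_max=150.0, c=0.3, i_max=100.0, alpha=0.1)
print(flows)
```

In this form, \(H_{\text {max}}\) and C decide when rainfall starts to generate surplus, \(I_{\text {max}}\) splits that surplus between surface runoff and recharge, and \(\alpha\) sets how quickly the aquifer releases stored water.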

Table 2 Calibration (1981–2000) and validation (2001–2014) results for the Temez hydrological model

According to the criteria established by Moriasi et al. (2007) and Brighenti et al. (2019), the performance of the model is satisfactory in both the calibration and validation periods: NSE and KGE exceed 0.5, and PBIAS reaches its largest magnitude in the calibration period (\(-\)12.76 %), remaining within the ±25 % threshold.
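The three statistics behind this assessment can be sketched as follows. The formulas are the standard textbook definitions of NSE, KGE (2009 form) and PBIAS. The PBIAS sign convention, in which negative values indicate overestimation, is assumed from Moriasi et al. (2007), and the sample flows are hypothetical.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is perfect, 0 matches the observed mean."""
    mean_obs = _mean(obs)
    residual = sum((o - s) ** 2 for o, s in zip(obs, sim))
    variance = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - residual / variance


def pbias(obs, sim):
    """Percent bias; with this (assumed) convention, negative means overestimation."""
    return 100.0 * sum(o - s for o, s in zip(obs, sim)) / sum(obs)


def kge(obs, sim):
    """Kling-Gupta efficiency from correlation, variability ratio and bias ratio."""
    mean_obs, mean_sim = _mean(obs), _mean(sim)
    std_obs = math.sqrt(_mean([(o - mean_obs) ** 2 for o in obs]))
    std_sim = math.sqrt(_mean([(s - mean_sim) ** 2 for s in sim]))
    cov = _mean([(o - mean_obs) * (s - mean_sim) for o, s in zip(obs, sim)])
    r = cov / (std_obs * std_sim)
    alpha = std_sim / std_obs
    beta = mean_sim / mean_obs
    return 1.0 - math.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)


# Hypothetical monthly flows; the simulation overestimates by 10 %.
observed = [10.0, 20.0, 30.0, 40.0]
simulated = [1.1 * o for o in observed]
print(nse(observed, simulated), kge(observed, simulated), pbias(observed, simulated))
```

For this 10 % overestimation, NSE stays high because the timing is perfect, while KGE and PBIAS both register the bias, which illustrates why the three statistics are read together.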

4.4.2 Evaluation of Streamflow for SEM and ML–MME Input Data

Starting from the calibrated and validated Temez model, the simulations described below were carried out to evaluate the impact of the MME-derived climate data, analysed in detail in Sect. 4.3, on the characterisation of streamflow. First, a monthly flow simulation, denoted as \(Q_{\text {sim-OBS}}\), was obtained by feeding the Temez model with precipitation observations and with the ETP derived from the tmax and tmin observations. Following the same approach, four additional flow simulations, subsequently identified as \(Q_{\text {sim-SEM}}\), \(Q_{\text {sim-GB}}\), \(Q_{\text {sim-LR}}\) and \(Q_{\text {sim-RF}}\), were developed. Each simulation incorporated input data derived from one MME technique: SEM, GB, LR, and RF, respectively. For convenience, the term \(Q_{\text {sim-ML--MME}}\) is used to refer to the group of flows simulated with climate data derived from the ML–MMEs.

Table 3 presents the statistics of the described simulations for the training period (1980–2006) and test period (2007–2015) of the ML–MME algorithms. The choice of these specific periods aligns with the study’s focus on improving climate representation through ML–MME techniques and assessing the extent to which these improvements influence streamflow representation. The congruence in analysis periods for both climate variables and flow variables enhances the study’s coherence. The statistics in Table 3 show that, while \(Q_{\text {sim-SEM}}\) yields unsatisfactory results for both periods, the ML–MMEs significantly enhance the representation of the flow. Notably, both \(Q_{\text {sim-RF}}\) and \(Q_{\text {sim-GB}}\) exhibit statistics comparable to \(Q_{\text {sim-OBS}}\), with NSE remaining above 0.60 for the training period and r exceeding 0.74 in both periods. The \(Q_{\text {sim-LR}}\) simulation, although satisfactory, yields inferior results, with higher PBIAS and lower NSE and KGE values. These outcomes indicate that the improvements in climate variable representation by ML–MMEs propagate to, and significantly enhance, flow characterisation in both the training and test periods.

Table 3 Statistics of simulated vs. observed streamflows for the training (1980–2006) and test (2007–2015) periods

To further assess the performance of the hydrological simulations, the annual cycles of the four \(Q_{\text {sim-MME}}\), together with \(Q_{\text {sim-OBS}}\) and \(Q_{\text {OBS}}\), are depicted in Fig. 6. The latter refers to the observed flow rates. In the training period (1980–2006), the annual cycle of the streamflow shows two maxima, in January and May, and a minimum in August and September. These intra-annual dynamics are captured by the calibrated and validated Temez model in the \(Q_{\text {sim-OBS}}\) simulation. Turning to the \(Q_{\text {sim-MME}}\), \(Q_{\text {sim-SEM}}\) fails to characterise the annual cycle, with a generalised overestimation of the flow over most of the year, whereas the other \(Q_{\text {sim-MME}}\) accurately reproduce the hydrological cycle of the Esca River. The annual cycle of the test period (2007–2015) differs from that of the training period, especially in the spring maximum, which is more accentuated and reaches 60 \(\text {Hm}^{3}\). The Temez model with input data from climate observations (\(Q_{\text {sim-OBS}}\)) has more difficulty in simulating the hydrological cycle for this period, although it roughly succeeds in characterising it. The \(Q_{\text {sim-ML--MME}}\) simulations accurately reproduce the \(Q_{\text {sim-OBS}}\) cycle, especially \(Q_{\text {sim-GB}}\), while \(Q_{\text {sim-SEM}}\) demonstrates poor performance. In essence, the \(Q_{\text {sim-ML--MME}}\) reproduce the intra-annual dynamics captured by the Temez model in the \(Q_{\text {sim-OBS}}\) simulation, demonstrating that the improvements in climate representation achieved by the ML–MME techniques have a positive impact on the characterisation of the hydrological cycle.
On the other hand, it is important to highlight that the differences between the flow observations (\(Q_{\text {OBS}}\)) and the simulations are attributed to errors introduced by the Temez model, probably related to the misrepresentation of snow accumulation and melting processes by the hydrological model (Jimeno-Sáez et al. 2020).
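The annual cycle discussed above is, in essence, a monthly climatology: the mean of all Januaries, all Februaries, and so on, over the period of interest. A minimal sketch of that calculation is given below. The function name, the `start_month` parameter and the sample series are hypothetical, and the spread shown as shading in the figure is not computed here.

```python
def annual_cycle(monthly_values, start_month=1):
    """Mean value per calendar month for a continuous monthly series.

    start_month is the calendar month (1-12) of the first value. The result is
    always ordered January..December; months without data are returned as NaN.
    """
    sums = [0.0] * 12
    counts = [0] * 12
    for index, value in enumerate(monthly_values):
        month = (start_month - 1 + index) % 12
        sums[month] += value
        counts[month] += 1
    return [s / n if n else float("nan") for s, n in zip(sums, counts)]


# Two hypothetical years of monthly streamflow, starting in January.
series = list(range(1, 13)) + list(range(3, 15))
print(annual_cycle(series))
```

Computing the cycle separately for the training and test periods, and for each simulation, gives the curves that are compared against the observed flow.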

Fig. 6 Annual cycle of streamflow for the training (1980–2006) and test (2007–2015) periods. Results are shown for observational flow data (\(Q_{\text {OBS}}\)), Temez-simulated flow with input data from CLIMPY climate observations (\(Q_{\text {sim-OBS}}\)) and Temez-simulated flow with input data from SEM and ML–MMEs (\(Q_{\text {sim-MME}}\)). The shaded area represents the annual variability of the streamflow results

4.5 Future Projections of Climate and Hydrological Variables

Thus far, it has been demonstrated that the utilization of ML–MME techniques has not only enhanced the representation of climate variables but has also significantly improved the accuracy of hydrological characterization during the historical period in the study area. Extending this methodology to future scenarios under the RCP8.5 emission scenario suggests that projections from trained ML–MME models may offer more realistic data than those from the SEM (Liang et al. 2008).

Figure 7 illustrates the annual cycles of the analysed variables (pr, tmax, tmin, and Q) for two distinct periods: historical (1986–2015) and long-term future (2066–2095). This figure juxtaposes simulation data from the ML–MME techniques with observational data from the historical period. A comparative analysis reveals that ML–MME techniques characterise climatic patterns better than the SEM. Specifically, while the SEM tends to overestimate precipitation during DJF and MAM, the ML–MME captures the intra-annual dynamics more accurately, showing two peaks in April and November and a minimum that extends from June to August (Lemus-Canovas et al. 2019). Similarly, ML–MME techniques more precisely replicate intra-annual temperature variations. Further, the ML–MME techniques positively influence the representation of the streamflow annual cycle by the Temez model in the study area. Indeed, simulations driven by the SEM consistently exhibit overestimations, as discussed in Sect. 4.4, whereas RF–MME, GB–MME, and LR–MME demonstrate markedly superior performance (Fig. 7).

Fig. 7 Annual cycles of pr, tmin, tmax and Q for the historical and long-term future (RCP8.5 emission scenario) periods, covering 1986–2015 and 2066–2095, respectively. The shaded area for the Q variable represents the annual variability of the results

These results and those analysed in Sects. 4.3 and 4.4 indicate that the ML–MME techniques provide more realistic information than SEM, also for the projections of the RCP8.5 emission scenario. Focusing on RF and GB, these projections indicate that precipitation will decrease throughout the year except in DJF and MAM, when it will increase, thus modifying the intra-annual precipitation patterns. Concurrently, temperatures are expected to rise consistently (Amblar-Francés et al. 2020; Lemus-Canovas and Lopez-Bustins 2021), with minimum temperatures notably increasing in March and April. These shifts in intra-annual dynamics will likely reshape the hydrological cycle, resulting in a more pronounced summer minimum and intensified, albeit shorter-duration, maxima in February and March, as projected by RF and GB and in line with the results obtained in numerous Pyrenean rivers (López-Moreno et al. 2014; García Ruiz et al. 2001; Stahl et al. 2010; Zabaleta et al. 2017; Boé et al. 2009; OPCC-CTP 2018). While the simplicity of this hydrological modeling approach, coupled with the monthly-scale analysis, limits our conclusions to informative insights, it also highlights the potential of integrating ML–MME techniques with more intricate hydrological models on a daily scale, thus paving the way for projections that can facilitate more precise resource planning and adaptation strategies in the context of climate change.

5 Conclusions

In this study, we implemented machine learning algorithms to develop Multi-Model Ensembles (MMEs) based on Regional Climate Models (RCMs) within the Esca River basin, situated in the high mountain region of the Pyrenees. A comprehensive ranking of the RCMs was established, revealing substantial variability in performance across individual variables and seasons, with MPI-driven RCMs consistently outperforming others.

To determine the optimal number of RCMs for MME construction, a top-ranked approach was adopted. Seven RCMs were selected based on performance curves analysis, forming the definitive MMEs.

Noteworthy enhancements were observed in precipitation representation on both annual and seasonal scales by the machine learning (ML) based MMEs. Although the improvements obtained for temperatures using ML-based MMEs are more subtle at the seasonal scale, a relevant improvement is observed in the annual RMSE values. Hydrological simulations employing MMEs of climate variables based on Random Forest, Linear Regression and Gradient Boosting yielded outcomes comparable to those fed by climate observations, significantly outperforming simulations based on single RCMs and SEM. Our results showcase two key findings. Firstly, they highlight the potential of machine learning techniques in constructing MMEs to enhance the characterization of climate variables. Secondly, they underscore the advantages of utilizing these ML–MMEs as input data for hydrological models.

Additionally, our methodology showcased versatility by applying algorithms to climate projections under the RCP8.5 scenario, providing more realistic information than traditional methods and thereby offering opportunities for reducing uncertainty in climate outputs for adaptation planning and basin-scale impact analyses in the context of climate change. This contribution holds particular significance and novelty in a region characterized by complex topography, such as the high mountain region of the Pyrenees, where predicting future changes is not only a complex task but also essential for the climate change adaptation of the region.