1 Introduction

An accurate assessment of water-resource sustainability at local, regional, and global scales requires hydro-climatic data of high temporal and spatial resolution, both to reduce observational uncertainty and to quantify the impact of projected changes in climate and catchment hydrologic variables on water availability. Projections of global surface air temperature and precipitation are therefore important for climate change policy formulation and for hazard assessment, since such changes may affect human livelihoods and regional economies, especially in developing economies (Akhter et al., 2017), and they underpin the mitigation and adaptation strategies adopted for integrated water resources management (Ahmed, Sachindra, et al., 2019a).

The global energy balance has undergone periodic changes due to the increased emission of atmospheric greenhouse gases from widespread fossil fuel use and industrial activity across the globe (Chu et al., 2010; Huang et al., 2011; Salman et al., 2018). The rise in greenhouse gas concentrations has been estimated to increase the Earth’s temperature at a rate of 0.15 °C/decade (IPCC, 2013) and has potentially severe effects on the Earth’s ecosystems, most notably in the tropical regions (Wang et al., 2014; Mohsenipour et al., 2018; Khan et al., 2019). Mishra and Liu (2014) observed that the variations in rainfall induced by projected climate change would be more intense in the tropical regions of the world. This phenomenon has become a significant socio-economic and political issue (Alamgir et al., 2019).

General circulation models (GCMs) are the main tool for predicting future changes in the global climate (Maraun et al., 2010) because they can replicate historical climatic changes as well as simulate future changes under prescribed greenhouse gas concentrations (Goyal et al., 2012) and the shared socio-economic pathways integrated into the recently developed Coupled Model Intercomparison Project Phase 6 models. Evaluating the performance of GCM outputs is instrumental for simulating historical and future basin-scale hydrologic cycles with hydrological models. The accuracy of the precipitation and temperature outputs from GCMs, when used as inputs for hydrological modelling, determines the reliability of the predicted hydrologic variables. Their spatial and temporal performance, and its seasonal variation under natural and climate-induced radiative forcing, therefore need to be evaluated in a multi-model context (Eyring et al. 2016; Gusain, Ghosh, and Karmakar 2020; Wang et al. 2021) to reduce uncertainties and enhance prediction accuracy for reliable policy planning of basin-scale hydrology.

GCMs are designed to produce projections at a coarse spatial scale and cannot resolve finer-scale features such as clouds and land-use change. Studies at the regional scale therefore require that the output be downscaled to a finer resolution. The mismatch between the hydro-climatic information required and the GCM output is a major obstacle in hydrologic impact studies (Willems and Vrac, 2011). Careful assessment of the various downscaling techniques is important to enable accurate prediction and to bridge the mismatch between regional hydro-climatic information and GCM outputs at spatial scales of between 5 and 50 km (Yang and Delsole, 2012). However, McSweeney et al. (2015) argued that downscaling the complete ensemble may be neither desirable nor necessary to produce a meaningful range of future climate conditions for evaluating hazards associated with future climate change, because high-resolution downscaling is labour- and computing-intensive. Various strategies therefore need to be explored to sample from the available CMIP6 GCMs and their shared socio-economic pathway (SSP) scenarios and generate projections relevant for water policies at the catchment scale. The assessment process aims to identify the challenges and opportunities in excluding models deemed unsatisfactory in their representation of key climate features.

These models have been used to simulate historical and projected changes in climate at global (Sachindra et al., 2014; Wright et al., 2015), regional (Salman et al. 2018; Ahmed, Sachindra, et al. 2019a; Abbasian, Moghim, and Abrishamchi 2019) and local basin scales (Akhter et al., 2017; Hassan et al., 2020). Over the years, six phases of the Coupled Model Intercomparison Project (CMIP) have been developed by various modelling centres for climate change studies. Previous studies using the fifth-phase (CMIP5) GCMs indicated significant improvements in simulated climate variables relative to the earlier generation (CMIP3) models (Wang et al. 2016), owing to improved knowledge of climate science. Studies based on CMIP5 GCMs identified the Sahel region, tropical West Africa and southern Africa as hotspots for severe regional climate change impacts (Diffenbaugh and Giorgi 2012; Niang et al. 2014). Studies using CMIP6 models have recently begun to assess their ability to replicate historical and future climate at regional and global scales. There are notable differences between the CMIP6 GCMs and earlier phases: CMIP6 integrates new specifications for greenhouse gas concentration and emission scenarios, as well as land-use scenarios (Gidden et al., 2019). A limited number of studies using CMIP6 GCMs have indicated improvement and robustness over earlier phases in some regions, for example Australia (Grose et al., 2020), Africa (Almazroui et al., 2020; Ayugi et al., 2021; Sian et al., 2021), Nigeria (Shiru and Chung, 2021) and South Korea (Song et al., 2020). Performance may vary from one region to another, which necessitates evaluation in the region of interest, especially in data-sparse catchments, before adoption, so that the climatic features are properly represented and projections with large uncertainty ranges, which may lead to overconfidence and poor adaptation, are avoided (McSweeney et al., 2015).

Several techniques have been used to assess the performance of climate models, such as ensemble averaging (Giorgi and Mearns, 2002), the combination of statistical measures such as root mean square error, mean bias error, mean absolute error and correlation coefficient into a single performance index (Gu et al. 2015; Wu et al. 2017), relative entropy (Shukla et al., 2006), probability density functions (PDFs) (Perkins et al., 2007), Bayesian regression (Chandler, 2013), clustering (Knutti et al., 2013), correlation (Li et al., 2016; Xuan et al., 2017), and symmetrical uncertainty (Salman et al., 2018). These methods are often cumbersome or produce inconsistent performance across climate variables, so a multi-criteria decision tool is required to select appropriate GCMs for hydrologic impact studies.

However, previous studies have reached no consensus on the choice of GCM selection approach, and McMahon et al. (2015) posited that no single criterion for GCM performance is universally accepted, although assessment at multiple time scales and the ability of a GCM to capture the spatial structure of a catchment’s key climate features can provide crucial information for water resource management, especially in basins with high climate variability (Gleckler, Taylor, and Doutriaux 2008; Ahmed, Sachindra, et al. 2019a). The downside of statistical metrics is that each assesses only certain features of the time series relative to the observed data (McSweeney et al., 2015; Weigel et al., 2010), and different metrics often provide contradictory results (Nashwan and Shahid, 2019; Raju et al., 2017).

Studies on GCM selection can be grouped into the past-performance method (Biemans et al., 2013), where models are selected based on their capability to replicate the historical climate, and the envelope method (Warszawski et al., 2014), where an ensemble of GCMs is selected based on its ability to encompass the whole range of future projections. Both approaches have weaknesses: the former neglects agreement between GCMs in simulating the projected future climate, while the latter ignores the ability of GCMs to replicate the observed historical climate (Ahmed et al., 2019b).

Machine learning algorithms, especially filters and wrappers such as clustering (Knutti et al., 2013; Raju and Kumar, 2016), Bayesian weighting (Min and Hense, 2006), and weighted skill scores (Maxino et al., 2008; Perkins et al., 2007), are gaining attention. These techniques have been used to evaluate and select GCMs using a single performance index (Ahmadalipour et al., 2017; Fu et al., 2013; Raju and Kumar, 2016; Wójcik et al., 2014). However, they have inherent weaknesses, such as the inability to capture the temporal variability of climate and the variation in the frequency of climate extremes, both of which are important factors in the assessment of model performance (Salman et al., 2018).

A major advantage of machine learning techniques lies in feature selection, where candidate predictors are analysed and ranked according to their importance for the target variable, and features likely to decrease model efficiency are screened out to reduce uncertainty in model development. In machine learning applications it is important that the observed data are well represented and that the information within the data series is learnable by the modelling process, as this is critical for the final performance (Kursa, 2016).

In GCM selection for hydro-climatic studies, the information-entropy-based filter known as symmetrical uncertainty (SU) (Witten et al., 2005) has gained the attention of researchers due to its ability to select variables reliably and without bias. The technique ranks GCMs according to their degree of similarity to the observations over the entire time series, and it has the advantage of providing a universal metric of the relationship between dependent and independent features irrespective of the shape of the underlying distributions (Wu and Zhang 2004). It has been used in various studies (Ahmed et al., 2019b; Nashwan and Shahid, 2019; Pour et al., 2018; Salman et al., 2018).

The technique has proven promising, with similar or better skill for the evaluation and selection of appropriate GCM datasets for hydro-climatic studies compared with other available methods such as compromise programming, the wavelet-based skill score (WSS) and statistical metrics, which sometimes yield contradictory results and make decisions on performance and selection difficult (Ahmed, Shahid, et al. 2019b). Nevertheless, most filter-based classifiers use single algorithms to integrate variable selection and modelling, and often evaluate only univariate or very simple interactions between attributes and decisions, which affects the outcomes or performance of the selected features (Kursa, 2016).

Machine learning approaches continue to evolve; the wrapper-based Boruta algorithm for feature selection in model building, for example, has been applied in other disciplines: Kursa (2016) showed the relative skill and effectiveness of the methodology for feature selection with a random ferns classifier, and Ahmed et al. (2021) applied the technique to soil moisture estimation under global warming scenarios. Advances in computational capability have made machine and deep learning techniques useful for accurate variable prediction because they can extract, process and handle relatively large amounts of complex data with a high degree of variable-mapping skill and efficiency (Gong et al., 2019).

Numerous studies have successfully implemented different feature selection algorithms for hydro-climatic studies, such as information entropy (Shukla et al., 2006), Bayesian weighting (Min and Hense, 2006), elastic net and ridge regression (Hammami et al., 2012), artificial neural networks (Hajnayeb et al., 2011), support vector machines (Maldonado and Weber, 2009), random forest (Genuer et al., 2010), neighbourhood component analysis for regression (Ghimire et al. 2019a), and the iterative input selection algorithm (Prasad et al., 2017). Among these techniques, symmetrical uncertainty (SU), which is based on information entropy (William et al., 1996), is popular and has gained the attention of researchers (Kannan and Ramaraj 2010; Nashwan and Shahid 2019; Ahmed, Shahid, et al. 2019b; Sa’adi et al. 2020). However, these methods have identified shortcomings: inter-dependencies among the models are ignored for a given feature, which may result in the selection of inappropriate GCMs due to overfitting, and some performance indices are based on the state of the climate as a whole, so that temporal variability, which is critical in model performance assessment, is not considered (Reichler and Kim 2008; Ahmed, Shahid, et al. 2019b; Hassan et al. 2016).

This study pilots the use of the Boruta random forest algorithm (BRF), developed by Kursa et al. (2010), for the performance evaluation and selection of appropriate CMIP6 GCMs, examining each model’s capability to simulate the observed daily historical climate variables. The Lake Chad basin was adopted to test the robustness of the methodology because of its uniqueness, its high climate variability and its high sensitivity to climate change, and because the algorithm’s highly skilled variable mapping suits the preparation of the input parameters required for hydrologic study. The technique has been applied and recommended in previous studies: Prasad et al. (2019) used BRF to predict soil moisture, Christ et al. (2016) applied it to distributed and parallel time-series feature extraction for industrial big-data applications, Leutner et al. (2012) predicted forest biodiversity, and Lyu et al. (2017) applied the concept to forecast air quality. In these studies the algorithm was used to define significant input parameters by comparing the real features or variables with random probes, and all reported convincing model accuracy. To the authors’ knowledge, however, this is the first attempt to apply the proposed technique to GCM evaluation and selection. To validate the methodology, an information-aggregation approach was adopted to combine the ranks of the GCMs across the grid points of the entire basin and identify the best ensemble of CMIP6 GCMs for simulating the above variables. The results of this approach are compared with the well-established symmetrical uncertainty technique to assess the efficacy, applicability and inherent uncertainties of the model evaluation and selection approach.

Furthermore, earlier simulation-based studies investigating the impact of climate change on freshwater hydrology required data in the form of daily precipitation and temperature series to drive hydrological watershed models; however, model performance depends on the driving general circulation models, their internal parameterizations and the model domain configuration (Déqué, 2007).

It is therefore important to note that conclusions drawn from studies that evaluated and selected GCMs at a monthly time scale (Salman et al. 2018; Abbasian, Moghim, and Abrishamchi 2019; Ahmed, Shahid, et al. 2019b) may not be reliable for studies requiring climate data at a daily time step for hydrologic modelling, where biases and observational uncertainties must be reduced for realistic prediction of climate change impacts, adaptation and resilience in hazard assessment. The performance of GCMs selected by BRF, SU and an all-model ensemble average approach is evaluated in this study for a realistic assessment of basin-scale features and climate variable dynamics.

Finally, this study proposes a robust approach that preserves the climate signals affected by re-gridding, downscaling and bias correction, so that an optimal subset of GCM datasets can be used to assess the complex interactions within Earth system (hydrologic) models. Such an approach is essential for accurately understanding and forecasting long-term changes in basin hydrology resulting from external forcing, which have not been adequately addressed in previous research, especially in data-sparse regions, within an acceptable level of uncertainty.

2 Case study area and data

2.1 Case study area

The Lake Chad Basin is one of the world’s largest endorheic basins, with an estimated area of ~2,500,000 km2 (Coe and Foley, 2001; Gao et al., 2011). It is situated between latitudes 5.20–25.30° N and longitudes 6.90–24.50° E, in the transition zone between the Sahara and the Sudano-Sahelian regions of West Africa (Ndehedehe et al., 2018) (Fig. 1). The basin provides the primary source of freshwater for livestock grazing, agricultural production, and fish farming (Buma et al., 2016). It is divided into four climatic zones, namely the Saharan, Sahelo-Saharan, Sahelo-Sudanian, and Sudano-Guinean zones, with annual precipitation ranging from < 100 mm to 1500 mm across these zones, and average annual temperatures ranging from 35–40 °C in the northern part to as low as 26.5 °C in the southern part of the basin (Nkiaka, Nawaz, and Lovett 2018a). The climate is characterised by hot, wet and dry conditions from March to June, June to October, and November to February, respectively (Mahmood et al., 2019). The basin lies in an area of minimal relief with no surface outflow, and its spatial extent is highly vulnerable to climate change, with elevations ranging from –330 to 3446 m (see Fig. 1). According to Coz et al. (2009) and Nkiaka, Nawaz, and Lovett (2018b), with the exception of a few isolated hills, plateaus, and mountains in the southern and northern regions, the basin is generally flat, with an average slope of < 1.3%.

Fig. 1

Lake Chad basin showing elevation and climate stations for the proposed study

2.2 Gridded data and sources

Several gridded climate data sources were explored and analysed for suitability in climate studies of the Chad Basin, given the inadequacy of gauge-based meteorological data in most parts of the world, and in Sub-Saharan Africa and the Mediterranean in particular. This study used the daily gridded precipitation data of the US Climate Prediction Centre (CPC), derived by optimal interpolation of station (gauge-based) records from the GTS (Xie et al., 2007), and the daily maximum and minimum temperature data of the Princeton University Global Meteorological Forcing dataset (PGF v.2), derived from NCEP-NCAR reanalysis forcings and other global data by bilinear interpolation (Sheffield et al., 2006). Both datasets cover 1979 to 2012 and are available at https://www.esrl.noaa.gov/psd/data/gridded/data.cpc.globalprecip.html and http://hydrology.princeton.edu/data.pgf.php respectively; they were recommended by Lawal et al. (2021) as adequately representing the true climatic features of the selected basin.

2.3 General circulation model data and sources

In this study, 16 Coupled Model Intercomparison Project Phase 6 (CMIP6) general circulation models at daily time resolution were considered. These are available at https://esgf-node.llnl.gov/projects/esgf-llnl/CMIP6 for the period 1979–2012, consistent with the gridded data time scale for a realistic performance evaluation. Details of the GCM names, spatial resolutions, development centres and countries of origin are provided in Table 1. The first ensemble member of each GCM was adopted for a fair assessment. The models were chosen because they provide the requisite historical and future climate change emission scenario data (SSP1–SSP5) at a daily timestep, an important requirement for hydrologic modelling. This is essential for understanding projected watershed hazards due to climate change and provides valuable information to policy makers for informed decisions on integrated water resource management.

Table 1 Summary of CMIP6 models considered in this study

3 Research methodology

3.1 Methodology flow chart

The general procedure used for the evaluation of GCMs against gridded CPC and PGF climate data (as surrogates for observations), their selection, and the further analysis of the GCM ensemble is outlined in the methodology flow chart shown in Fig. 2.

Fig. 2

GCM evaluation and selection for hydro-climatic study

3.2 Statistical downscaling and bias correction of GCM

A multi-point statistical downscaling and bias correction approach was adopted. The CMIP6 daily precipitation, maximum temperature and minimum temperature outputs were interpolated to a common 2° × 2° grid and spatially downscaled using a bilinear interpolation approach for smooth transformation from coarse to fine resolution, as recommended by Fischer et al. (2014), resulting in 54 grid points (Fig. 1) covering the entire study area.
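To illustrate the regridding step, the sketch below (not the authors' code; the coarse-grid coordinates and the synthetic daily field are assumptions for illustration only) bilinearly interpolates one daily GCM field onto a common set of 2° × 2° grid points using SciPy.

```python
# Illustrative sketch of bilinear regridding of one daily GCM field onto a
# common 2-degree grid; coordinates and the synthetic field are assumptions.
import numpy as np
from scipy.interpolate import RegularGridInterpolator

gcm_lat = np.arange(4.0, 28.0, 2.5)                   # coarse GCM latitudes
gcm_lon = np.arange(5.0, 27.0, 2.5)                   # coarse GCM longitudes
field = np.random.gamma(2.0, 3.0, (gcm_lat.size, gcm_lon.size))  # one daily field

# Target points of the common 2-degree grid covering the basin
tgt_lat, tgt_lon = np.meshgrid(np.arange(6.0, 26.0, 2.0),
                               np.arange(7.0, 25.0, 2.0), indexing="ij")
targets = np.column_stack([tgt_lat.ravel(), tgt_lon.ravel()])

# Bilinear (method="linear") interpolation from the coarse grid to the targets
interp = RegularGridInterpolator((gcm_lat, gcm_lon), field, method="linear",
                                 bounds_error=False, fill_value=None)
regridded = interp(targets).reshape(tgt_lat.shape)
```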

The deviation of the interpolated climate data was then corrected to improve the agreement of the GCM outputs with the observed data (CPC precipitation and PGF temperature) and to enhance the model outputs, because direct GCM output cannot be relied upon for accurate assessment of the watershed climate features required for local- and regional-scale impact studies, owing to its coarse resolution.

This was achieved by combining features of the local observations and simulations to produce higher-resolution climate projections with insignificant biases, using three well-known bias correction approaches: delta change (Navarro-Racines et al., 2020), quantile mapping (Cannon et al., 2015) and empirical quantile mapping (U. Ghimire et al., 2019b); the detailed methodologies can be found in the cited references.
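As an indication of how the quantile-based correction operates, the following minimal sketch (a simplification under stated assumptions, not the cited implementations) maps raw model values onto the observed distribution via the empirical CDFs of the overlapping historical period.

```python
import numpy as np

def empirical_quantile_mapping(obs_hist, mod_hist, mod_raw, n_quantiles=100):
    """Simplified empirical quantile mapping: replace each raw model value
    with the observed value at the same empirical quantile. Dry-day handling
    and tail extrapolation, treated in the cited methods, are omitted."""
    probs = np.linspace(0.01, 0.99, n_quantiles)
    mod_q = np.quantile(mod_hist, probs)   # model quantiles over the baseline
    obs_q = np.quantile(obs_hist, probs)   # observed quantiles over the baseline
    return np.interp(mod_raw, mod_q, obs_q)
```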

The grid-based bias correction performance was evaluated with three statistical metrics, namely the correlation coefficient, the mean bias error and the index of agreement, to understand the limitations of the various techniques, prevent their misuse, and select the best-suited output for further analysis; their detailed methodologies can be found in Taylor (1997), Willmott (1981) and Willmott and Matsuura (2005).
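For reference, the three metrics can be computed as in the sketch below (an illustrative implementation rather than the authors' code; the modified index of agreement follows Willmott's absolute-error form).

```python
import numpy as np

def mean_bias_error(sim, obs):
    return np.mean(np.asarray(sim) - np.asarray(obs))

def correlation(sim, obs):
    return np.corrcoef(sim, obs)[0, 1]

def modified_index_of_agreement(sim, obs):
    # Willmott's modified index of agreement (absolute-error form)
    sim, obs = np.asarray(sim), np.asarray(obs)
    num = np.sum(np.abs(sim - obs))
    den = np.sum(np.abs(sim - obs.mean()) + np.abs(obs - obs.mean()))
    return 1.0 - num / den
```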

3.3 Performance evaluation and selection of appropriate GCM output by machine learning technique

The adopted daily bias-corrected CMIP6 GCM outputs were evaluated with two machine learning approaches to discover the significant features of the GCMs relative to the observation data. This section gives a brief overview of the machine-learning-based symmetric uncertainty and Boruta random forest algorithms used for performance assessment and ensemble projections.

3.3.1 Symmetric uncertainty

Symmetric uncertainty, developed by William et al. (1996), is an entropy-based filter that assesses the pair-wise similarity between dependent and independent attributes irrespective of their probability distributions and interdependency (Wu and Zhang, 2004). It measures the information gain of the response random variable relative to the predictor: the greater the mutual information (equivalently, the lower the joint entropy), the stronger the association between the datasets. The bias in the measurement is corrected by dividing the information gain by the sum of the entropies of the observed and GCM precipitation, maximum temperature and minimum temperature series respectively. The symmetric uncertainty (SU) was estimated according to Eq. (1); a detailed methodology can be found in Ahmed et al. (2019b) and Lawal et al. (2021).

$$SU\left(x,y\right)=2\times \frac{MI\left(x,y\right)}{H(x)+H(y)}$$
(1)

In Eq. (1), MI(x, y) is the mutual information gain, and H(x) and H(y) represent the entropies of the GCM and observation respectively. Symmetric uncertainty values range from 0 (no similarity) to 1 (perfect similarity).
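A minimal sketch of Eq. (1), assuming the daily series are discretized into equal-width bins (the bin count is an arbitrary choice for illustration), is given below.

```python
import numpy as np

def _entropy(counts):
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def symmetric_uncertainty(gcm, obs, bins=20):
    """SU(x, y) = 2 * MI(x, y) / (H(x) + H(y)) for two daily series,
    estimated from a joint histogram (illustrative sketch of Eq. (1))."""
    joint, _, _ = np.histogram2d(gcm, obs, bins=bins)
    hx = _entropy(joint.sum(axis=1))       # H(x), GCM series
    hy = _entropy(joint.sum(axis=0))       # H(y), observed series
    hxy = _entropy(joint.ravel())          # joint entropy
    mi = hx + hy - hxy                     # mutual information gain MI(x, y)
    return 2.0 * mi / (hx + hy)
```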

3.3.2 Boruta random forest algorithm

Boruta feature selection was developed based on the random forest algorithm (Breiman, 2001), building on the random-probe approach introduced by Stoppiglia et al. (2003) for identifying significant input parameters among many candidate features that match the attributes of a target feature. The algorithm computes the Z-score of all input predictors, and the distribution of scores for the shadow features determines which predictor variables are important (Kursa and Rudnicki, 2010). The methodology adopted is outlined below.

  • A duplicate or shadow variable \({q}_t^{\prime }\) is created for each predictor vector qt by randomly permuting its values, so that any association with the target variable (Pt) arises by chance alone, for all groups of discrete inputs (qt ∈ Rn), targets (Pt ∈ R) and time steps t = 1, 2, 3, …, T. The association with the target variable is then evaluated for both the shadow variable \({q}_t^{\prime }\) and the actual variable qt.

  • The variable importance measures or scores (mean decrease in accuracy) for each predictor qt and shadow attribute \({q}_t^{\prime }\) were computed over 500 iterations based on Eq. (2) to achieve better accuracy (Hur et al., 2017).

$$\kern3em \textrm{VIM}=\frac{1}{m_{tree}}\sum\nolimits_{m=1}^{m_{tree}}\frac{\sum_{t\in OOB}I\left({P}_t=f\left({q}_t\right)\right)-\sum_{t\in OOB}I\left({P}_t=f\left({q^n}_t\right)\right)\kern0.5em }{\left| OOB\right|}$$
(2)

I(•) is an indicator function, OOB denotes the out-of-bag samples of the training data and |OOB| their number; Pt = f(qt) and Pt = f(qnt) are the predicted values before and after permutation.

  • The standard score (Z-score) of the predictor and shadow attributes relative to the observation are computed as:

$$Z-\textrm{score}=\frac{VIM}{\upsigma}$$
(3)

where σ is the standard deviation of the accuracy losses. The minimum, median and maximum Z-scores of the shadow features are computed and compared with the variable importance distribution of all candidate features. The algorithm ends when all input features have been confirmed as important or when the iteration limit is reached.

  • An attribute is deemed important for a given iteration if its variable importance score is higher than the maximum importance score of the shadow attributes. A simplified sketch of this screening loop is given below.
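The sketch below illustrates the shadow-attribute screening loop in simplified form (an assumption-based illustration, not the full Boruta implementation of Kursa et al. (2010)); for brevity it uses scikit-learn's impurity-based importance in place of the permutation-based mean decrease in accuracy of Eq. (2).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def boruta_like_screen(X, y, n_iter=20, n_trees=500, seed=0):
    """Simplified Boruta-style screening: in each iteration, permuted 'shadow'
    copies of every predictor are appended, a random forest is fitted, and a
    real predictor scores a hit when its importance exceeds the maximum
    importance among the shadow copies."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    hits = np.zeros(n_features)
    for _ in range(n_iter):
        # Shuffle each column independently to break its link with the target
        shadows = np.column_stack([rng.permutation(col) for col in X.T])
        rf = RandomForestRegressor(n_estimators=n_trees, n_jobs=-1,
                                   random_state=seed)
        rf.fit(np.hstack([X, shadows]), y)
        imp = rf.feature_importances_
        hits += imp[:n_features] > imp[n_features:].max()
    return hits / n_iter   # fraction of iterations each predictor beat the shadows

# Usage sketch: X holds one column of bias-corrected daily values per GCM at a
# grid point, and y holds the corresponding observed (CPC or PGF) series.
```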

3.4 Performance evaluation of selected GCM ensemble mean

The weights or performance scores from the two machine learning techniques were used to rank the bias-corrected precipitation, maximum and minimum temperature at each grid point, and the rankings were aggregated over the entire study area using numerical averaging of multi-criteria rating metrics, as recommended by Raju and Kumar (2016); a sketch of this aggregation step is given below.
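A minimal sketch of the aggregation, assuming a matrix of per-grid-point scores (SU coefficients or importance scores) with one column per GCM, is shown below.

```python
import numpy as np
from scipy.stats import rankdata

def aggregate_gcm_ranks(scores):
    """scores: array of shape (n_grid_points, n_gcms). Each grid point ranks
    the GCMs (rank 1 = highest score); ranks are then averaged across grid
    points (simple numerical averaging) and GCM indices returned best-first."""
    ranks = np.apply_along_axis(lambda s: rankdata(-s), 1, scores)
    mean_rank = ranks.mean(axis=0)
    return np.argsort(mean_rank), mean_rank
```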

Multi-model ensemble means were formed by selecting the four best GCMs relative to the observations under each of the two machine learning approaches (hereafter the BRF and SU ensembles) and by averaging all 16 GCMs (hereafter the all-model ensemble mean, AME), in order to further evaluate spatial and temporal changes in annual precipitation, maximum and minimum temperature, and extreme conditions of the basin climate. Analysis of individual model biases, as seen in other studies, does not necessarily translate into better performance. Here, the study focuses on the ensemble means of the GCMs to better understand the uncertainty (spread) of the combined ensemble-mean features, which is important and sensitive for their use in hydrologic impact studies and reliable future projections of water resources.

3.5 Validation of evaluation approach based on return period of drought

The evaluation approaches considered, i.e., BRF, SU and AME, were further validated by examining the influence of the ensemble-mean GCM precipitation, maximum and minimum temperature on the severity and return period of drought, relative to the observations, across the four climatic zones of the basin, by estimating the 12-month standardized precipitation evapotranspiration index (SPEI) for the observation period (1979–2012). The estimated SPEI time series were then used to assess the temporal pattern of the trend and its significance with the Mann-Kendall trend test and Sen’s slope estimator at the 95% confidence level.

3.6 Standardized precipitation evapotranspiration index

The standardized precipitation evapotranspiration index depicts water surplus or deficit over long climatic time scales: the difference between precipitation and potential evapotranspiration is computed and then fitted with a probability density function to estimate the return period of floods or droughts in a catchment. Severity was classified as SPEI ≥ 2.0, 1.5 to 1.99, 1.0 to 1.49, –0.99 to 0.99, –1.0 to –1.49, –1.5 to –1.99 and SPEI ≤ –2.0, indicating extremely wet, severely wet, moderately wet, normal, moderate drought, severe drought, and extreme drought events respectively (Shekhar and Shapiro, 2019); a minimal classification sketch is given below. Validation by SPEI captures the impact of temperature increase on water demand and the natural variability of climate, and its influence on climate change studies.
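The classification can be expressed as in the sketch below (a direct transcription of the thresholds quoted above; the handling of values at the class boundaries is an assumption).

```python
def classify_spei(spei):
    """Classify a 12-month SPEI value using the thresholds of
    Shekhar and Shapiro (2019) quoted in the text."""
    if spei >= 2.0:
        return "extremely wet"
    if spei >= 1.5:
        return "severely wet"
    if spei >= 1.0:
        return "moderately wet"
    if spei > -1.0:
        return "normal"
    if spei > -1.5:
        return "moderate drought"
    if spei > -2.0:
        return "severe drought"
    return "extreme drought"
```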

The temporal trend and its significance were further evaluated using the Mann-Kendall test (Mann, 1945; Kendall, 1948) and Sen’s slope estimator (Sen, 1968), given in the equations below. The Mann-Kendall statistic is:

$$S=\sum\nolimits_{k=1}^{n-1}\sum\nolimits_{j=k+1}^n\operatorname{sgn}\left({x}_j-{x}_k\right)$$
(4)

For a time series of n values, xj and xk are the data values at time steps j and k respectively (j > k). The sign function sgn is defined as:

$$\operatorname{sgn}\left({x}_j-{x}_k\right)=\left\{\begin{array}{c}1\ if\ {x}_j>{x}_k\\ {}0\ if\ {x}_j={x}_k\\ {}-1\ if\ {x}_j<{x}_k\end{array}\right.$$
(5)

The mean E(S), the variance V(S) and the Z statistic are evaluated as:

$$E(S)=0$$
(6)
$${\displaystyle \begin{array}{c}V(S)=\frac{1}{18}\left\{n\left(n-1\right)\left(2n+5\right)-\sum_{i=1}^p{t}_i\left({t}_i-1\right)\left(2{t}_i+5\right)\right\}\\ {}Z=\left\{\begin{array}{c}\frac{S-1}{\sqrt{V(S)}}\ for\ S>0\\ {}0\kern1em for\ S=0\kern3.5em \\ {}\frac{S+1}{\sqrt{V(S)}}\ for\ S<0\kern0.5em \end{array}\right.\end{array}}$$
(7)

Because the nonparametric Sen's slope estimator is robust to outliers in time series analysis, it was used to calculate the magnitude of the detected trends in the time series data:

$$\textrm{SS}=\textrm{median}\left[\frac{x_j-{x}_i}{j-i}\right]\kern0.75em \textrm{for}\ \textrm{all}\ i<j$$
(8)

where xi and xj represent the data values at time steps i and j respectively.
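A minimal implementation of Eqs. (4)–(8) is sketched below (the tie-correction term of V(S) is omitted for brevity; significance at the 95% level corresponds to |Z| > 1.96).

```python
import numpy as np

def mann_kendall_z(x):
    """Mann-Kendall Z statistic (Eqs. 4-7), without the tie correction."""
    x = np.asarray(x, dtype=float)
    n = x.size
    s = sum(np.sign(x[k + 1:] - x[k]).sum() for k in range(n - 1))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        return (s - 1) / np.sqrt(var_s)
    if s < 0:
        return (s + 1) / np.sqrt(var_s)
    return 0.0

def sens_slope(x):
    """Sen's slope estimator (Eq. 8): median of all pairwise slopes."""
    x = np.asarray(x, dtype=float)
    n = x.size
    slopes = [(x[j] - x[i]) / (j - i)
              for i in range(n - 1) for j in range(i + 1, n)]
    return np.median(slopes)
```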

4 Results and discussion

4.1 Spatial and temporal downscaling and bias correction of CMIP6 GCMs

Figure 3(a–c) shows examples of post-corrected precipitation, maximum temperature and minimum temperature data from the GCM ACCESS CM2 relative to the observations at grid point 1 (15.950, 6.310), for 1979–2014 and 1979–2012 respectively, using the delta change, quantile mapping and empirical quantile mapping techniques.

Fig. 3

Plot of bias-corrected GCMs and observed climate data using delta change, quantile mapping and empirical quantile mapping method in the Lake Chad basin. (a) Variation of GCM precipitation relative to observed CPC data. (b) Variation of GCM maximum temperature relative to observed PGF data. (c) Variation of GCM minimum temperature relative to observed PGF data

The evaluation of the downscaled, daily bias-corrected GCM outputs of precipitation, maximum and minimum temperature across the 54 grid points covering the study area indicated that the delta change method is the most suitable for daily precipitation and the empirical quantile mapping method for daily maximum and minimum temperature, with a mean bias error MBE = 0, mean correlation coefficient R2 = 0.8 and modified index of agreement md = 0.86 (at 94% of the grid points) for precipitation, and an MBE range of –0.01 to 0, mean correlation coefficient R2 = 0.92 and modified index of agreement md = 0.96 for temperature, relative to the other methods across all grid points. However, as shown in Fig. 3c, the downscaled models have difficulty capturing the peak values of minimum temperature under all evaluation methods, especially in the Sahelo-Saharan zone; this may be attributed to the scale gap between GCM outputs and observations, and such variations cannot be accounted for by the GCM outputs, as corroborated by Sachindra et al. (2014). Hence, the GCM outputs from this evaluation technique were adopted for the next phase of analysis. The evaluation was limited to statistical downscaling because it is more flexible than dynamical techniques and its projections can be downscaled to point-specific locations, as corroborated by Martinez-García et al. (2021).

The merit of this approach is that it provides a reference for evaluating the accuracy of precipitation and temperature, the two critical hydrologic model input parameters, thereby improving predictions of future river basin hydrologic cycles. The bias correction technique performed much better in simulating the observations in the Sudano-Guinean zone, with a cluster of grid points showing almost perfect correlation with the observations. The Saharan zone, however, exhibited some inadequacy in replicating the observed climate features, which may be due in part to the poor climate signal of its sparse precipitation events. In general, the results indicate some improvement in the ability of the CMIP6 GCMs to capture the observed signals of the climate variables, and the approach restores the inter-station dependencies. The multi-site approach demonstrated its capability to address the inequities of transitioning and interpolating GCMs between coarse and fine resolution, and to reproduce accurately the multiple observed statistical properties of the climate variables for improved hydrological climate change impact studies.

4.2 Performance evaluation and selection of CMIP6 GCMs

For ease of identification, the individual models were coded M1, M2, …, M16 to represent ACCESS CM2, ACCESS ESM1-5, …, NorESM2-MM respectively, as they appear in Table 1. The performance of each model at each grid point was analysed with the two machine learning approaches, SU and BRF, to investigate their significance in replicating the basin’s climatic features of precipitation and maximum and minimum temperature for the periods 1979–2014 and 1979–2012 respectively; these periods were a compromise imposed by the availability of the CPC, PGF and historical CMIP6 GCM data and by the World Meteorological Organisation climatological standard normal period.

The GCMs were ranked based on their symmetric uncertainty (SU) coefficients (Section 4.3) and variable importance scores (Section 4.3.1) at each grid point for all climate variables, to understand the degree of association or relative skill (importance) of the models, with insignificant bias, across the basin. In the BRF analysis, a zero score was recorded for GCMs whose variable importance score was lower than that of the shadow attributes at a given grid point before aggregation and ranking, for consistency.

4.3 Evaluation based on symmetric uncertainty

The symmetric uncertainty results for all climate variables were aggregated across the 54 grid points considered, which give a fair representation of the basin, as shown in Table 2. The rankings based on the coefficients indicate that GFDL-ESM4, MIROC6, INM-CM4.8, ACCESS ESM1-5 and MRI-ESM2.0 exhibit better skill for precipitation, with coefficients in the range 0.94–0.81; MPI-ESM1.2-LR, INM-CM4.8, EC-Earth3-Veg and EC-Earth3 are relatively better at simulating maximum temperature, with coefficients in the range 0.98–0.92; and MPI-ESM1.2-LR, MIROC6, NorESM2-MM and CanESM5 are relatively better at simulating minimum temperature, with coefficients in the range 0.84–0.85. All the CMIP6 GCMs exhibit above-average skill except F-Goals-g3, which was quite poor for precipitation. The varied level of GCM performance further complicates evaluation; as acknowledged by McMahon et al. (2015), GCMs have strengths and weaknesses in simulating different climate variables. The simulation results of the CMIP6 GCMs assessed by symmetric uncertainty show some improvement over the earlier phase reported in Ahmed et al. (2019a), which may be attributed to differences in the timescale and timesteps of the chosen GCMs, improvements in model parameterization and development, and the quality of the observation data, as corroborated by Ayugi et al. (2021), Grose et al. (2020) and Wang et al. (2021).

Table 2 Summary of GCM SU coefficients of precipitation, maximum and minimum temperature relative to observation data

4.3.1 Evaluation based on Boruta random forest algorithm

The performance of the GCMs was assessed based on the ability of the algorithm to iteratively compare the importance of the original attributes (the CMIP6 GCMs) with that of their randomised sets (shadow attributes) in replicating the observation data. The simulations and importance measures were generated, and the variables were ranked as shown in Fig. 4a–c for grid point 1. The ranking indicates that models M9, M13, M10, M14, M4, M11 and M7 are important for simulating daily precipitation, with variable importance scores in the range 4–22, while the others, whose importance scores fall below the maximum importance score of the shadow attributes, are considered poor at simulating the observed (CPC) daily precipitation. All GCMs exhibit good skill in simulating maximum and minimum temperature, with variable importance scores in the ranges 18.0–54.2 and 19.1–44.6 respectively, though with significant differences between models. The procedure was repeated across all grid points, and attributes with performance scores below those of the shadow attributes were filtered out or rejected. The variable importance scores were aggregated across all grid points and ranked in descending order of mean importance score for all variables, as shown in Table 3.

Fig. 4

Box plot of variable importance score of GCMs relative to observed climate data using BRF. (a) Relative importance of GCM precipitation to observed CPC data. (b) Relative importance of GCM maximum temperature to observed PGF data. (c) Relative importance of GCM minimum temperature to observed PGF data

Table 3 Summary of GCM importance score of precipitation, maximum and minimum temperature relative to observation data

The evaluation of the climate variables by the Boruta random forest revealed a consistent performance of the GCMs across the grid points, which makes the technique efficient for the ranking process and for holistic assessment where a minimal optimal set of GCMs may be more useful than the entire set of available models, which can be computationally intensive, require more resources and time, and decrease model proficiency.

The rankings based on the Boruta algorithm indicate the difficulty for any single GCM to reliably simulate the daily observed precipitation satisfactorily across all grid points, although some GCMs perform relatively better and are quite consistent across the grid points for maximum and minimum temperature. Figure 5a–c shows the spatial spread of GCM performance, and the aggregated importance scores in Table 3 indicate that INM-CM4.8, MRI-ESM2.0, INM-CM5.0 and F-Goals-g3 exhibit relatively better skill in replicating observed precipitation, INM-CM4.8, MPI-ESM1.2-LR, INM-CM5.0 and MIROC6 for maximum temperature, and INM-CM5.0, INM-CM4.8, MPI-ESM1.2-LR and MRI-ESM2.0 for minimum temperature.

Fig. 5

Spatial spread of GCMs performance relative to observed climate data using BRF. (a) Ranking of spatial spread GCM precipitation relative to observed CPC data. (b) Ranking of spatial spread GCM maximum temperature relative to observed PGF data. (c) Ranking of spatial spread GCM minimum temperature relative to observed PGF data

Evaluation of the GCMs across the grid points indicated that a few are quite consistent in replicating the observations of the three climate variables; for example, INM-CM4.8 and INM-CM5.0 exhibit better skill in simulating the observed daily precipitation at 24.1% and 18.52% of the grid points, INM-CM4.8 and MPI-ESM1.2-LR at 35.19% and 16.67% of grid points for daily maximum temperature, and INM-CM5.0 and MPI-ESM1.2-LR at 53.70% and 25.93% for daily minimum temperature respectively.

4.4 Identification and evaluation of multi-model ensemble mean of GCMs

The evaluation of the GCMs by the approaches considered was a precursor to understanding their skill in matching the observations. Previous studies, for example Kim et al. (2016), have shown that uncertainties in climate projection can be reduced by identifying and adopting the better-performing GCMs for impact assessment, and earlier literature such as Weigel et al. (2010) and Miao et al. (2012) recommended the use of an ensemble mean of a collection of GCMs to optimize reliability in prediction and minimize uncertainty in climate variable assessment. In this study, the four best GCMs were selected after re-aggregation (Table 4), due in part to the significant difference observed in the aggregate importance score between models M14 and M12 in the Boruta random forest evaluation, to form the multi-model ensemble means referred to here as SU and BRF; the combination of all 16 GCMs is referred to as AME. These ensembles were further analysed and validated for the spatial pattern of precipitation and temperature and the return period of drought over the study period, and for their implications for hydrologic modelling.

Table 4 Summary of GCM overall ranking based on aggregated SU and IS of climate variables relative to observation data

The overall ranking indicates that MPI-ESM1.2-LR, MIROC6, INM-CM4.8 and NorESM2-MM are the most suitable under the symmetric uncertainty approach, and INM-CM4.8, INM-CM5.0, MPI-ESM1.2-LR and MRI-ESM2.0 under the Boruta random forest approach; the selection was limited to four GCMs due in part to the significant gap in skill relative to the other models in the rating-metric scores of Table 4.

4.4.1 Spatial and temporal pattern of precipitation and temperature of GCM ensemble mean

The spatial correlation and pattern of the GCM ensemble-mean annual precipitation and temperature were used to validate the three approaches and measure their uncertainty range relative to the observations, as shown in Fig. 6a–b and Fig. 7a–b for the period 1979–2012. The spatial correlations of mean annual precipitation and temperature indicate that BRF is consistent with the observations, with correlation values in the ranges 0.641–0.9995 (0.9991) and 0.4423–0.8345 (0.7136) respectively. The evaluations from SU and AME are quite satisfactory; however, mismatches or poor correlations are observed, especially in the Sahelo-Sudanian zone, with spatial correlations as low as 0.15 and 0.19 for SU and 0.01 and 0.36 for AME relative to the observations for annual mean precipitation and temperature respectively.

Fig. 6

Comparison of spatial correlation of GCM ensemble mean performance relative to observed climate data. (a) Annual precipitation. (b) Annual temperature

Fig. 7

Comparison of the spatial pattern of GCM ensemble performance relative to observed climate data. (a) Annual precipitation. (b) Annual temperature

The temporal assessment of the evaluated approaches shows that the correlations between the ensemble-mean temperatures and the observations are similar and quite skilful, with R2 = 0.984 and mean biases of 0.49 °C, 0.49 °C and 0.50 °C for BRF, SU and AME respectively. For precipitation, the BRF approach gives a better temporal correlation of 0.95 and an annual mean bias of 0.638 mm/year, compared with correlations of 0.82 and 0.88 and annual mean precipitation biases of 68.19 mm/year and 10.57 mm/year for SU and AME respectively. These biases are significant and visible in the south-western part of the basin, as seen in the grid-based analysis of annual precipitation in Fig. 8, where significant deviations occur between grid points 1 to 10 for SU and 1 to 26 for AME, while BRF shows an almost perfect match with the observations across the grid points.

Fig. 8

Variation of the temporal pattern of mean annual precipitation across grid points for 1979–2012

4.4.2 Assessment of climate extremes events of GCM ensemble mean

The temporal analysis of climate extreme indices was carried out at a 12-month time step for the BRF, SU and AME ensembles and compared with the observations in the four climatic zones of the basin, to assess their relative skill in predicting the pattern and frequency of extreme events and shifts in trend within the study period. The results indicate an inherent shift (Fig. 9a) from a relatively wet climate (1980–1998) to a transition from moderate to extreme droughts (1999–2012) in the Saharan zone, with the frequency of SPEI values from the BRF, SU and AME approaches consistent with the observations 53.3%, 45.5% and 48.9% of the time respectively. The corresponding agreement of SPEI values with the observations (Fig. 9b–d) is 48.99%, 53.28% and 47.22% in the Sahelo-Saharan zone, 45.2%, 45.96% and 43.43% in the Sahelo-Sudanian zone, and 48.23%, 50.25% and 39.14% in the Sudano-Guinean zone respectively. The statistical trends of the multi-year SPEI indices for 1980–2012 in the four climatic zones indicate that the BRF approach captured the direction of extreme events quite accurately relative to the observations, as seen in the Z-statistic values within the same trend envelope (Table 5). All the approaches predicted the trend direction satisfactorily, but the SU (–0.0013) and AME (–0.0033) approaches under-estimated the magnitude of the extreme events in the Saharan zone, indicating an insignificant shift from wet to drought events (drying trend), in contrast to the trend magnitude in the observations (–0.0057), which is consistent with the BRF estimate (–0.0058). The approaches had difficulty predicting the trend magnitude in the Sahelo-Sudanian zone, where they over-estimate the statistically significant wetting trend relative to the observations, with magnitudes of 0.0031, 0.0038 and 0.0057 for BRF, SU and AME respectively, against 0.001 for the observations. The deviation is most pronounced in the AME approach; a frequent and consistent shift in trend from wetting to drying periods is observed across the climatic zones and is consistently captured by the BRF approach relative to the observations. Bold values in Table 5 indicate consistent agreement between the simulated GCM ensemble means and the observations; the ensembles that better represent the basin’s climatic features are best suited to limiting the magnitude of uncertainties and providing accurate representation of hydroclimatic hazards in impact studies.

Fig. 9

Temporal variations of climate extreme events of the GCM ensemble means relative to observation using the BRF, SU and AME approaches in the Lake Chad basin. (a) Saharan zone. (b) Sahelo-Saharan zone. (c) Sahelo-Sudanian zone. (d) Sudano-Guinean zone

Table 5 Mann Kendall Z-statistic of linear trend of extreme event for the period 1980–2012

A trade-off was made in the evaluation process to justify the performance of the models selected by the two techniques irrespective of the climate variable of interest, and a varied level of performance was noticed from one grid point to another. The ensemble-mean approach was essential because it reduced biases, and combinations emphasizing a few models with good performance are required; this is particularly important in watersheds with sparse observed climate data and high climate variability.

The BRF approach proved promising in the evaluation, with lower biases in temperature and precipitation and a more accurate representation of the magnitude, pattern and trends of extreme events. Overestimation, or considerable bias, was observed for the SU and AME approaches in the south-western part of the basin. The associated uncertainties can be evaluated further by considering the sensitivity of the ensembles to alternative metrics or combinations of approaches. The BRF approach has nevertheless shown itself to be robust in evaluating the integrity of the regionalisation of GCMs over different timescales; models that exhibit good performance during the baseline period, and their combination, are more likely to represent the future period under climate change scenarios with confidence. The weighting scheme developed in this study is an exploratory framework that can be tested in other watersheds of interest for better water policy planning.

5 Conclusion

General circulation models are important tools that provide a pathway for simulating and assessing the perceived impact of climate change on local, regional and global hydrology. However, the choice of GCM input data, interpolation and downscaling method, timescale and timestep is critical for an effective and accurate representation of past, present and future basin hydrologic processes, and hence for fair and equitable river basin management and sustainable policy planning.

Earlier studies suggest that most evaluations ignore the inter-dependencies among models for a given variable, which can create over-fitting problems. This study evaluated an ensemble of 16 CMIP6 GCMs at a daily timestep and assessed the efficacy and robustness of the state-of-the-art Boruta random forest algorithm. The technique has been shown to be a viable tool for selecting relevant models: it reduces the redundancy, complexity and over-fitting problems associated with climate models, ensures sufficient overlap of the chosen models' ensemble mean with the observations, and seeks to limit the drawbacks of existing techniques, such as the inability to analyse complicated inputs, stochastic aspects, and intricately interrelated climatic and hydrologic properties, thereby reducing the transfer of uncertainties into hydrologic processes and addressing the critical temporal and spatial behaviour of the climate variables as a precursor for reliable and accurate predictions.

The study highlights the inherent weaknesses associated with temporal and spatial downscaling techniques: multiple techniques should be tried and examined to limit the uncertainty range and inadequacies of GCMs, because exploring different multi-site downscaling techniques is important for increasing the effectiveness of GCM performance, and a combination of appropriate GCMs can enhance spatial and temporal variability so as to reproduce accurately the multiple observed statistical properties of climate variables for improved output and reduced uncertainty in hydrologic modelling at regional and local basin scales. The models selected by the Boruta random forest technique are capable of reducing biases in precipitation compared with the other approaches, show similar performance for temperature, and capture the trends, patterns and magnitude of extreme events within the accepted confidence limits.

The findings of this study are not only meant to identify viable GCM datasets suitable for hydro-climatic study, but also to present a simple and efficient methodology for examining the limitations associated with the selected GCM ensemble for impact studies. The proposed methodology is not unique to this basin and can therefore be explored in other basins for a reliable representation of catchment climatology, representing a key step forward in GCM ensemble impact research.