1 Introduction

The demand for fresh water is increasing at an accelerated rate, even as limited water resources become more susceptible to over-extraction, pollution, and the effects of climate change. This has compelled the implementation of efficient water management systems that provide equitable distribution while pursuing sustainable development goals (Zhongming et al. 2021). To offer cost-effective solutions for better management of water resources, developing predictive models that accurately estimate water quantity, in particular precipitation and streamflow, is a pragmatic option (Block 2016). Understanding the relationship between teleconnections and local hydrological processes is a crucial notion in the domain of hydro-climatological forecasting (Alessandro 2018). Given the influence of teleconnection patterns (TPs) on meteorological parameters, these patterns can also significantly affect dam inflow. Enhancing the precision of dam inflow predictions is essential to assist policy-makers and operators in managing their reservoir operations more effectively and making more informed decisions for integrated water resources management and effective flood risk mitigation (Pini et al. 2020; Tran et al. 2021).

Recent machine learning (ML) and deep learning (DL) methods have shown success in hydrological forecasting, including streamflow prediction across daily to annual scales. While ML models have proven effective across various streamflow prediction scales, this study focuses on monthly inflow forecasting. This task presents unique challenges such as capturing non-linear relationships, handling missing data, and adapting to dynamic conditions. ML models excel in these areas: their ability to learn complex relationships from data, handle inconsistencies, and adapt to changing environments makes them well suited for monthly inflow prediction. Examples include successful applications of ML models for monthly streamflow forecasting (Awan and Bae 2013; Alizade et al. 2018; Babaei et al. 2019; Noorbeh et al. 2020; Wang et al. 2020a, b; Adnan et al. 2023; Akbarian et al. 2023). Furthermore, numerous DL models have been employed successfully for inflow prediction, as evidenced by works such as those by Apaydin et al. (2020), Herbert et al. (2021), Lee and Kim (2021), Choi et al. (2022), Feizi et al. (2022), Kim et al. (2022), Fan et al. (2023), and Han et al. (2023). As an alternative, Bayesian Neural Network (BNN) models have shown promising outcomes in predicting hydrogeological and meteorological variables, including groundwater level (Maiti and Tiwari 2014), wind speed (Mbuvha et al. 2017), river flow velocity (Ren et al. 2018), rainfall (Fan et al. 2022), and snow water equivalent (Vafakhah et al. 2022). Nonetheless, BNNs have not yet been applied to forecast dam inflows.

Previous studies have utilized a plethora of techniques to identify the most beneficial input variables and the appropriate time lag. Wang et al. (2020a, b) used cross-correlation to identify relevant climate patterns (out of 130 examined) for monthly dam inflow forecasts in China, demonstrating the value of these indicators. Latif et al. (2021) focused on past inflow data for daily predictions in Malaysia, while Rachmawati et al. (2021) used nearby station data (with cross-correlation) for monthly forecasts in Indonesia. These examples highlight the prevalence of correlation-based methods. However, such techniques may miss non-linear relationships crucial for accurate AI models, which often exploit complex relationships between variables. Therefore, methods that capture both linear and non-linear dynamics hold promise for improved forecasting accuracy. To address this shortcoming, our study employed Mutual Information (MI) for input variable selection. MI is a technique that can capture both linear and non-linear relationships between variables, including non-monotonic interactions (where the relationship between variables can change direction). This makes it a well-suited choice for identifying informative climate indicators for dam inflow prediction with AI models (Wang et al. 2020a, b; Gowri et al. 2024).

Additionally, various investigations have examined the relationship between large-scale atmospheric patterns and hydrological and meteorological elements. The findings demonstrate that these patterns can influence various components of hydro-climatological systems across a range of timescales, including river flow (Rasouli et al. 2012), daily precipitation maxima (Ruigar and Golian 2016), monthly rainfall (Saligheh and Sayadi 2017), temperature (De Beurs et al. 2018), maximum monthly flow (Linh et al. 2021), seasonal precipitation (Helali et al. 2022), and groundwater level (Chu et al. 2022). In recent years, such techniques have frequently been used to anticipate inflow; for example, Yang et al. (2017) forecast inflow for two reservoirs in the United States and China, using precipitation, evaporation, and river flow data, in addition to 17 climatic indices, as inputs for Support Vector Machine (SVM), ANN, and Random Forest (RF) ML models. The outcomes indicated the potential of climatic indices and these models in predicting inflow. Furthermore, Banihabib et al. (2018) forecast the runoff entering the Dez Dam with a hybrid Monthly Autoregressive Moving Average and Nonlinear Autoregressive Model with Exogenous inputs (MARMA-NARX) model; the results indicated that the North Atlantic Oscillation (NAO) index proved superior to other indices in influencing the selected dam inflows. In addition to the works mentioned above, Kim et al. (2019) and Lee et al. (2020) predicted the inflow to various dams in South Korea by employing global atmospheric teleconnections, and Panahi et al. (2021) applied various TPs to anticipate the flow of river systems in Malaysia. Furthermore, research by Han et al. (2023) shows that the application of TPs has a favorable impact on dam inflow forecasting. These studies suggest that using TPs is an advantageous approach in the creation of inflow forecasts, since it can enhance the precision of the prediction models.

A review of previous studies shows that including TPs as indicators to anticipate river flow has the potential to improve the accuracy of the forecast model. Nevertheless, the specific patterns, their lag times, and their influence on dam inflow in various areas remain ambiguous.

Consequently, the essential objective of this research is twofold:

  1. Determine the most efficient atmospheric TPs and their most appropriate delay times in the context of the inflow to three dams in Iran.

  2. Assess the potential of BNNs to improve the precision of the monthly flow prediction model by incorporating uncertainty quantification. BNNs achieve this by providing probabilistic information about the predictions. We employ the output distribution characteristics from the BNN to calculate the range of potential inflow values for the dams, offering valuable insights into the confidence of each forecast.

For this purpose, after choosing the most efficient climate indices and determining their ideal delay time, various scenarios comprising these indices, coupled with rainfall and flow data (the most prevalent and extensively employed data for water discharge prediction), were used as inputs for the BNN forecasting model. BNNs offer advantages over traditional ANNs by incorporating uncertainty quantification, potentially leading to more reliable forecasts. Finally, to evaluate the benefit of leveraging a Bayesian approach in training the artificial neural network model, the dam inflow forecasts of the ANN and BNN models were compared with one another.

2 Study areas

Since Iran is a large country with diverse climates, it is divided into six main river basins and 30 sub-basins based on watershed classification. In this study, to determine how global oceanic-atmospheric patterns influence reservoir inflow, the study regions were selected to encompass the diverse climatic conditions of Iran. Therefore, three reservoir dams located in three different secondary catchment areas were investigated. Another criterion for the choice of dams was the availability of precipitation data, together with a hydrometric station upstream of the dams, over a prolonged statistical period (30 years or longer); sites with inadequate or unavailable data were excluded. The locations of the dams and the chosen watersheds are illustrated in Fig. 1.

Fig. 1
figure 1

Location of selected dams, rain gauges, and hydrometric stations in the present study

2.1 Zayanderood dam catchment area

The catchment area upstream of the Zayanderood Dam is in the range of longitudes 50.02 to 50.75 degrees east and between latitudes 32.30 and 33.20 degrees north. The region (4100 km2) is a section of the primary drainage zone of the Iranian Central Plateau in the larger division of the Iran hydrological zone. The Zayanderood River originates on the slopes of the Zagros Mountains, with its apex at an elevation of 3974 m. Its base can be found at the dam with a level of 1976 m above sea level, and the average altitude of the drainage system is 2477 m. Following the Köppen-Geiger method (Köppen and Geiger 1928), for the period extending from 1990 to 2014, the climate type for the upstream region of the Zayanderood Dam is classified under the category of "snowy climate with dry and very hot summers" (Dsa). This categorization is based on the data in the work by Raziei (2022). The outlet of this water source is located at the coordinates of 32.72 degrees latitude and 50.73 degrees longitude.

2.2 Amirkabir dam catchment area

The upstream region of the Amirkabir Dam is important in political and economic terms due to its critical provision of drinking water and electricity for the Tehran and Alborz provinces. The reservoir’s capacity is 101 MCM, and the lake behind it is 1.1 km long. The catchment area of the Amirkabir Dam lies in the geographic range of 51°3' to 51°35' east longitude and 35°53' to 36°10' north latitude, in the northern part of the Salt Lake basin. The drainage basin studied here spans an area of 850 square kilometers. The highest point of the drainage basin is in the northeastern quadrant, at an elevation of 4407 m, while the lowest point is the riverbed at the dam site, at an elevation of 1710 m above sea level. The average annual rainfall of this area is 251 mm. The upper region of the Amirkabir Dam is characterized by a semi-arid and cold climate (BSk) according to the Köppen-Geiger classification for the period between 1990 and 2014 (Raziei 2022). The dam’s outlet is situated at 35.95 degrees north latitude and 51.05 degrees east longitude.

2.3 Karun 3 dam catchment area

The Karun drainage basin in the Zagros mountain range covers approximately 24,000 square kilometers, between latitudes 30.25 and 32.70 north and longitudes 49.52 to 51.98 east. The climate of the Karun Dam’s drainage zone is classified as moderate with dry and very hot summers, according to the Köppen-Geiger method for the period between 1990 and 2014 (Raziei 2022). The principal portion of the annual precipitation falls during the winter season. The average annual rainfall in the catchment area of the Karun 3 Dam is 620 mm, with the snow catchment area lying above 2000 m above sea level and covering roughly 16,000 square kilometers. The Karun 3 Dam is located at latitude 31.72 north and longitude 50.15 east.

2.4 Station variables

In this research, a rain gauge station and a hydrometric station, both situated in the region of each selected dam, were used to estimate the monthly inflow to each reservoir, as depicted in Fig. 1. The pertinent specifications of all stations used in the current research, categorized by location, are provided in Table 1. Furthermore, the inflow characteristics of the aforementioned locations and the period over which statistical data were collected for each dam are listed in Table 2.

Table 1 Information of the Selected Stations
Table 2 Statistical characteristics of studied dams

3 Methodology

The primary components of the research process are depicted in Fig. 2, and each phase is elaborated on in the following manner:

Fig. 2
figure 2

Flowchart of the proposed approach for dam inflow prediction

3.1 Teleconnection patterns

Studies have revealed that TPs play an instrumental role in modulating climatic variables across the globe (Belteram and Carbonin 2013; Salehizade et al. 2015; Steirou et al. 2017). TPs refer to recurrent, persistent, extensive patterns of atmospheric pressure and general circulation anomalies that span vast geographical areas. Although these patterns often endure for a few weeks to a few months, they can sometimes persist for prolonged periods; consequently, they constitute a crucial component of the inter-annual and inter-decadal variability of the atmospheric flow. The regions affected by TPs are distinctive: some patterns extend over vast oceanic and continental zones, with a scope spanning the entire planet, while some only influence the North Atlantic, others involve Eurasia, and some affect the area from the eastern seaboard of North America to central Europe (NOAA 2023). In this research, data corresponding to 8 TPs were gathered from NOAA PSL (2023), the Physical Sciences Laboratory of the National Oceanic and Atmospheric Administration (Table 3).

Table 3 The names of the indices used in the present study

The following section presents key information regarding the indices utilized in this research, complete with relevant citations. NAO is a key TP across all seasons, characterized by a dipole of sea level pressure anomalies between Greenland and the mid-latitudes of the North Atlantic, derived from differences between the Azores and Iceland (Barnston and Livezey 1987; Cropper et al. 2015). PDO describes long-term fluctuations in Pacific Ocean surface temperatures, with phases lasting 20–30 years. A positive phase results in lower surface temperatures near Hawaii and warmer conditions along North America’s western coast (Mantua et al. 1997; Bond et al. 2003; Yin et al. 2021). PNA represents a major mode of low-frequency variability in the Northern Hemisphere. In its positive phase, it features high geopotential heights over Hawaii and the intermountain region of North America, and low heights over the southeastern U.S. (Yuan et al. 2015). WP manifests as a north–south dipole of anomalies during winter and spring, with centers over the Kamchatka Peninsula and Southeast Asia/western tropical North Pacific (Choi and Moon 2012). The Jones NAO is a variant of the NAO index that uses data from Gibraltar and southwest Iceland, extending the record back to 1823; it is particularly relevant for winter (Hurrell 1995; Jones et al. 1997). The East Atlantic/West Russia pattern is a significant TP affecting Eurasia year-round, marked by anomalies between Western Europe, North China, the North Atlantic, and the Caspian Sea, fluctuating between positive and negative phases (Barnston and Livezey 1987; Ionita 2014). IOD involves fluctuations in sea surface temperatures between the western and southeastern equatorial Indian Ocean, measured by the Dipole Mode Index (DMI) (Pompa-Garcia and Nemiga 2015). The Nino3.4 index tracks sea surface temperature anomalies in the equatorial Pacific, used to monitor El Niño and La Niña events, with anomalies evaluated over a 30-year base period (Abiy et al. 2019; Trenberth et al. 2023).

3.2 Artificial neural networks (ANNs)

Artificial intelligence (AI) derives its neural networks from the neural processes of living organisms (Keskin and Terzi 2006). Neural networks have been extensively adopted as estimators in numerous complex fields. Specifically, in the Multilayer Perceptron (MLP), the aim is to learn the input–output mapping by propagating the inputs through a sequence of hidden layers with weighted nonlinear transformations (Seyedashraf et al. 2018). The output of an MLP regression network with a single output node is defined from the input data by Eqs. (1) and (2):

$$f_{k } \left( x \right) = b_{k} + \mathop \sum \limits_{j} v_{jk} h_{j} \left( x \right)$$
(1)
$$h_{j } \left( x \right) = \tanh ( a_{j} + \mathop \sum \limits_{i} w_{ij} x_{i} )$$
(2)

In Eqs. (1) and (2), x represents the input vector, where xi refers to specific features or predictors. The function hj(x) corresponds to the activation at hidden-layer node j, weighted by wij, while vjk represents the weight from the hidden layer to the output, and bk is the bias of the output node. The tanh activation function provides the nonlinearity needed to approximate intricate nonlinear functions. Activation functions can be categorized into three main classes: binary (zero-value), linear, and nonlinear. The selection of activation functions is among the parameters of a neural network that require calibration.

In the input–output prediction process for an MLP network, the network’s weights and biases are adjusted to minimize the error computed over the training data (ED). ED is a function of the training dataset, denoted as D = \(\{x^{(i)}, t^{(i)}\}\), and is defined in Eq. (3):

$$E_{D} = \frac{1}{2}\mathop \sum \limits_{i = 1}^{N} \left( {t^{\left( i \right)} - y\left( {X^{\left( i \right)} ;w} \right)} \right)^{2}$$
(3)

In Eq. (3), X(i) represents the i-th input data point, and t(i) is its corresponding target value. N denotes the total number of data points in the dataset, while y(X(i);w) is the model’s predicted value for the i-th input data.

Gradient-based techniques, which advance iteratively through the parameter space, are used to solve this minimization problem. The weights are updated step by step according to Eq. (4), moving in the direction opposite to the gradient of the error function (Goan and Fookes 2020).

$$w_{t + 1} = w_{t} - n\frac{{\partial E_{D} \left( w \right)}}{\partial w}$$
(4)

In Eq. (4), wt represents the model weights at the t-th iteration, and wt+1 is the updated weight vector. The term n refers to the learning rate, while ∂ED(w)/∂w is the gradient of the error function with respect to the weights. Like the activation function, n is a parameter that must be calibrated for each model. If n is too small, training is prolonged and the outcome may not be optimal; if it is too large, the model may diverge and fail to reach a solution. The errors are propagated back through the weighted connections, primarily using backpropagation (Mbuvha et al. 2017). As mentioned above, activation functions create a nonlinear milieu in the neural network architecture, enhancing the model’s capacity to identify the nonlinear relationships among the input variables (Wang et al. 2020a, b). Nine activation functions, including one linear and eight nonlinear types, as detailed in Table 4, were employed in this research.
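As an illustration of Eqs. (1)–(4), the following NumPy sketch (with a toy dataset and hypothetical array shapes, not the configuration used in this study) computes the forward pass of a one-hidden-layer MLP, the sum-of-squares error, and a single gradient-descent weight update:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 samples, 3 input features, 1 target (shapes are hypothetical).
X = rng.normal(size=(100, 3))
t = rng.normal(size=(100, 1))

# One hidden layer with 6 nodes (cf. Eqs. 1 and 2).
W = rng.normal(scale=0.1, size=(3, 6))   # input-to-hidden weights w_ij
a = np.zeros(6)                          # hidden biases a_j
V = rng.normal(scale=0.1, size=(6, 1))   # hidden-to-output weights v_jk
b = np.zeros(1)                          # output bias b_k

def forward(X):
    h = np.tanh(X @ W + a)               # Eq. (2): hidden activations h_j(x)
    y = h @ V + b                        # Eq. (1): network output f_k(x)
    return h, y

# Eq. (3): sum-of-squares error E_D over the dataset.
h, y = forward(X)
E_D = 0.5 * np.sum((t - y) ** 2)

# Eq. (4): one gradient-descent step with learning rate n.
n = 0.01
grad_y = -(t - y)                        # dE_D/dy
grad_V = h.T @ grad_y                    # dE_D/dV
grad_b = grad_y.sum(axis=0)              # dE_D/db
grad_z = (grad_y @ V.T) * (1 - h ** 2)   # chain rule through tanh to the hidden pre-activations
grad_W = X.T @ grad_z                    # dE_D/dW
grad_a = grad_z.sum(axis=0)              # dE_D/da

W -= n * grad_W; a -= n * grad_a
V -= n * grad_V; b -= n * grad_b
```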

Table 4 Neural Network Activation Functions used in this research

3.3 Introducing BNNs

This technique is a form of decision support system that serves as a versatile tool for identifying cause-and-effect relationships in the form of a network of possibilities. A critical facet of the BNN method is that it does not require precise information or the entire history of an event (Bonneville and Earls 2022), yet it generates satisfactory outcomes for hydro-meteorological variable estimation even with incomplete and inaccurate data points (Maiti and Tiwari 2014; Kasiviswanathan and Sudheer 2017; Li et al. 2023). Additionally, the output of the BNN model is a probability distribution, allowing for the analysis of uncertainty in the results (Bao et al. 2023; Bai and Chandra 2023).

3.3.1 Probabilistic concepts of BNNs

The neural network architecture illustrated earlier may be tailored through a range of adjustments to enable certain assumptions to be made about the data. Within this research, the Bayesian framework was used to configure these parameters (Lauret et al. 2008). The Bayesian framework enables the use of initial conditions for the distribution of network weights. These conditions can be specified on a unique or generalized basis for the weights across the entire network. The core principle underpinning the Bayesian framework is Bayes’ theorem, which specifies the distribution of the network parameters, incorporating the data and prior distribution assumptions (Goan and Fookes 2020). For a neural network structure denoted as "H" with inputs, hidden layers, and an output, the weights "w" and the training information "D," the Bayes’ formula’s definition reads as follows:

$$P(w|D,H) = \frac{P(D|w,H)\,P(w|H)}{P(D|H)}$$
(5)

where P(w|D,H) (read: the probability of w given D and H) is the posterior probability of the weights considering the training data and the structure of the model. P(D|w,H) is the likelihood of the data used to train the model. P(w|H) represents the initial or prior distribution of weights. P(D|H) is the probability of the data used in the model, which is referred to as the evidence (Lampinen and Vehtari 2001).

3.3.2 BNN structure and components

  • The Likelihood

The likelihood is essentially the probability of a given dataset emerging from a particular combination of parameters. Within ML, the goal is to determine parameters that adequately represent the characteristics evident in the data; in that context, the data are considered fixed, whereas the parameters are variable and are adjusted during training. Following probability theory, the error function defined in Eq. (3) enters the noise model as the negative logarithm of the likelihood. If a Gaussian noise model with precision parameter β is selected, the likelihood is described by Eq. (6).

$$\begin{aligned} P(D|w,\beta,H) & = \frac{1}{Z_{D}(\beta)} \exp\left(-\beta E_{D}\right) \\ & = \frac{1}{Z_{D}(\beta)} \exp\left(-\frac{\beta}{2}\sum_{i = 1}^{N}\left(t^{(i)} - y(X^{(i)};w)\right)^{2}\right) \end{aligned}$$
(6)

where ZD(β) is the normalization constant.

  • The Prior

Choosing a prior distribution introduces constraints on the network weights (Bai and Chandra 2023). Typically, a Gaussian with zero mean and precision α is taken as the prior distribution of the weights, as defined in Eq. (7) (Zhang et al. 2022).

$$\begin{aligned} P(w|\alpha,H) & = \frac{1}{Z_{w}(\alpha)} \exp\left(-\alpha E_{w}\right) \\ & = \frac{1}{Z_{w}(\alpha)} \exp\left(-\frac{\alpha}{2}\sum_{i = 1}^{N} w_{i}^{2}\right) \end{aligned}$$
(7)

Consequently, Zw represents the normalization constant, while Ew corresponds to the negative log prior probability over the weights (up to the scaling by α and an additive constant).

  • The Posterior

The posterior distribution of the weights is obtained from Eq. (8):

$$\begin{aligned} P(w|\alpha,\beta,H) & = \frac{1}{Z_{M}(\alpha,\beta)}\exp\left(-\left(\alpha E_{w} + \beta E_{D}\right)\right) \\ & = \frac{1}{Z_{M}(\alpha,\beta)}\exp\left(-M(w)\right) \end{aligned}$$
(8)

In the Bayesian framework, fixing the hyperparameters α and β allows the maximum a posteriori (MAP) estimate of the weight vector to be found by minimizing the negative log posterior M(w) through traditional error backpropagation (Mbuvha et al. 2017; Zhang et al. 2022).

  • Hyperparameters Estimation

α and β in Eq. (8) are the hyperparameters of the BNN that control the prior distribution over the weights and the noise, respectively. The selection of these hyperparameters is crucial for the performance of the BNN. There is no one-size-fits-all approach, and different techniques such as variational inference (Tzikas et al. 2008) or Markov chain Monte Carlo (MCMC) (Maire et al. 2019) can be employed to find suitable values for α and β. We implemented mean-field variational inference in this study. This method approximates the true posterior distribution p(w|D) over the weights w given the data D with a simpler factorized distribution q(w) = q(α)·q(β).

The variational parameters α and β are then optimized to minimize the Kullback–Leibler (KL) divergence between the approximate distribution q(w) and the true posterior p(w|D), as given in Eq. (9):

$$\mathrm{KL}\left(q(w)\,\|\,p(w|D)\right) = E_{q(w)}\left[\log q(w) - \log p(w|D)\right]$$
(9)

By expanding the terms, we get:

$$\begin{aligned} \mathrm{KL}\left(q(w)\,\|\,p(w|D)\right) & = E_{q(w)}\left[\log q(w)\right] - E_{q(w)}\left[\log p(w,D)\right] + \log p(D) \end{aligned}$$
(10)

The last term log p(D) is a constant with respect to the variational parameters. Therefore, minimizing the KL divergence is equivalent to maximizing the evidence lower bound (ELBO):

$$\mathrm{ELBO} = E_{q(w)}\left[\log p(w,D)\right] - E_{q(w)}\left[\log q(w)\right]$$
(11)

\(E_{q(w)}\) represents the expectation taken with respect to the approximate distribution q(w), allowing us to calculate the KL divergence and ultimately maximize the ELBO, guiding the optimization process to find a good approximation of the true posterior distribution. Optimizing the ELBO with respect to q(α) and q(β) gives us the update equations for α and β. The specific forms of these update equations depend on the choices made for the approximate distributions q(α) and q(β). For example, if we assume q(α) and q(β) to be Gaussian distributions with means and variances as the variational parameters, the update equations would involve taking expectations over these Gaussian distributions and setting the derivatives of the ELBO with respect to the means and variances to zero (Goan and Fookes 2020; Chang 2021).

In essence, we don’t directly choose α and β, but rather the optimization process guides them towards values that minimize the KL divergence. The specific forms of the update equations for α and β depend on the chosen forms for the approximate distribution q(w) (e.g., Gaussian).

Here’s the variational inference workflow:

  1. Initial guess: We start with an initial guess for the approximate distribution q(w), often a simple choice such as a fully factorized Gaussian. This initial guess defines the initial values of α and β (e.g., the means and variances in the Gaussian case).

  2. Expectation and ELBO evaluation: The terms of the ELBO involve expectations taken with respect to the current approximate distribution q(w), defined by α and β. These expectations, \(E_{q(w)}\), are computed, and the ELBO is evaluated from them.

  3. Optimization: The KL divergence is minimized (equivalently, the ELBO is maximized) by adjusting the variational parameters α and β, typically with a gradient-based algorithm that iteratively updates them using the gradients of the ELBO with respect to these parameters.

  4. Repeat: Steps 2 and 3 are repeated until convergence, i.e., until the updates to α and β become very small, indicating that q(w) has reached a good approximation of the true posterior p(w|D).
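To make the workflow concrete, the following minimal sketch builds a mean-field variational BNN with Keras and TensorFlow Probability, the libraries used in this study (Sect. 3.5). The synthetic data, layer sizes, prior/posterior helper functions, and training settings are purely illustrative assumptions, not the calibrated configuration reported in Sect. 4.

```python
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# Hypothetical data: N monthly samples with 3 selected predictors.
N = 256
rng = np.random.default_rng(0)
x = rng.normal(size=(N, 3)).astype("float32")
y = rng.gamma(2.0, 20.0, size=(N, 1)).astype("float32")

# Mean-field posterior q(w): independent Gaussians over every weight and bias,
# with trainable means and softplus-transformed standard deviations.
def posterior_mean_field(kernel_size, bias_size=0, dtype=None):
    n = kernel_size + bias_size
    c = np.log(np.expm1(1.0))
    return tf.keras.Sequential([
        tfp.layers.VariableLayer(2 * n, dtype=dtype),
        tfp.layers.DistributionLambda(lambda t: tfd.Independent(
            tfd.Normal(loc=t[..., :n],
                       scale=1e-5 + tf.nn.softplus(c + t[..., n:])),
            reinterpreted_batch_ndims=1)),
    ])

# Fixed zero-mean Gaussian prior over the weights (cf. Eq. 7).
def prior(kernel_size, bias_size=0, dtype=None):
    n = kernel_size + bias_size
    return lambda _: tfd.Independent(
        tfd.Normal(loc=tf.zeros(n, dtype=dtype), scale=1.0),
        reinterpreted_batch_ndims=1)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tfp.layers.DenseVariational(6, posterior_mean_field, prior,
                                kl_weight=1.0 / N, activation="tanh"),
    tfp.layers.DenseVariational(tfp.layers.IndependentNormal.params_size(1),
                                posterior_mean_field, prior, kl_weight=1.0 / N),
    tfp.layers.IndependentNormal(1),  # output distribution with learnable mean and variance
])

# The loss is the negative log-likelihood; DenseVariational adds the weighted
# KL term, so minimizing the total loss corresponds to maximizing the ELBO of Eq. (11).
negloglik = lambda y_true, y_dist: -y_dist.log_prob(y_true)
model.compile(optimizer=tf.keras.optimizers.Adam(0.01), loss=negloglik)
model.fit(x, y, epochs=200, batch_size=32, shuffle=True, verbose=0)
```

In this sketch, each DenseVariational layer carries the mean-field posterior and prior over its own weights, while the final layer returns the output distribution used in the next subsection.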

  • Confidence Intervals Calculation

In this work, we created a probabilistic BNN model that outputs a distribution. This approach accounts for aleatoric uncertainty, i.e., the noise intrinsic to the data-generating process. Instead of predicting a single point estimate, the model represents the output as an independent normal distribution with learnable mean and variance parameters. This allows the negative log-likelihood to be used as the loss function to evaluate the model’s ability to reproduce the true data (targets). As a result, we can calculate confidence intervals (CI) for the predictions. The 95% CI is computed as \(\mu \pm (1.96 \times \sigma )\), where \(\mu\) and \(\sigma\) are the mean and standard deviation (stdv) of the output distribution. The upper and lower bounds are calculated following Eq. (12):

$$\begin{aligned} & Upper = \left( {prediction\_mean + \left( {1.96 * prediction\_stdv} \right)} \right) \\ & Lower = \left( {prediction\_mean - \left( {1.96 * prediction\_stdv} \right)} \right) \\ \end{aligned}$$
(12)
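Assuming the BNN output layer returns a normal distribution per sample (a mean and a standard deviation), Eq. (12) can be evaluated as in the following short sketch; the numerical values are hypothetical and only illustrate the calculation:

```python
import numpy as np

# Illustrative per-month outputs of the BNN distribution (e.g. dist.mean() and
# dist.stddev() in TensorFlow Probability), together with the observed inflows.
prediction_mean = np.array([12.4, 30.1, 55.7])   # hypothetical inflows, m3/s
prediction_stdv = np.array([ 2.0,  5.5, 14.2])
observed        = np.array([11.0, 33.0, 80.0])

# Eq. (12): 95% confidence interval around the predicted mean.
upper = prediction_mean + 1.96 * prediction_stdv
lower = prediction_mean - 1.96 * prediction_stdv

# Fraction of observations falling inside the 95% band.
coverage = np.mean((observed >= lower) & (observed <= upper))
print(lower, upper, coverage)
```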

Figure 3 illustrates the structure of a BNN model, which contains three inputs (red layer) that feed the nodes of the hidden layer (yellow layer). The weights are drawn from normal probability distributions. The output probability distribution, i.e., the predicted value, is generated by the model in the third layer (blue layer). The mean of this distribution is taken as the predicted value, which is then compared with the actual value for reference.

Fig. 3
figure 3

Structure of BNNs

3.4 The mutual information approach to predictive scenarios development

Conventional approaches to selecting relevant inputs and devising scenarios typically rely on variants of Pearson’s correlation coefficient (Pearson 1896), which has been extensively used to study the interconnections among variables in numerous flow forecasting analyses (Rachmawati et al. 2021; Wang et al. 2020a, b). Although the correlation coefficient is simple to apply to large data sets, it only measures the linear relationship between random variables (Zhang et al. 2019).

In this study, we selected variables that directly or indirectly affect dam inflow based on their mutual information content. Mutual information (MI) is a measure of the amount of information one random variable carries about another (Kraskov et al. 2004); in other words, the MI of two random variables quantifies the amount of dependence (information) between them. Unlike correlation coefficients, which can only handle linear dependence, MI can detect both linear and non-linear relationships between variables, a property that has made it a popular choice for variable selection (Sulaiman and Labadin 2015). Formally, from information theory, the MI is defined as in Eq. (13):

$$I(X,Y) = \iint \mu(x,y)\,\log\frac{\mu(x,y)}{\mu_{x}(x)\,\mu_{y}(y)}\,dx\,dy$$
(13)

where the marginal densities of X and Y are \({\mu }_{x}\left(x\right)=\int \mu \left(x,y\right)dy\) and \({\mu }_{y}\left(y\right)=\int \mu \left(x,y\right)dx\). This index takes values greater than or equal to zero; a value of zero indicates no linear or non-linear dependence between the two variables x and y.

Considering the diverse time frames over which the variables might influence the flow, all the variables, including precipitation, inflow, and the 8 teleconnection patterns (Table 3), were considered with lags of up to 12 months for predicting the inflow into each dam. Next, to investigate the dependence between the potential predictors and the inflow to each dam, the MI index was tabulated to indicate the most influential lag of every parameter for each dam inflow. Based on the MI index, the 5 variables most strongly related to the inflow into each dam were determined as the predictors and were subsequently used to formulate flow prediction scenarios.

It is worth noting that when the number of available predictors is large, it may not be feasible to use every conceivable combination of them to formulate forecasting scenarios. For example, using 40 potential inputs would result in roughly one trillion different combinations (\(2^{40}-1\)). Therefore, this study restricted the selection to 5 variables as inputs, generating \(2^{5}-1 = 31\) distinct scenarios to train the ANN and BNN models.
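The selection and scenario-building procedure can be sketched as follows, using scikit-learn's mutual_info_regression as one possible MI estimator (the study does not prescribe a specific estimator, and the series below are synthetic placeholders for the real inflow, precipitation, and index data):

```python
import itertools
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Hypothetical monthly series; in the study the candidates are the dam inflow,
# precipitation, and the 8 teleconnection indices of Table 3.
rng = np.random.default_rng(0)
months = 360
df = pd.DataFrame({
    "inflow":   rng.gamma(2.0, 20.0, months),
    "precip":   rng.gamma(1.5, 30.0, months),
    "PNA":      rng.normal(size=months),
    "JonesNAO": rng.normal(size=months),
    "Nino34":   rng.normal(size=months),
})
target = df["inflow"]

# For every candidate variable, find the 1-12 month lag with the highest MI
# with the dam inflow; then keep the 5 best (variable, lag) predictors.
best = {}
for col in df.columns:
    scores = {}
    for lag in range(1, 13):
        x = df[col].shift(lag)
        valid = x.notna()
        scores[lag] = mutual_info_regression(x[valid].to_frame(),
                                             target[valid], random_state=0)[0]
    best_lag = max(scores, key=scores.get)
    best[col] = (best_lag, scores[best_lag])

top5 = sorted(best, key=lambda c: best[c][1], reverse=True)[:5]

# Enumerate all non-empty combinations of the 5 predictors: 2**5 - 1 = 31 scenarios.
scenarios = [combo for r in range(1, 6)
             for combo in itertools.combinations(top5, r)]
print(len(scenarios))  # 31
```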

3.5 Calibration and test of ANN and BNN models

The features considered in our design for calibration and verification are the number of neurons and layers, the activation function, the learning rate, the batch size, and the number of epochs. These properties were determined under the constraints of achieving the smallest output error with the simplest ANN structure (Almassri et al. 2018). Therefore, the network architecture design started with few hidden neurons, and the number of hidden neurons was then adjusted. The architecture of the ANN and BNN models that yielded the best result (lowest RMSE) for each scenario was retained, and its results were used for model assessment.

To evaluate the performance of the ANN and BNN models for each of the 31 defined scenarios, we split the data into training and testing sets: 80% of the data was used for training the models, while the remaining 20% was used for testing. To avert overfitting, the data were assigned to the training and test sets at random. The adopted approach for training the models is stochastic gradient descent (Ketkar 2017). In this procedure, two parameters must be defined, the epoch and batch size, which are explained in detail below.

Stochastic gradient descent is an optimization algorithm used for training artificial-intelligence-based learning algorithms, especially DL neural networks. Its task is to identify a set of model parameters that perform well against a chosen assessment criterion such as the mean square error. The algorithm is iterative: the search proceeds in multiple discrete steps, incrementally improving the model parameters with each iteration. Each cycle consists of using the model with the current parameter set to make predictions on some of the samples, comparing the predictions with the actual results, calculating the error, and using the error to update the model’s internal parameters. The updating method varies depending on the algorithm, but for ANNs the backpropagation algorithm is commonly used (Amari 1993; Bao et al. 2023).

For the stochastic gradient descent approach, the entire dataset is divided into batches of a specified size. This parameter can range from one to the total number of samples. For instance, with 320 samples, the batch size could be any number between 1 and 320; if it is set to 32, the data are divided into 10 batches of 32 elements each, so one pass over the training data comprises 10 iterations. Epoch, on the other hand, refers to the number of times the entire data sample is used for training the model (Li et al. 2024a, b). In addition to defining the aforementioned parameters, the shuffling technique is employed: after each complete pass of the model (one epoch), the order of the data samples is altered so that the composition of each batch also changes before the model is trained again. Hence, after each execution of the model, the data are once more divided into training and test sets, with shuffling executed before this partitioning. Repeating this process ensures that the data are used for training the model in various permutations and prevents overfitting to a specific subset (Li et al. 2024a, b). The ANN and BNN models were implemented using the Keras and TensorFlow libraries in the Python environment.
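A minimal Keras sketch of this training protocol (80/20 split, stochastic gradient descent with a chosen batch size and number of epochs, and shuffling between epochs) is given below; the data, layer sizes, and hyperparameter values are placeholders rather than the calibrated values of Table 5:

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(320, 5)).astype("float32")   # 5 selected predictors (hypothetical)
y = rng.gamma(2.0, 20.0, 320).astype("float32")   # monthly inflow, m3/s (hypothetical)

# Random 80/20 split of the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(5,)),
    tf.keras.layers.Dense(6, activation="tanh"),
    tf.keras.layers.Dense(1),
])

# Stochastic gradient descent; the learning rate n is one of the calibrated parameters.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss="mse")

# batch_size=32 splits the 256 training samples into 8 batches per epoch;
# shuffle=True reorders the samples before each epoch.
model.fit(X_train, y_train, epochs=100, batch_size=32,
          shuffle=True, validation_data=(X_test, y_test), verbose=0)

print(model.evaluate(X_test, y_test, verbose=0))
```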

3.6 Performance evaluation of ANN and BNN models

3.6.1 Deterministic metrics

Since the output of the BNN model is a probability distribution, the mean of this distribution is used to calculate the evaluation criteria. Traditional evaluation metrics, namely the Normalized Root Mean Square Error (NRMSE), Nash–Sutcliffe Efficiency (NSE), and Symmetric Mean Absolute Percentage Error (SMAPE), were used alongside Akaike’s Information Criterion (AIC) and Schwarz’s Bayesian Information Criterion (BIC). The calculation methods of all these metrics are given in Eqs. (14) to (19).

$$NRMSE = \frac{{\sqrt {\frac{1}{n}\mathop \sum \nolimits_{t = 1}^{n} (Q_{t}^{observed} - Q_{t}^{Predicted} ) ^{2} } }}{{\sqrt {\frac{1}{n}\mathop \sum \nolimits_{t = 1}^{n} (Q_{t}^{observed} ) ^{2} } }}$$
(14)
$$NSE = 1 - \frac{{\mathop \sum \nolimits_{t = 1}^{n} (Q_{t}^{observed} - Q_{t}^{Predicted} ) ^{2} }}{{\mathop \sum \nolimits_{t = 1}^{n} (Q_{t}^{observed} - \overline{{Q_{t} }}^{ observed} ) ^{2} }}$$
(15)
$$SMAPE = \frac{100}{n}\mathop \sum \limits_{t = 1}^{n} \frac{{\left| {Q_{t}^{observed} - Q_{t}^{Predicted} } \right|}}{{\left( {\left| {Q_{t}^{observed} } \right| + \left| {Q_{t}^{Predicted} } \right|} \right)/2}}$$
(16)

where n constitutes the number of data points, \({Q}_{t}^{observed}\) stands for actual data, and \({Q}_{t}^{Predicted}\) represents predicted data.

$$AIC = T\log \left( \frac{SSE}{T} \right) + 2\left( {k + 2} \right)$$
(17)
$$SSE = \mathop \sum \limits_{t = 1}^{n} (Q_{t}^{observed} - Q_{t}^{Predicted} ) ^{2}$$
(18)

where T is the number of observations used for estimation and k is the number of predictors in the model. Different computer packages use slightly different definitions for the AIC, although they should all lead to the same model being selected. The k + 2 term appears because there are k + 2 parameters in the model: the coefficients for the predictors, the intercept, and the variance of the residuals. The idea is to penalize the model’s goodness of fit, measured by the Sum of Squares of Errors (SSE), by the number of parameters that need to be estimated. The model with the minimum AIC value is often the best forecasting model (Hyndman and Athanasopoulos 2018).

$$BIC = T\log \left( \frac{SSE}{T} \right) + \left( {k + 2} \right)\log \left( T \right)$$
(19)

As with the AIC, minimizing the BIC is intended to give the best model. The model chosen by the BIC is either the same as that chosen by the AIC, or one with fewer terms. This is because the BIC penalizes the number of parameters more heavily than the AIC (Hyndman and Athanasopoulos 2018).
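For reference, the deterministic metrics of Eqs. (14)–(19) can be computed as in the following sketch (the observed and predicted values shown are hypothetical):

```python
import numpy as np

def deterministic_metrics(obs, pred, k):
    """NRMSE, NSE, SMAPE (%), AIC and BIC following Eqs. (14)-(19).
    obs, pred: observed and predicted inflows; k: number of predictors."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    n = len(obs)
    sse = np.sum((obs - pred) ** 2)                                  # Eq. (18)
    nrmse = np.sqrt(sse / n) / np.sqrt(np.mean(obs ** 2))            # Eq. (14)
    nse = 1.0 - sse / np.sum((obs - obs.mean()) ** 2)                # Eq. (15)
    smape = 100.0 / n * np.sum(np.abs(obs - pred) /
                               ((np.abs(obs) + np.abs(pred)) / 2.0)) # Eq. (16)
    aic = n * np.log(sse / n) + 2 * (k + 2)                          # Eq. (17), with T = n
    bic = n * np.log(sse / n) + (k + 2) * np.log(n)                  # Eq. (19)
    return {"NRMSE": nrmse, "NSE": nse, "SMAPE": smape, "AIC": aic, "BIC": bic}

# Example with hypothetical observed/predicted monthly inflows and 3 predictors.
print(deterministic_metrics([20, 35, 50, 80], [22, 30, 55, 70], k=3))
```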

3.6.2 Probabilistic metrics

In addition to traditional evaluation methods, specialized probabilistic metrics are necessary for models like BNN. These metrics evaluate the entire predictive distribution, capturing uncertainty and offering a more thorough understanding of the model’s performance. This ensures that the model is not only accurate but also well-calibrated for making reliable probabilistic forecasts (Hersbach 2000).

3.6.2.1 Continuous Ranked Probability Score (CRPS)

The CRPS is one of the practical scores used to assess the accuracy of probabilistic forecasts. For a probabilistic forecast F(x), which can be the cumulative distribution function (CDF) of the forecasted quantity, and an observed value Qobserved, the CRPS is given by Eq. (20):

$$CRPS\left(F, Q_{observed}\right) = \int_{-\infty}^{\infty} \left(F(Q) - 1\left(Q > Q_{observed}\right)\right)^{2} dQ$$
(20)

where:

  • F(Q) is the CDF of the forecasted distribution,

  • \(1\left(Q>{Q}_{observed}\right)\) is the indicator function, which is 1 if \((Q>{Q}_{observed})\) and 0 otherwise (Zamo and Naveau 2018; Pic et al. 2022; Taillardat et al. 2023).

In the case of a Gaussian (normal) distribution with mean (\(\mu\)) and standard deviation (\(\sigma\)), as implemented in the present research, the CRPS can be computed using the analytic expression presented in Eq. (21):

$$CRPS\left(F_{\mu,\sigma}, Q_{observed}\right) = \sigma\left[z\left(2\phi(z) - 1\right) + 2\varphi(z) - \frac{1}{\sqrt{\pi}}\right]$$
(21)

where:

  • \(z=\frac{{Q}_{observed}- \mu }{\sigma }\),

  • \(\phi \left( z \right)\) is the cumulative distribution function (CDF) of the Gaussian distribution,

  • \(\varphi \left(z\right)\) is the probability density function (PDF) of the Gaussian distribution (Gneiting and Raftery 2007; Garg et al 2022).

CRPS varies from 0 to \(+\infty\), and its unit is the same as that of the target variable. The CRPS decreases when the predicted distribution is concentrated close to the actual observed value, implying more accurate and sharper predictions. Therefore, models producing tighter, more precise distributions around observations receive better CRPS scores. For ANN (which produces point estimates, not distributions), the standard deviation (\(\sigma\)) is set to 0, and in this case, the CRPS simply becomes the absolute error between the observed and predicted values.
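The analytic Gaussian CRPS of Eq. (21), together with its degenerate point-forecast case used for the ANN, can be computed as in the following sketch (the forecast values are hypothetical):

```python
import numpy as np
from scipy.stats import norm

def crps_gaussian(mu, sigma, obs):
    """Analytic CRPS of a N(mu, sigma^2) forecast for observation obs (Eq. 21).
    With sigma = 0 (point forecast, as for the ANN), CRPS reduces to |obs - mu|."""
    mu, sigma, obs = map(np.asarray, (mu, sigma, obs))
    if np.all(sigma == 0):
        return np.abs(obs - mu)
    z = (obs - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

# Hypothetical monthly forecasts (m3/s): the BNN gives (mu, sigma), the ANN gives points.
obs = np.array([30.0, 55.0, 120.0])
crps_bnn = crps_gaussian(np.array([28.0, 60.0, 100.0]),
                         np.array([4.0, 8.0, 25.0]), obs).mean()
crps_ann = crps_gaussian(np.array([27.0, 62.0, 95.0]), np.zeros(3), obs).mean()
print(crps_bnn, crps_ann)
```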

3.6.2.2 Continuous ranked probability skill score (CRPSS)

The CRPSS is a relative measure that compares the performance of a forecast against a reference forecast (Alfieri et al 2014). It is defined as:

$$CRPSS = 1 - \frac{{CRPS_{model} }}{{CRPS_{reference} }} = 1 - \frac{{CRPS_{test} }}{{CRPS_{train} }}$$
(22)

Using CRPSS to compare model performance during the train and test phases is practical because it evaluates relative improvement over a reference model (in this case, the model’s performance on training data). It shows how much skill the model retains when transitioning to new, unseen data (test phase). A higher CRPSS indicates better performance during testing relative to training, meaning the model generalizes well beyond its training data. A CRPSS value closer to 1 indicates a high skill of the model in the test phase relative to the training phase, while a negative value suggests poorer performance in the test phase compared to the training phase. This approach ensures that model evaluation considers both calibration and sharpness of the predictive distribution during training and testing.

3.6.2.3 Probability Integral Transform (PIT)

The Probability Integral Transform (PIT) is a technique used to evaluate the calibration of probabilistic forecasts. It transforms each observation through the predicted cumulative distribution function, yielding values on the interval [0,1]. For a predicted cumulative distribution function F, the PIT value for an observed outcome Q is calculated as:

$$PIT\left(Q\right)=F(Q)$$

A well-calibrated model will produce PIT values uniformly distributed between 0 and 1. In the context of BNNs, PIT plots are practical because they help visualize how well the model’s predicted distribution aligns with observed data. They assess the model’s predictive uncertainty, allowing for direct evaluation of its probabilistic predictions. By analyzing the distribution of PIT values, researchers can identify areas of overconfidence or underconfidence in predictions, aiding in model refinement (Laio and Tamea 2007; Wang and Robertson 2011).
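A short sketch of the PIT calculation for Gaussian BNN forecasts is given below; the values are hypothetical, and the Kolmogorov–Smirnov test is included as one way to quantify departure from uniformity, in line with the Kolmogorov bands used in Sect. 4.4.2:

```python
import numpy as np
from scipy.stats import norm, kstest

# Hypothetical BNN outputs: predicted mean/stdv per month, plus observations.
mu    = np.array([20.0, 35.0, 50.0, 80.0])
sigma = np.array([ 3.0,  6.0,  9.0, 20.0])
obs   = np.array([22.0, 30.0, 61.0, 70.0])

# PIT value for each observation: F(Q) evaluated at the observed inflow.
pit = norm.cdf(obs, loc=mu, scale=sigma)

# For a well-calibrated forecast, PIT values are ~Uniform(0, 1);
# the Kolmogorov-Smirnov test quantifies the departure from uniformity.
stat, p_value = kstest(pit, "uniform")
print(pit, stat, p_value)
```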

4 Results and discussion

The results section is organized to present the findings related to the selection of relevant factors, model training and testing, and inflow prediction for each of the three dams (Zayanderood, Amirkabir, and Karun 3).

4.1 Selection of the relevant factors and scenarios formation

As shown in Fig. 4, the MI index value varies with the time lag between the selected variables and the runoff entering the dam. First, the variables with the highest MI scores were selected; then, the optimal time delay for each of the 5 aforementioned variables was determined within the range of 1 to 12 months. Taking Fig. 4a as an example, the input variables inflow, precipitation, PNA, Jones NAO, and Nino 3.4 were selected for predicting the Zayanderood Dam inflow, and the suitable time lags for these variables were determined as 12, 3, 11, 5, and 1 months, respectively.

Fig. 4
figure 4

Correlation Matrix based on MI values for a Zayanderood Dam, b Amirkabir Dam, c Karun3 Dam

As demonstrated by Fig. 4b, the MI results revealed that the input variables inflow, precipitation, WP, Jones NAO, and Nino 3.4 exert the strongest influence on the Amirkabir Dam inflow, with optimal time delays of 12, 2, 6, 6, and 7 months, respectively. For the Karun 3 Dam, the strongest associations were observed with the inflow one month prior, precipitation two months prior, and the PDO, Jones NAO, and Nino 3.4 indices with delays of 3, 4, and 12 months, respectively.

4.2 Model training and testing

Table 5 shows an example of the process of determining the optimal parameters of the BNN model in one of the scenarios for inflow prediction at the Zayanderood Dam. As observed, the number of neurons and layers was optimized first (Run IDs 1 to 19), leading to the optimal solution of a single hidden layer of 6 neurons. As mentioned previously, the model with the simplest structure, i.e., fewer layers and neurons with minimal error, is selected. Next, the type of activation function is adjusted, with GELU chosen as the optimal option. The following steps involve dynamic adjustment of the remaining parameters by varying the default values in either direction in search of optimal solutions. The resulting optimal parameter values for the applied BNN model are shown in row 32 of Table 5. This process was carried out for all other forecasting scenarios and applied to both the ANN and BNN models in search of optimal parameter values. After running both the ANN and BNN models under the 31 forecasting scenarios (Sx stands for scenario number x) for each of the investigated zones, the results were evaluated using the AIC, BIC, NRMSE, SMAPE, and NSE metrics, which can be found in Tables 6, 7, and 8.

Table 5 Determination of the optimal value of BNN parameters (Zayanderood Dam – S17)
Table 6 Results of ANN and BNN models in Inflow prediction of Zayanderood Dam – Test phase
Table 7 Results of ANN and BNN models in Inflow Prediction of Amirkabir Dam – Test phase
Table 8 Results of ANN and BNN models in Inflow prediction of Karun 3 Dam – Test phase

4.3 Comparison of deterministic metrics for ANN and BNN

4.3.1 Zayanderood dam inflow prediction

In Table 6, the inclusion of additional teleconnection indices, specifically PNA and Jones NAO in S17 and S18, improves the inflow prediction performance for Zayanderood Dam. Compared to the simpler model in S6, the NRMSE for BNN decreases by 4.66 (S17) and 4.20 (S18), with a corresponding improvement in NSE of 0.09, indicating better accuracy. While scenario S31, with all variables included, yields the best performance (NRMSE = 10.93 and NSE = 0.75), the AIC and BIC metrics suggest that S17 offers a more optimal balance between model complexity and performance.

According to the test phase results, both the BNN and ANN models demonstrate reasonably strong performance for inflow values within the range of the 1st to 9th deciles, as evidenced by the metrics (Table 6) and Q-Q plots (Fig. 5a). However, BNN performs slightly better than ANN, especially in capturing the overall trend of the observed inflows, reflected in its lower NRMSE and higher NSE values. For extremely high inflows (10th decile), both models show a noticeable underestimation of the actual values. This deviation highlights the challenge both models face when predicting extreme events, with neither able to fully capture the magnitude of the highest inflows.

Fig. 5
figure 5

Q-Q Plots of ANN and BNN Predictions for Train and Test Datasets of: a Zayanderood dam, b Amirkabir dam, and c Karun 3 dam

4.3.2 Amirkabir dam inflow prediction

Based on the results in Table 7, S18 is the winning scenario for predicting the inflow to the Amirkabir Dam, with the variables being the 12-month-lagged inflow, 2-month-lagged precipitation, and 6-month-lagged WP index. However, the addition of other TP indices did not increase the accuracy of the model. Moreover, the comparison between S1 and S2-S5 shows that the inflow variable with a 12-month lag is the most influential variable. The scenarios in which the inflow variable is not used show a noticeable decrease in the model’s accuracy. The addition of the precipitation variable in S6 and of the WP index in S18 markedly increased the accuracy of the model. A comparison between S6 (Q(12), R(2)) and S18 (Q(12), R(2), and WP(6)) reveals that adding the WP index improved the NRMSE, SMAPE, and NSE values by 4.48%, 5.19%, and 11%, respectively.

The Q-Q plot (Fig. 5b) shows the performance of both BNN and ANN models in predicting inflows for the test dataset. The general trend indicates that both models perform relatively well in estimating inflows for the majority of the quantiles, particularly for values up to the 5th decile. For these ranges, the predictions are close to the reference line, indicating that both models adequately capture the observed values. However, as we move towards the higher quantiles, particularly from the 9th decile onwards, both models exhibit noticeable deviations from the observed values, underestimating the actual inflows. Specifically, most of the errors between the 5th and 9th deciles for ANN result from overestimations, while for BNN, the errors in this range stem mainly from underestimations. This behavior is more pronounced for extreme inflows, where the BNN model performs slightly better than ANN based on its smaller deviation. Now, in conjunction with Table 7 (test phase results), we observe that BNN generally outperforms ANN, particularly in scenarios with multivariate inputs. Scenario S18, which includes lagged inflows, precipitation, and the WP index (Q(12), R(2), WP(6)), shows that BNN achieves a significantly lower NRMSE (28.85) and better NSE (0.82) compared to ANN (NRMSE: 29.17, NSE: 0.77), demonstrating a better fit for BNN. This aligns with the Q-Q plot, where BNN shows a closer match to observed values for most quantiles.

4.3.3 Karun 3 dam inflow prediction

According to the findings presented in Table 8, scenario S27 shows the best performance across all evaluation indices. Using a combination of four variables (historical inflow, precipitation, PDO index, and Nino 3.4 index) as predictors results in the highest accuracy for both the ANN and BNN models in inflow estimation. The combination of station-based variables such as inflow and precipitation, together with TPs like PDO and Nino 3.4, contributes to enhanced flow prediction accuracy. S18 ranks second in terms of prediction accuracy, differing from S27 only by the absence of the Nino 3.4 variable. The addition of Nino 3.4 improves the values of NRMSE, SMAPE, and NSE by 2.42%, 5.45%, and 6%, respectively, further demonstrating its significance in inflow prediction for Karun 3.

Regarding the AIC and BIC metrics, there is no particularly unusual or distinct pattern that stands out. Both models (ANN and BNN) show general consistency in terms of these metrics, with slightly lower values observed in scenarios that involve multivariate inputs (e.g., S27 and S18) compared to univariate cases. This aligns with the idea that incorporating more relevant variables can improve the models’ fit without over-penalizing for model complexity, as reflected in these criteria.

In the training phase, up to the 9th decile, both models (ANN and BNN) perform reasonably well. However, beyond the 9th decile, both models tend to underestimate extreme values. In the test phase, while ANN shows good performance in some extreme cases, its error is higher than that of BNN between the 5th and 9th deciles. Overall, based on both the Q-Q plot (Fig. 5c) and the deterministic metrics in Table 8, the performance of the two models in inflow prediction for the Karun 3 Dam is remarkably similar.

4.4 Probabilistic evaluation of BNN model performance

4.4.1 Confidence intervals and model uncertainty

Figure 6a shows the BNN model’s predicted inflow values (mean of the prediction distribution) for the Zayanderood Dam, along with the uncertainty range (shaded in purple), which encompasses the lower and upper prediction bounds. The predicted values (blue line) generally follow the actual inflows (red line), capturing the trend and most of the peaks and troughs. Importantly, the uncertainty intervals expand during periods of high inflow, reflecting increased prediction uncertainty in those months. Notably, the actual inflows consistently fall within the uncertainty bands, including for extreme inflows. This indicates that the BNN model provides well-calibrated predictions and effectively accounts for uncertainty across the range of inflow values.

Fig. 6.
figure 6

95% Confidence Interval of the best inflow prediction scenarios for a Zayanderood Dam, b Amirkabir Dam, c Karun 3 Dam – Test dataset

Regarding Fig. 6b, the BNN model demonstrates a strong ability to estimate inflow to the Amirkabir Dam. Notably, the lower bound of uncertainty calculated for most months is zero, suggesting that the potential for the river supplying the Amirkabir Dam to experience dry conditions is a legitimate concern.

The comparison of actual and forecasted values by the BNN model in Fig. 6c shows that, despite the acceptable performance of the BNN model in forecasting the minimum and average discharge, the model tends to underpredict peak discharge. Although the computed interval of uncertainty is broad, the mean of the forecast probability distribution aligns closely with the observed data.

4.4.2 PIT plots

As shown in Fig. 7a and b, relating to the training and test phases respectively, the PIT plots of the Zayanderood dam inflow indicate that the BNN model performs reasonably well in the 1st, 2nd, and 4th quartiles. While there are some deviations from the reference line, the predictions remain within the upper and lower Kolmogorov bands, which are the bounds defined by the Kolmogorov–Smirnov statistic at a 5% significance level, delineating the area where the predicted cumulative distribution function (CDF) is expected to lie if it closely follows the observed CDF. This indicates that the predicted CDF closely follows the observed CDF in these quartiles. However, the model underestimates the inflow values in the 3rd quartile, which ranges from 26.64 to 55.20 m3/s according to the box plot (Fig. 7c).

Fig. 7
figure 7

PIT Plots of the best inflow prediction Scenario for Zayanderood Dam: a Training Phase, b Testing Phase, and c Box Plot of the Observed Values

Figure 8a and b illustrate the PIT plots for the Amirkabir dam inflow during the training and test phases, revealing that the BNN model demonstrates solid performance overall, particularly in the 2nd quartile where the predictions align well with the observed values. While the 1st quartile exhibits an overestimation error, the model’s predictions remain reasonably close to the Kolmogorov bands, indicating satisfactory adherence to the observed cumulative distribution function (CDF). In the 3rd and 4th quartiles, the model experiences underestimation errors; however, these values are still relatively close to the Kolmogorov bands. The box plot in Fig. 8c reveals a significant presence of outliers in the 4th quartile, with actual inflow values ranging from 16.23 to 70 m3/s. Despite these outliers, the overall performance of the BNN model remains robust, reflecting its capability to capture the inflow dynamics with reasonable accuracy.

Fig. 8 PIT Plots of the best inflow prediction Scenario for Amirkabir Dam: a Training Phase, b Testing Phase, and c Box Plot of the Observed Values

The PIT plots for the Karun 3 dam inflow, depicted in Fig. 9a and b for the training and test phases, demonstrate that the BNN model exhibits excellent performance in the higher quartiles (3rd and 4th), where the predicted cumulative distribution function (CDF) closely aligns with the actual CDF. In the 2nd quartile, while the model’s predictions fall within the Kolmogorov bands, some discrepancies indicate a tendency to overestimate the inflow values. Conversely, the 1st quartile shows a significant deviation in model performance, primarily attributed to the presence of outliers, specifically extremely low inflow values, as illustrated in the accompanying box plot (Fig. 9c). Despite these challenges in the lower quartiles, the overall performance of the BNN model remains commendable, effectively capturing the inflow dynamics in the higher quartiles.

Fig. 9 PIT Plots of the best inflow prediction Scenario for Karun 3 Dam: a Training Phase, b Testing Phase, and c Box Plot of the Observed Values

4.4.3 CRPS and CRPSS

The CRPS values provide insight into the performance of both the ANN and BNN models in predicting inflows to the Zayanderood, Amirkabir, and Karun 3 dams. For the Zayanderood Dam, the CRPS values for the training and testing phases of both models, presented in Fig. 10a, are all below 15 m3/s, reflecting a high level of predictive accuracy. Among these, the BNN test has the lowest value, followed by the BNN train, indicating that the BNN provides the most accurate probabilistic predictions for this dam’s inflow. The ANN’s test and train results show slightly higher CRPS values, reinforcing the BNN’s superior performance. The overall consistency across training and testing phases for both models indicates robust model behavior, although the BNN clearly outperforms the ANN.

Fig. 10 Comparison of ANN and BNN Models Using the a CRPS and b CRPSS Values for Best Inflow Prediction Scenarios Across Studied Dams

For Amirkabir Dam, the ranking from lowest to highest CRPS values is: BNN test (1.77 m3/s), BNN train (2.27 m3/s), ANN train (3.01 m3/s), and ANN test (3.17 m3/s). The close proximity of these scores shows that both models perform similarly, but the BNN maintains a slight advantage. The relatively higher CRPS of the ANN indicates a marginal decrease in performance compared to the BNN, further emphasizing the latter’s strength in capturing the probabilistic uncertainty associated with Amirkabir’s inflows.

The CRPS values for Karun 3 Dam range between 30 m3/s and over 60 m3/s, reflecting greater uncertainty and error in predictions compared to the other two dams. The ranking remains the same, with BNN test performing best, followed by BNN train, ANN train, and ANN test. These higher CRPS values suggest that inflow predictions for Karun 3 are more challenging, likely due to more extreme inflow variations, yet BNN still shows better overall performance. Overall, the CRPS analysis highlights the superior performance of the BNN model across all three dams, particularly in the test phase where BNN consistently outperforms ANN in terms of lower CRPS values.
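
For clarity on how such scores might be computed from a probabilistic forecast, the sketch below uses the standard sample-based CRPS estimator, CRPS = E|X − y| − 0.5·E|X − X′|; the function and array names are illustrative, and the study may have evaluated CRPS with a different estimator or a closed-form expression.

```python
import numpy as np

def crps_ensemble(samples, obs):
    """Sample-based CRPS for one forecast-observation pair:
    CRPS = E|X - y| - 0.5 * E|X - X'|, with X and X' drawn from the
    predictive distribution.  Units follow the inflow (m3/s); lower is better.
    """
    x = np.asarray(samples, dtype=float)
    term_obs = np.abs(x - obs).mean()
    term_spread = 0.5 * np.abs(x[:, None] - x[None, :]).mean()
    return term_obs - term_spread

def phase_crps(sample_matrix, observations):
    """Average CRPS over all months in a phase (training or testing).
    sample_matrix : (n_months, n_draws); observations : (n_months,)."""
    return float(np.mean([crps_ensemble(s, o)
                          for s, o in zip(sample_matrix, observations)]))
```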

The CRPSS results across the three dams reveal the models’ ability to generalize and predict unseen data accurately (Fig. 10b). For the Zayanderood Dam, the positive CRPSS of the BNN (0.1493) indicates that the model’s predictive skill improves relative to the training phase, meaning the BNN performs well on unseen data. In contrast, the negative CRPSS of the ANN (-0.0855) shows that the ANN performs worse on unseen data than during training, indicating a decline in predictive skill. For Amirkabir Dam, the BNN model again shows a positive CRPSS (0.2177), reflecting its ability to predict unseen data more accurately than during training. The slightly negative CRPSS of the ANN (-0.0550) points to marginally poorer performance on unseen data compared to its training phase. For Karun 3 Dam, both models exhibit positive CRPSS values, with the BNN (0.2931) scoring higher than the ANN (0.1763). This indicates that both models improve in predicting unseen data in the test phase, but the BNN shows the greater enhancement in skill.
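
A brief sketch of the skill score itself, assuming (as the discussion above suggests, though not stated explicitly) that the training-phase CRPS serves as the reference forecast:

```python
def crpss(crps_model, crps_reference):
    """Continuous Ranked Probability Skill Score:
    CRPSS = 1 - CRPS_model / CRPS_reference.
    Positive values mean the forecast beats the reference; negative
    values mean it is less skilful."""
    return 1.0 - crps_model / crps_reference

# Illustrative check under that assumption, using the Amirkabir values
# quoted above: crpss(1.77, 2.27) is about 0.22, close to the reported
# 0.2177 once rounding of the CRPS values is accounted for.
```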

5 Conclusion

River flow results from complex interactions between atmospheric and hydrological processes, and its forecasts provide a platform for more efficient water resources management and for averting risks associated with floods and droughts. Given the interconnection between climatic-oceanic phenomena and hydrological processes at different scales, both historical local variables and atmospheric TP indices were considered for reservoir inflow forecasting at three major dam reservoirs in Iran. To this end, both ANN and BNN models were developed under scenarios based on the aforementioned variables, and their performances were evaluated. The most influential input variables and their optimal lag times were selected using the MI index, which quantifies both linear and non-linear dependence between the predictor and target variables. The main outcomes of the study can be summarized as follows:

  1.

    By comparing the results obtained from the ANN and BNN models, it was determined that the BNN model predictions are generally closer to the observed inflow values. The Bayesian approach enhances the network’s accuracy by adjusting the model to account for uncertainty, particularly evident in probabilistic evaluations such as the PIT plots. These plots indicate that the BNN performs well across most quartiles, effectively capturing the dynamics of the inflow. In terms of deterministic metrics, the ANN model tends to overestimate lower inflow values while underestimating higher ones. Conversely, the BNN model shows better accuracy for lower inflows but underestimates the higher inflows. Additionally, the CRPS analysis reinforces this assessment, indicating that the BNN consistently outperforms the ANN, particularly in the test phase where BNN demonstrates the best performance in CRPS, confirming its effectiveness as a reliable forecasting tool.

  2.

    Overall, the findings from the MI index suggest that there are correlations between TPs and dam inflows. The most significant indices influencing monthly inflows to the Zayanderood dam are Nino3.4, PNA, and Jones NAO, with lags of one, eleven, and five months, respectively. For the Amirkabir dam, the key indices are Nino3.4, Jones NAO, and WP, with lags of seven, six, and six months, respectively. For the Karun 3 dam, the strongest correlations between inflow and teleconnection indices are found for Nino3.4, Jones NAO, and PDO, with lags of twelve, four, and three months, respectively. It can therefore be concluded that Nino3.4 and Jones NAO are more influential than the other indices for the inflows of the various dams studied in Iran.

  3.

    Based on the results, atmospheric TPs have potential as predictive variables for dam inflow, but identifying the effective patterns, as well as the timing of their influence, in different regions requires dedicated examination and analysis (a minimal lag-screening sketch follows this list). Incorporating these patterns enhanced the precision of both the ANN and BNN predictive models.
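
As an illustration of how such lag-specific dependencies could be screened, the following sketch estimates the mutual information between a lagged teleconnection index and monthly inflow with scikit-learn; the function name, lag range, and the nearest-neighbour MI estimator are assumptions, not the exact procedure of this study.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def best_lag_by_mi(index_series, inflow_series, max_lag=12):
    """Pick the lag (in months) at which a teleconnection index shares
    the most mutual information with dam inflow.

    index_series, inflow_series : 1-D monthly arrays of equal length
    (illustrative names; the study's own preprocessing may differ).
    """
    scores = {}
    for lag in range(1, max_lag + 1):
        x = np.asarray(index_series)[:-lag].reshape(-1, 1)  # index value lag months earlier
        y = np.asarray(inflow_series)[lag:]                 # inflow in the target month
        scores[lag] = mutual_info_regression(x, y, random_state=0)[0]
    best = max(scores, key=scores.get)
    return best, scores
```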

The methodology adopted in this study is innovative in that it combines large-scale weather and climate information with local hydrological processes to assess the interactions between atmospheric and water phenomena. Moreover, the approach is not limited to the regions analyzed here but is readily generalizable. Future studies should identify TPs that are effective in other parts of the world and employ them in hydrological forecasts. Furthermore, applying optimization techniques during parameter estimation for the models used in this study (ANN and BNN) has the potential to improve forecast accuracy; it is therefore proposed to employ other ML models and optimization algorithms in subsequent investigations.

To select predictors, other approaches such as Stepwise Forward Regression (Babaei et al. 2019) or Backward Elimination (Kim et al. 2019) have been employed in similar studies. These approaches rank the input variables by their correlation with the target variable and add or remove one variable at a time between subsequent scenarios, which expands the set of input variables. The present study, by contrast, sought to minimize uncertainty stemming from measurement limitations and model complexity by limiting the number of input variables used. Hence, it is suggested that future studies apply other variable selection techniques and scenario development approaches and compare the results with those of the current work.

Although BNNs offer many advantages over other ML models, they face limitations in uncertainty estimation and computational efficiency. While they provide uncertainty estimates through techniques such as dropout, the quality of these estimates depends on the approximate inference method used. This often results in a trade-off between uncertainty quality and computational complexity, hindering their use in large-scale models. Additionally, reliance on methods such as variational inference can lead to less reliable predictions because the true posterior distribution may not be captured accurately (Gal and Ghahramani 2016; Blei et al. 2017). Addressing these challenges requires further research into improved techniques and solutions.