Highlights

  • The study applies DL models for streamflow forecasting on an hourly time scale.

  • Six DL models, LSTM, GRU, CNN, CNN-LSTM, CNN-GRU, and CNN-RNN are compared.

  • The study incorporates a methodology for uncertainty assessment.

  • The GRU model demonstrated more robust performance.

  • The predictions for longer lead times are less accurate in smaller basins.

Introduction

Accurate prediction of streamflow in rivers is essential for water resources management. It can help to increase the efficiency of hydroelectric power plants, to plan more efficient irrigation systems, to determine the operating schedule of reservoirs and to manage flood early warning systems, among other applications (Hao and Bai 2023; Farfán et al. 2020; Bermúdez et al. 2021). However, making predictions is particularly challenging due to the non-linear nature of streamflow time series (Zhao et al. 2021; Van et al. 2023). This complexity arises from a multitude of factors such as changing weather patterns, the physical characteristics of the basins, variations in terrain and soil types, and human-induced activities (Bermúdez et al. 2020; Dwarakish and Ganasri 2015; Goudie 2023). Such intricacies make streamflow prediction an ongoing challenge in hydrology, requiring continuous refinement and adaptation of the existing methodologies to increase their precision and reliability.

The most widespread practice for streamflow predictions is the application of hydrological models, which are mathematical tools designed to simulate the response of a watershed to precipitation and other climatological variables (Couckuyt et al. 2009; Farfán and Cea 2021). Hydrological models can be classified into two main types: physically-based and data-driven models (Wang et al. 2017; Farfán et al. 2020). Physically-based models implement knowledge-based conceptual components, which provide significant hydrological insights (Huo et al. 2016; Farfán and Cea 2021). On the other hand, Deep Learning (DL) algorithms, as part of the broader category of data-driven methods, are based on the statistical relationship between hydrological variables (Yassin et al. 2021; Noori and Kalin 2016). These techniques are capable of capturing patterns in extensive data sets, relying solely on numerical input-output relationships (Yassin et al. 2021; Noori and Kalin 2016).

In recent years, DL techniques have emerged as powerful tools for forecasting streamflow, providing a valuable alternative to traditional hydrological models (Zhao et al. 2019; Van et al. 2023; Farfán et al. 2020; López-Chacón et al. 2023). Forecasting can be performed using Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs). RNNs are a type of DL architecture designed to handle sequential data by maintaining a memory of previous inputs in their internal state, enabling them to capture temporal dynamics and dependencies (Goodfellow et al. 2016; Nath et al. 2021). Among the most applied RNN techniques in river flow forecasting are the Long Short-Term Memory (LSTM) and the Gated Recurrent Units (GRU). On the other hand, CNNs are effective in the identification of features in a data set for prediction tasks without directly learning from lagged observations (Ismail Fawaz et al. 2019; Indolia et al. 2018).

LSTM networks are an extension of recurrent neural networks that were designed to overcome the vanishing gradient problem, thereby enabling the learning of long-term dependencies (Hochreiter and Schmidhuber 1997; Hochreiter 1998; Goodfellow et al. 2016). Liu et al. (2020) employed a deep neural network incorporating Empirical Mode Decomposition (EMD) and Encoder-Decoder LSTM (En-De-LSTM) to predict river flow at the Hankou station in the Yangtze River, showcasing reliability in both long-term and catastrophic flood predictions. The model uses monthly streamflow data as input; other hydrological variables, including rainfall, temperature, and evapotranspiration, were not included. The work by Le et al. (2021) in the Red River basin of Northern Vietnam evaluated six DL models for streamflow forecasting. The input data of the DL models were observed rainfall from the hydrological stations located in the study area and discharge data from the gauge station in the study area. The findings suggest that LSTM-based models outperformed FFNN and CNN, but the complexity of StackedLSTM and BiLSTM models did not enhance performance over standard LSTM and GRU models. Consequently, they recommended simpler LSTM and GRU models to produce reliable forecasts with reduced computation time.

GRUs, a simplified variant of LSTM, retain the capability to model temporal dependencies but with fewer parameters (Cho et al. 2014; Goodfellow et al. 2016; Nath et al. 2021). Recent research has further expanded the understanding of GRUs in streamflow forecasting. Wang et al. (2022) examined the performance of GRU for one to three-day-ahead predictions across seven Chinese basins, emphasizing its robustness and accuracy, particularly in one-day-ahead forecasts. The model inputs were streamflow and rainfall data in a single time step. In another approach, Zhao et al. (2021) coupled the GRU with an improved grey wolf optimizer (IGWO) to form the hybrid IGWO-GRU model for forecasting in the Shangjingyou and the Fenhe River. This monthly IGWO-GRU model demonstrated a strong mapping ability and outperformed traditional models, achieving an average qualification rate of 91.66% across two stations. The input data were monthly streamflow data at the Shangjingyou station and the Fenhe station.

CNNs, adept at processing grid-like data topology (such as images), have also been applied successfully to time-series data, using convolutions to extract significant features across time sequences (Goodfellow et al. 2016; Van et al. 2020, 2023). Shu et al. (2022) applied CNNs to focus on multi-step-ahead monthly streamflow forecasting using two novel models, DirCNN and DRCNN. These models apply CNNs to automatically extract input variables and predict streamflow for multiple lead times. The models were tested on several reservoirs in China and demonstrated superior performance compared to traditional ANNs, particularly in maintaining forecasting accuracy over extended lead times. The model was developed on a monthly time scale and used different Atmospheric Circulation Indexes (ACIs), precipitation, and streamflow as input candidates. Similarly, another study by Van et al. (2020) employs a CNN for rainfall-runoff modeling. The model was evaluated using daily measured rainfall and discharge data from hydro-meteorological stations in the Vietnamese Mekong Delta and showed slightly better performance than LSTM models. Both studies highlight the potential of CNNs in capturing complex interactions in hydrological systems, offering a time-efficient and robust alternative to traditional and recurrent neural network models. Additionally, some studies combine CNN with LSTM or GRU layers for improved performance (Vatanchi et al. 2023; Dehghani et al. 2023; Ghimire et al. 2021; Sit et al. 2021; Khorram and Jehbez 2023; Kashem et al. 2024; Ekwueme 2024).

Fig. 1 Geographical representation of (a) Groba catchment and (b) Anllóns catchment, showing the locations of stream gauges and meteorological stations

Despite the efforts of previous studies and the significant progress made in the application of DL techniques to streamflow forecasting, there remains a significant gap in the understanding and application of these methods in different hydrological contexts. Most of the cited studies focus on daily or monthly forecasting time steps and use different hydrological inputs. Hourly forecasting, on the other hand, appears to have received little attention. This gap is particularly relevant in smaller basins, whose shorter response times make hourly forecasts essential for the development of early warning systems (Sopelana et al. 2018; Fraga et al. 2020).

Another significant limitation observed in many Machine Learning (ML) and DL studies is the omission of uncertainty estimations (Tripathy and Mishra 2023; Klotz et al. 2022). Uncertainty is inherent in hydrological modeling, encompassing aspects ranging from data inaccuracies to model structural uncertainties (Beven and Binley 1992; Beven 2016; Klotz et al. 2022). Ignoring such uncertainty might lead to overconfident predictions, which, in a hydrological context, can have severe repercussions such as inefficient water resource management (Althoff et al. 2021; Beven 2016).

The present work aims to explore and evaluate the effectiveness of various deep learning (DL) models for predicting hourly streamflow. The models under investigation include Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Convolutional Neural Network (CNN), and hybrid models such as CNN-LSTM, CNN-GRU, and CNN-Recurrent Neural Network (RNN). The study focuses on two basins of different sizes located in Northwest Spain. The DL models were applied at different lead times, and their performance was evaluated across the observed streamflow series of 10 years. An in-depth analysis was conducted to provide insight into the influence of lead times on the accuracy of peak flow predictions. To enhance the robustness and reliability of the predictions, this research incorporates the Monte Carlo Dropout (MCD) technique to assess the intrinsic uncertainties within the applied DL models. Additionally, evaluations were conducted with future rainfall data to determine how it affects the outcomes, offering further insights into the models’ capabilities under varying hydrological conditions.

Study area and data

The present work was carried out in the rivers Groba and Anllóns, located in Northwest Spain.

The Groba River Basin (Fig. 1) spans an area of 17.05 km\(^2\). The total length of the drainage network is 24.89 km. The basin exhibits a significant variation in elevation, ranging from a minimum of 35 meters above sea level (m.a.s.l.) to a maximum of 627 m.a.s.l. This variation reflects a complex topography that encompasses both low-lying areas and elevated terrains. The basin is characterized by its steep terrain, with an average slope of 27.4 %. This steepness in the orography of the hillslopes, combined with its proximity to the coastline, increases the spatial variability of precipitation within the basin, a phenomenon that has been substantiated in prior hydrological studies (Liang and Melching 2015; Cabalar Fuentes 2005; Cea and Fraga 2018).

The Anllóns River Basin covers an extensive area of 438 km\(^2\). The total length of its drainage network is 447 km. The basin has a varied topography, with elevations ranging from 59 m.a.s.l. to 473 m.a.s.l. The average slope is 12.1 %, which is less steep than that of the Groba basin. This moderate slope is indicative of a relatively less complex topographical profile than its counterpart, and it may result in differing hydrological behaviors.

The observed precipitation and streamflow data used in this study were obtained from the agencies MeteoGalicia and Augas de Galicia, respectively. These data, available at a time resolution of 10 minutes, are pre-processed and subjected to filtering procedures by the agencies to ensure reliability before being made publicly available. The retrieved historical precipitation and streamflow data span from December 2008 to September 2018, covering a period of approximately 10 years.

There are nine meteorological stations located around the Groba basin and 16 around the Anllóns basin. The registered rainfall series were spatially interpolated with a resolution of 1 km, using a natural neighbor interpolation, and then averaged over the whole catchments to obtain a basin-averaged time series (Bermúdez et al. 2021; Farfán and Cea 2023). Finally, both the interpolated precipitation series and the gauged streamflow series were aggregated on an hourly scale to align with the objectives of the present study.

Conceptual framework

Recurrent neural networks

RNNs are a specific type of Artificial Neural Networks designed to handle sequential data, capturing dependencies over time or sequence steps. This ability to process sequences makes RNNs suitable for applications like time-series forecasting, among others (Goodfellow et al. 2016).

The typical structure of a RNN includes input, hidden, and output layers. However, unlike standard feedforward neural networks, RNNs contain connections that loop back on themselves, allowing information to persist across time steps. These recurrent connections enable the network to remember previous inputs in the sequence (Goodfellow et al. 2016).

The connections are performed by means of different weight matrices that allow the calculation of different states using the following equation:

$$\begin{aligned} \varvec{h}_t= & \sigma (\textbf{W} \cdot \varvec{h}_{t-1} + \textbf{U} \cdot \varvec{x}_t + \varvec{b}_h) \end{aligned}$$
(1)
$$\begin{aligned} \varvec{o}_t= & \phi (\textbf{V} \cdot \varvec{h}_t + \varvec{b}_y) \end{aligned}$$
(2)

Here, \(\varvec{h}_t\) denotes the hidden state. The term \(\textbf{W} \cdot \varvec{h}_{t-1} \) captures the influence of the previous hidden state \(\varvec{h}_{t-1}\) on the current state. The transformation of the current input \(\varvec{x}_t\) at time t is denoted by \(\textbf{U} \cdot \varvec{x}_t \), and \(\varvec{b}_h\) is the bias term for the hidden layer. By integrating the previous hidden state with the current input, the network can capture temporal dependencies across sequential steps. Additionally, \(\sigma \) represents an activation function, such as the hyperbolic tangent (Goodfellow et al. 2016; LeCun et al. 2015). Once the hidden state \(\varvec{h}_t\) is computed, it is used to determine the output at time t, denoted as \(\varvec{o}_t\), where \(\phi \) is an activation function. The transformation of the hidden state at time t into the output is represented by \( \textbf{V} \cdot \varvec{h}_t \). The bias term for the output layer is denoted by \( \varvec{b}_y \).

In summary, matrix \(\textbf{U}\) defines the input-to-hidden connections, which link the input to the hidden layer, \(\textbf{W}\) denotes the hidden-to-hidden connections, which allow information to persist from one step to the next, and \(\textbf{V}\) corresponds to the hidden-to-output connections, which map the hidden layer to the final output.
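As an illustration of Eqs. (1) and (2), the following minimal sketch runs a recurrent cell over a short sequence. It assumes \(\sigma = \tanh\) and an identity output activation \(\phi\); the dimensions and weight values are arbitrary toy numbers chosen for the example, not the settings used in this study.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_step(x_t, h_prev, U, W, V, b_h, b_y):
    """One recurrent step: Eq. (1) with sigma = tanh, Eq. (2) with phi = identity."""
    pre = [w + u + b for w, u, b in zip(matvec(W, h_prev), matvec(U, x_t), b_h)]
    h_t = [math.tanh(p) for p in pre]
    o_t = [v + b for v, b in zip(matvec(V, h_t), b_y)]
    return h_t, o_t

# Toy setup (assumed): 2 inputs per time step, 3 hidden units, 1 output.
U = [[0.1, -0.2], [0.3, 0.1], [-0.1, 0.2]]                       # input-to-hidden
W = [[0.2, 0.0, -0.1], [0.0, 0.1, 0.0], [0.1, -0.2, 0.3]]        # hidden-to-hidden
V = [[0.5, -0.3, 0.2]]                                           # hidden-to-output
b_h = [0.0, 0.0, 0.0]
b_y = [0.1]

h = [0.0, 0.0, 0.0]          # initial hidden state
outputs = []
for x in [[0.2, 0.5], [0.0, 0.6], [0.4, 0.7]]:
    h, o = rnn_step(x, h, U, W, V, b_h, b_y)   # h carries memory to the next step
    outputs.append(o[0])
```

The same weight matrices are reused at every time step; only the hidden state changes as the sequence is processed.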

Long short-term memory neural networks

LSTM networks are a type of RNN specifically designed to overcome the vanishing and exploding gradient problems faced by standard RNNs (Hochreiter and Schmidhuber 1997; Goodfellow et al. 2016). LSTMs are capable of learning and retaining long-term dependencies in sequence data, surpassing standard RNNs in performance and robustness (Goodfellow et al. 2016).

The architecture of an LSTM network includes three primary gates, namely the forget gate, the input gate, and the output gate, and two state variables, the cell state \(\varvec{c}_t\) and the hidden state \(\varvec{h}_t\). This unique structure differentiates it from the standard RNN architecture.

The key equations governing the operation of an LSTM network are as follows:

$$\begin{aligned} \varvec{f}_t&= \sigma (\textbf{U}_{f} \cdot \varvec{x}_t + \textbf{W}_{f} \cdot \varvec{h}_{t-1} + \varvec{b}_f)\end{aligned}$$
(3)
$$\begin{aligned} \varvec{i}_t&= \sigma (\textbf{U}_{i} \cdot \varvec{x}_t + \textbf{W}_{i} \cdot \varvec{h}_{t-1} + \varvec{b}_i)\end{aligned}$$
(4)
$$\begin{aligned} \varvec{o}_t&= \sigma (\textbf{U}_{o} \cdot \varvec{x}_t + \textbf{W}_{o} \cdot \varvec{h}_{t-1} + \varvec{b}_o)\end{aligned}$$
(5)
$$\begin{aligned} \varvec{c}_t&= \varvec{f}_t \otimes \varvec{c}_{t-1} + \varvec{i}_t \otimes \tanh (\textbf{U}_{c} \cdot \varvec{x}_t + \textbf{W}_{c} \cdot \varvec{h}_{t-1} + \varvec{b}_c)\end{aligned}$$
(6)
$$\begin{aligned} \varvec{h}_t&= \varvec{o}_t \otimes \tanh (\varvec{c}_t) \end{aligned}$$
(7)

where, \(\varvec{f}_t\), \(\varvec{i}_t\), and \(\varvec{o}_t\) are the forget, input, and output gates, respectively. \(\sigma \) is a sigmoid activation function. The weight matrices (\(\textbf{U}\) for the input connections and \(\textbf{W}\) for the hidden connections) and bias terms (\(\varvec{b}\)) are parameters that are learned during the training process.
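Eqs. (3) to (7) can be traced step by step with the following sketch. It is a plain-Python illustration under assumed toy dimensions (2 inputs, 2 hidden units) and arbitrary weights; the `params` dictionary layout is a convenience introduced here, not part of the original formulation.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def affine(U, W, b, x, h):
    """Compute U·x + W·h + b for matrices given as lists of rows."""
    return [sum(u * xi for u, xi in zip(u_row, x))
            + sum(w * hi for w, hi in zip(w_row, h)) + bi
            for u_row, w_row, bi in zip(U, W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; params maps 'f', 'i', 'o', 'c' to a (U, W, b) tuple."""
    f_t = [sigmoid(a) for a in affine(*params["f"], x_t, h_prev)]    # Eq. (3)
    i_t = [sigmoid(a) for a in affine(*params["i"], x_t, h_prev)]    # Eq. (4)
    o_t = [sigmoid(a) for a in affine(*params["o"], x_t, h_prev)]    # Eq. (5)
    cand = [math.tanh(a) for a in affine(*params["c"], x_t, h_prev)]
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, cand)]  # Eq. (6)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                   # Eq. (7)
    return h_t, c_t

# Toy parameters (assumed); for brevity every gate shares the same values.
U = [[0.5, -0.3], [0.2, 0.4]]
W = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]
params = {gate: (U, W, b) for gate in ("f", "i", "o", "c")}

h, c = [0.0, 0.0], [0.0, 0.0]
for x_t in [[0.2, 0.5], [0.0, 0.6], [0.4, 0.7]]:
    h, c = lstm_step(x_t, h, c, params)
```

The cell state is updated additively in Eq. (6), which is what allows information to be carried across many time steps when the forget gate stays close to one.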

Fig. 2 Schematic representation of a CNN-1D. The architecture consists of convolutional layers for feature extraction, followed by pooling layers for dimensionality reduction. The final “Processing Layer*” can be of various types including fully connected, RNN, LSTM, or GRU, depending on the specific requirements of the model

Gated recurrent units

GRUs are another type of RNN, introduced by Cho et al. (2014). These units address some of the shortcomings of standard RNNs by incorporating gating mechanisms to effectively manage information in long temporal sequences. Unlike standard RNNs, GRUs introduce two gates: the update gate \(\varvec{z}_t\) and the reset gate \(\varvec{r}_t\), controlling how information flows through the network (Goodfellow et al. 2016; Nath et al. 2021; Wang et al. 2022).

The equations composing the GRU architecture are as follows:

$$\begin{aligned} \varvec{z}_t&= \sigma (\mathbf {U_z} \cdot \varvec{x}_t + \mathbf {W_z} \cdot \varvec{h}_{t-1} + \varvec{b}_z)\end{aligned}$$
(8)
$$\begin{aligned} \varvec{r}_t&= \sigma (\mathbf {U_r} \cdot \varvec{x}_t + \mathbf {W_r} \cdot \varvec{h}_{t-1} + \varvec{b}_r)\end{aligned}$$
(9)
$$\begin{aligned} \tilde{\varvec{h}}_t&= \tanh (\textbf{U} \cdot \varvec{x}_t + \varvec{r}_t \otimes \textbf{W} \cdot \varvec{h}_{t-1} + \varvec{b}_h)\end{aligned}$$
(10)
$$\begin{aligned} \varvec{h}_t&= (1 - \varvec{z}_t) \otimes \varvec{h}_{t-1} + \varvec{z}_t \otimes \tilde{\varvec{h}}_t \end{aligned}$$
(11)
$$\begin{aligned} \varvec{o}_t&= g(\textbf{V} \cdot \varvec{h}_t + \varvec{b}_o) \end{aligned}$$
(12)

The update gate \(\varvec{z}_t\) and the reset gate \(\varvec{r}_t\) are computed using a linear combination of the input \(\varvec{x}_t\) and the previous hidden state \(\varvec{h}_{t-1}\), passed through a sigmoid activation function \(\sigma \). The linear combination in the update gate is determined by the weight matrices \(\mathbf {U_z}\), \(\mathbf {W_z}\) and a bias term \(\varvec{b}_z\), while the reset gate is determined by \(\mathbf {U_r}\), \(\mathbf {W_r}\) and a bias term \(\varvec{b}_r\).

The candidate hidden state \(\tilde{\varvec{h}}_t\) is computed using the input \(\varvec{x}_t\) and a reset version of the previous hidden state \(\varvec{h}_{t-1}\) modulated by \(\varvec{r}_t\), using \(\textbf{U}\) and \(\textbf{W}\) along with the reset gate and a bias term \(\varvec{b}_h\). The new hidden state \(\varvec{h}_t\) is computed as an interpolation between the previous hidden state \(\varvec{h}_{t-1}\) and the candidate hidden state \(\tilde{\varvec{h}}_t\), weighted by \(\varvec{z}_t\). Finally, the output \(\varvec{o}_t\) is obtained by a linear transformation of the hidden state \(\varvec{h}_t\) using the weight matrix \(\textbf{V}\) and a bias term \(\varvec{b}_o\), passed through an activation function g that can vary depending on the task.

These equations allow GRUs to retain relevant information for longer durations in long sequences, thus overcoming some problems inherent in standard RNNs, such as the vanishing gradient problem (Zhao et al. 2021).
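The gating mechanism of Eqs. (8) to (11) can be sketched as follows. Eq. (10) is read here as \(\varvec{r}_t \otimes (\textbf{W} \cdot \varvec{h}_{t-1})\); some formulations apply the reset gate before the matrix product instead. The dimensions, weights and the `params` layout are assumptions made for the example, and the output layer of Eq. (12) is omitted.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gru_step(x_t, h_prev, params):
    """One GRU step; params maps 'z', 'r', 'h' to a (U, W, b) tuple."""
    def gate(name):
        U, W, b = params[name]
        return [sigmoid(u + w + bi)
                for u, w, bi in zip(matvec(U, x_t), matvec(W, h_prev), b)]

    z_t = gate("z")                                   # Eq. (8), update gate
    r_t = gate("r")                                   # Eq. (9), reset gate
    U, W, b_h = params["h"]
    h_cand = [math.tanh(u + r * w + bi)               # Eq. (10), candidate state
              for u, r, w, bi in zip(matvec(U, x_t), r_t, matvec(W, h_prev), b_h)]
    # Eq. (11): interpolate between the previous and the candidate state.
    return [(1.0 - z) * hp + z * hc for z, hp, hc in zip(z_t, h_prev, h_cand)]

# Toy parameters (assumed): 2 inputs, 2 hidden units, shared across gates.
U = [[0.5, -0.3], [0.2, 0.4]]
W = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]
params = {name: (U, W, b) for name in ("z", "r", "h")}

h = [0.0, 0.0]
for x_t in [[0.2, 0.5], [0.0, 0.6], [0.4, 0.7]]:
    h = gru_step(x_t, h, params)
```

When the update gate is close to zero, the previous hidden state passes through almost unchanged, which is how the unit preserves information over long sequences.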

1D Convolutional neural networks

1D CNNs are a specific type of CNN designed for processing sequential data or time series. These networks offer distinct advantages in handling multivariate inputs and outputs and are particularly effective in identifying relevant features for prediction tasks without directly learning from lagged observations (Ismail Fawaz et al. 2019; Indolia et al. 2018; Kiranyaz et al. 2021).

A typical CNN model consists of three primary types of layers: convolutional, pooling, and a final layer that is commonly composed of a fully connected ANN. Additionally, CNNs can be hybridized with RNNs by using a recurrent layer (RNN, LSTM, or GRU) as the final layer, chosen according to the specific objectives of the model (Le et al. 2020; Hu et al. 2020; Yu et al. 2021; Vatanchi et al. 2023; Khorram and Jehbez 2023). Figure 2 shows a scheme of a CNN and its main components.

  • Convolutional Layer: The convolutional layer employs filters, often denoted by \(k = 1, \ldots , N_k\), with dimensions \((L \times 1)\). These filters slide across the input sequence, applying a dot product operation at each position. This process enables the network to detect specific patterns or features. Compared to fully connected layers, convolutional layers are more resource-efficient, requiring fewer parameters. The output of this layer then undergoes a non-linear activation function (Indolia et al. 2018; Goodfellow et al. 2016).

  • Pooling Layer: The pooling layer follows the convolutional layer and serves to reduce dimensionality and emphasize essential features. It often uses operations like taking the maximum or average value over a defined segment of the data. This helps in reducing noise and retaining significant patterns (Indolia et al. 2018; Kiranyaz et al. 2021).

  • Processing Layer (Fully Connected or RNN/LSTM/GRU): This final layer takes the processed information from the preceding layers and makes predictions or classifications based on the task at hand. The choice between a fully connected layer and a recurrent layer like RNN, LSTM, or GRU depends on the specific requirements of the model (Hu et al. 2020; Yu et al. 2021; Dehghani et al. 2023).

The hierarchical structure of CNN makes it especially effective for noisy data. These networks filter out irrelevant noise layer by layer, focusing only on the significant patterns (Ismail Fawaz et al. 2019; Le et al. 2020). Their efficiency and feature-detection capabilities make CNN an appealing choice for applications such as time-series forecasting.
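The convolution and pooling operations described above can be illustrated with a small sketch. It assumes a single filter of length L = 2, a ReLU activation and non-overlapping max pooling; the rainfall values and the filter weights are invented for the example.

```python
def conv1d(series, kernel, bias=0.0):
    """Slide the filter over the series (no padding) and apply a ReLU activation."""
    length = len(kernel)
    out = []
    for start in range(len(series) - length + 1):
        window = series[start:start + length]
        value = sum(w * x for w, x in zip(kernel, window)) + bias  # dot product
        out.append(max(0.0, value))
    return out

def max_pool1d(features, size=2):
    """Non-overlapping max pooling; a trailing incomplete window is dropped."""
    return [max(features[i:i + size])
            for i in range(0, len(features) - size + 1, size)]

rain = [0.0, 0.0, 1.0, 3.0, 2.0, 0.5, 0.0, 0.0]   # toy hourly rainfall (assumed)
rising_edge = [-1.0, 1.0]                          # hypothetical filter: responds to increases

features = conv1d(rain, rising_edge)   # non-zero only where rainfall increases
pooled = max_pool1d(features, size=2)  # shorter sequence keeping the strongest responses
```

In a trained network the filter weights are learned rather than set by hand, and several filters are applied in parallel.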

Training process

The training process for a DL model involves several steps. First, weight matrices U, W, V are initialized randomly. The optimization of these weights is performed using a gradient-based optimization algorithm, typically some variant of gradient descent. The aim is to find an optimal set of weights that minimizes a specified loss function. This iterative process continues until convergence, resulting in the smallest possible loss (LeCun et al. 2015; Goodfellow et al. 2016). For regression tasks, the Mean Squared Error (MSE) is often the loss function of choice.

$$\begin{aligned} \mathcal {L} = \frac{1}{N} \sum _{i=1}^{N} ( \hat{y}^{(i)} - y^{(i)} )^2 \end{aligned}$$
(13)

It calculates the average of the squared differences between the predicted outputs, \(\hat{y}^{(i)}\), and the true outputs, \(y^{(i)}\). Here, N denotes the number of samples. The loss function computes the error across all time steps, a reflection of both the sequential nature of the data and the architectural design of the RNN.

An essential part of the training process is hyperparameter tuning. It involves systematically exploring various hyperparameters to optimize the performance of the model. For RNNs, typical hyperparameters include the learning rate, batch size, number of hidden layers, number of units in each layer, and dropout rate. On the other hand, when dealing with CNNs, the hyperparameters to tune are the kernel size, number of filters, and stride, along with some of the hyperparameters commonly adjusted for RNNs. Various strategies such as trial and error, grid search, or random search can be employed to find the set of hyperparameters that provide the best performance on a validation set (Goodfellow et al. 2016; LeCun et al. 2015; Indolia et al. 2018).

Input variables selection

Selecting the optimal input subset is a critical component of implementing data-driven models that is often overlooked (Wu et al. 2014). The appropriate input selection maximizes the interdependence between input and output variables, which in turn improves model accuracy (May et al. 2011; Amaranto and Mazzoleni 2023). When the dynamics of the system are not fully understood, the application of input variable selection (IVS) becomes essential, although it significantly increases the computational requirements. Here, it is crucial to find an algorithm that balances computational efficiency with exploring the full input space (Amaranto and Mazzoleni 2023).

Among the possibilities for IVS, a common approach is to first train a model using only one input. Subsequently, additional input variables are incorporated in descending order of their occurrence in the past with the dependent variable. A new model is trained at each iteration with the updated set of input variables. This iterative process continues until the addition of new variables no longer contributes to an improvement in model accuracy (Amaranto and Mazzoleni 2023; Wu et al. 2014).

Monte Carlo Dropout

Monte Carlo Dropout (MCD) is a technique introduced by Gal and Ghahramani (2016) for quantifying uncertainty in DL models. It provides a computationally efficient alternative for uncertainty assessment without necessitating any alterations to the existing model architecture. The key innovation in MCD is the employment of dropout layers during not just the training phase, but also the inference phase. Each time an inference is performed, MCD sets a random subset of weights to zero, thus leading to different output values. These diverse outputs capture the inherent variability in model predictions and offer an insight into their uncertainty. By averaging over T stochastic forward passes through the model during the testing phase, MCD yields an estimate of both the predictive mean and variance. This approach has been validated and adopted in other related DL studies (Althoff et al. 2021; Klotz et al. 2022).
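The idea can be sketched with a tiny fixed-weight network in which dropout stays active at inference. This is only an illustration: the network, its weights, the dropout rate and the number of passes T are assumptions, and dropout is implemented in the common way of masking hidden units (with rescaling of the kept units).

```python
import random
import statistics

def forward_with_dropout(x, weights, rate, rng):
    """One stochastic forward pass of a one-hidden-layer ReLU network."""
    w_hidden, w_out = weights
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    keep = 1.0 - rate
    # Inverted dropout: drop each hidden unit with probability `rate`, rescale the rest.
    dropped = [h / keep if rng.random() < keep else 0.0 for h in hidden]
    return sum(w * h for w, h in zip(w_out, dropped))

def mc_dropout_predict(x, weights, rate=0.1, T=200, seed=0):
    """Run T stochastic passes and summarize them by their mean and variance."""
    rng = random.Random(seed)
    samples = [forward_with_dropout(x, weights, rate, rng) for _ in range(T)]
    return statistics.mean(samples), statistics.variance(samples), samples

# Toy network and input (assumed values).
w_hidden = [[0.5, 0.2], [0.1, 0.4], [0.3, 0.3], [0.2, 0.1]]
w_out = [0.4, 0.3, 0.2, 0.1]
weights = (w_hidden, w_out)
x = [0.3, 0.6]

mean, variance, samples = mc_dropout_predict(x, weights, rate=0.1, T=200)
```

The spread of `samples` can also be summarized with percentiles to draw an uncertainty band around the mean prediction.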

Goodness-of-fit evaluation

After the training process, a model should be evaluated using previously unseen data. One common metric used for this purpose is the Nash-Sutcliffe Efficiency coefficient (NSE) (Nash and Sutcliffe 1970). The NSE is calculated by comparing the observed and simulated streamflow values using the following equation:

$$\begin{aligned} NSE = 1-\frac{\sum _{i=1}^{N}(y^{(i)}-\hat{y}^{(i)})^2}{\sum _{i=1}^{N} (y^{(i)}-\overline{y})^2} \end{aligned}$$
(14)

where N is the number of data points, \(y^{(i)}\) and \(\hat{y}^{(i)}\) denote observed and simulated discharge values at time i, respectively and \(\overline{y}\) represents the mean of the observed discharge series. An NSE of 1 signifies a perfect match between observed and simulated discharge values.

The Weighted Nash-Sutcliffe Efficiency (WNSE) is a modification of the NSE that weights the differences between observed and simulated discharge values according to the magnitude of the observed discharge values (Farfán and Cea 2022). The WNSE is calculated as follows:

$$\begin{aligned} WNSE = 1-\frac{\sum _{i=1}^{N}w_i(y^{(i)}-\hat{y}^{(i)})^2}{\sum _{i=1}^{N}w_i(y^{(i)}-\overline{y}_{\textbf{w}})^2} \end{aligned}$$
(15)

Here, \(\overline{y}_{\textbf{w}}\) is the weighted mean, calculated with the vector of weights \(\textbf{w}\). Then, \(w_i \in \textbf{w}\) is the weight for each time step i, which is calculated as:

$$\begin{aligned} w_i = \frac{(y^{(i)})^p}{\sum _{j=1}^{N}(y^{(j)})^p} \end{aligned}$$
(16)

where p is an exponent that allows the weighting of high versus low discharge values. Positive values of p enhance good peak flow predictions, while negative values of p give priority to a good fit in the low-flow region. Note that when \(p=0\), the conventional NSE is obtained.
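A minimal sketch of Eqs. (14) to (16) is given below, mainly to make the role of the exponent p concrete. The observed and simulated series are invented toy values; note that negative values of p require strictly positive observed discharges.

```python
def nse(obs, sim):
    """Nash-Sutcliffe Efficiency, Eq. (14)."""
    mean_obs = sum(obs) / len(obs)
    num = sum((y - s) ** 2 for y, s in zip(obs, sim))
    den = sum((y - mean_obs) ** 2 for y in obs)
    return 1.0 - num / den

def wnse(obs, sim, p=1.0):
    """Weighted NSE, Eqs. (15) and (16); weights grow with discharge when p > 0."""
    powered = [y ** p for y in obs]
    total = sum(powered)
    w = [v / total for v in powered]                   # Eq. (16), weights sum to 1
    mean_w = sum(wi * y for wi, y in zip(w, obs))      # weighted mean of observations
    num = sum(wi * (y - s) ** 2 for wi, y, s in zip(w, obs, sim))
    den = sum(wi * (y - mean_w) ** 2 for wi, y in zip(w, obs))
    return 1.0 - num / den

obs = [1.0, 2.0, 8.0, 3.0, 1.5]   # toy discharge series with one peak (assumed)
sim = [1.1, 2.2, 6.0, 3.1, 1.4]   # simulation that underestimates the peak

score_nse = nse(obs, sim)
score_peak = wnse(obs, sim, p=2)  # penalizes the peak error more strongly
score_flat = wnse(obs, sim, p=0)  # uniform weights, equal to the NSE
```

In this toy case the simulation misses the peak, so the score with p = 2 is lower than the conventional NSE.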

Streamflow prediction

In the context of streamflow prediction, RNNs can be used to model the relationship between precipitation and previous streamflow data. The input to the network is a sequence of previous precipitation and streamflow observations, and the output is the predicted streamflow. By training the network on historical data, it can learn to forecast future streamflow values, potentially aiding in water management and flood prediction.

Additionally, streamflow prediction can be enhanced by including future rainfall forecasts. This approach involves incorporating future time steps in the rainfall forecasts equal to the lead time. By doing so, the model can better anticipate changes in streamflow based on expected precipitation, providing more accurate and reliable predictions.

Methodology and settings

Methodology

The methodology followed in this study is organized into three main blocks, as depicted in Fig. 3. The first block focuses on the configuration of the DL models and the setting of the lead times for predictions. The second block concerns the preparatory and operational steps needed for training the DL models. First, the dataset is divided into training, validation, and testing subsets. Then the IVS is performed in order to choose the most relevant input variables for making the forecast. Following IVS, the hyperparameter tuning is performed iteratively to optimize the model accuracy. Finally, the DL model is trained using the best-performing hyperparameters. The final block is focused on model evaluation. Here, previously unseen data from the testing set are used. The MCD methodology is applied to assess the uncertainty associated with the predictions of the model. The entire methodology has been executed iteratively, cycling through all the pre-defined DL models and lead times for further evaluation, comparison, and comprehensive assessment of their predictive capabilities.

Fig. 3 Illustration of the three main blocks of the methodology: Model configuration, model training, and model evaluation

Fig. 4 Data division of the rainfall and streamflow time series for the Groba and Anllóns catchments

Settings

This study evaluated six different DL models: LSTM, GRU, CNN, CNN-LSTM, CNN-GRU, and CNN-RNN. These architectures were assessed at four distinct forecasting lead times: 1 hour, 3 hours, 6 hours, and 12 hours.

The dataset was partitioned such that 80% was designated for model training and cross-validation, leaving the remaining 20% for model evaluation. This distribution is illustrated in Fig. 4. Furthermore, statistical summaries of both rainfall and streamflow data for the Groba and Anllóns catchments are provided in Tables 1 and 2. The data for training were normalized using the min-max scaling method to ensure uniformity in the model input values.
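The partition and scaling steps can be sketched as follows. The sketch assumes a chronological split (earliest 80% for training and validation) and that the minimum and maximum are taken from the training portion only and then reused for the evaluation data; neither detail is specified above, and the series is a toy example.

```python
def chronological_split(series, train_fraction=0.8):
    """Keep the earliest part for training/validation and the rest for evaluation."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]

def fit_min_max(train):
    """Scaling parameters estimated from the training data only (assumption)."""
    return min(train), max(train)

def scale(values, lo, hi):
    return [(v - lo) / (hi - lo) for v in values]

def unscale(values, lo, hi):
    """Map scaled predictions back to physical units."""
    return [v * (hi - lo) + lo for v in values]

flow = [2.0, 4.0, 6.0, 10.0, 8.0, 3.0, 5.0, 12.0, 7.0, 1.0]   # toy streamflow series

train, test = chronological_split(flow)
lo, hi = fit_min_max(train)
train_scaled = scale(train, lo, hi)   # lies in [0, 1] by construction
test_scaled = scale(test, lo, hi)     # may fall slightly outside [0, 1]
```

Reusing the training minimum and maximum avoids leaking information from the evaluation period into the model inputs, at the cost of scaled evaluation values that can fall outside [0, 1].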

Table 1 Statistical description of rainfall (mm/h) and streamflow (m\(^3\)/s) data in the Groba Basin
Table 2 Statistical description of rainfall (mm/h) and streamflow (m\(^3\)/s) data in the Anllóns Basin
Table 3 Input Sequences for the DL models utilizing time-lagged variables Q (Streamflow) and R (Rainfall)
Table 4 Optimized settings of the DL models for the Groba Basin Post-IVS and Hyperparameter Tuning
Table 5 Optimized settings of the DL models for the Anllóns Basin Post-IVS and Hyperparameter Tuning

To determine the best-performing input sequences, the IVS method consisted of the incremental incorporation of both streamflow (Q) and rainfall (R) data from previous time steps. For instance, if the aim is to forecast Q(t), the model is first trained with the R and Q data from the prior time step: Q(t-1) and R(t-1). Then, an additional time step is added to the sequence (Q(t-1), Q(t-2), R(t-1), and R(t-2)) and the model is trained again. This process continues until adding a further previous time step no longer improves, or worsens, the training performance. We applied this approach in order to minimize computational demands, given the various model architectures evaluated. The sequences of input variables for training the models are illustrated in Table 3.
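The incremental procedure can be summarized by the loop below. `train_and_score` is a hypothetical placeholder for training a model with a given number of lagged time steps and returning an error measure (lower is better); the toy error values are invented to show the stopping behavior.

```python
def select_number_of_lags(train_and_score, max_lags=24):
    """Add one lagged time step at a time until the score stops improving.

    train_and_score(n_lags) is a hypothetical callable that trains a model on
    Q(t-1..t-n_lags) and R(t-1..t-n_lags) and returns an error value.
    """
    best_lags, best_error = 1, train_and_score(1)
    for n_lags in range(2, max_lags + 1):
        error = train_and_score(n_lags)
        if error >= best_error:      # no improvement, or deterioration: stop
            break
        best_lags, best_error = n_lags, error
    return best_lags, best_error

# Toy errors (assumed): the fourth lag is worse than the third, so the search stops there.
toy_errors = {1: 0.30, 2: 0.22, 3: 0.19, 4: 0.21, 5: 0.18}
best_lags, best_error = select_number_of_lags(lambda n: toy_errors[n], max_lags=5)
```

Because the search stops at the first non-improving step, it never evaluates the fifth lag in this example; this is the trade-off that keeps the computational cost low.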

Furthermore, to enhance the streamflow prediction, future rainfall forecasts are included in the model. This approach incorporates future time steps of rainfall equal to the forecast lead time. For instance, if the objective is to predict Q(t+3), the model uses R(t), R(t+1), R(t+2), and R(t+3) in addition to previous rainfall and streamflow data. This experiment seeks to allow the model to anticipate changes in streamflow more accurately based on predicted precipitation.
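As an illustration of how such inputs could be assembled, the sketch below appends the rainfall steps R(t) to R(t+lead) to the lagged Q and R values when the target is Q(t+lead). The function name `inputs_with_future_rain` and the ordering of the features within each row are assumptions made for this example only.

```python
def inputs_with_future_rain(q, r, n_lags, lead):
    """Features for forecasting Q(t+lead) from time t.

    Past part:   Q(t-1..t-n_lags) and R(t-1..t-n_lags)
    Future part: R(t), R(t+1), ..., R(t+lead)
    """
    X, y = [], []
    # Stop early enough that both R(t+lead) and Q(t+lead) exist.
    for t in range(n_lags, len(q) - lead):
        past_q = [q[t - k] for k in range(1, n_lags + 1)]
        past_r = [r[t - k] for k in range(1, n_lags + 1)]
        future_r = [r[t + k] for k in range(0, lead + 1)]
        X.append(past_q + past_r + future_r)
        y.append(q[t + lead])
    return X, y
```

For a 3-hour lead time, each row therefore carries four rainfall values beyond the lagged inputs, matching the Q(t+3) example in the text.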

The training of the DL models was carried out using TensorFlow (Abadi et al. 2016). Hyperparameters were tuned through trial and error, considering different numbers of units, kernel sizes, activation functions, such as hyperbolic tangent and swish (Ramachandran et al. 2017), and the Adam optimizer (Kingma and Ba 2014). The number of epochs was set to 100. The input data were scaled to the range [0,1] using a min-max normalization method (Patro and Sahu 2015; Amaranto and Mazzoleni 2023). A dropout rate was applied to mitigate the risk of overfitting. This rate was explored within a range of 0.05 to 0.2. The MSE served as the loss function. The batch sizes in the experiments ranged from 128 to 1024 samples for most models. An early stopping mechanism was also deployed, with a patience parameter set at 5 epochs.
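The min-max normalization mentioned above can be illustrated with a short sketch. It assumes that the minimum and maximum are taken from the training subset only and then reused for the evaluation data; the text does not state this detail, and the helper names `fit_min_max`, `scale`, and `unscale` are hypothetical.

```python
def fit_min_max(train_values):
    """Return the (min, max) of the training data (assumed fitting set)."""
    lo, hi = min(train_values), max(train_values)
    if hi == lo:
        raise ValueError("a constant series cannot be min-max scaled")
    return lo, hi


def scale(values, lo, hi):
    """Map values to [0, 1] relative to the fitted range."""
    return [(v - lo) / (hi - lo) for v in values]


def unscale(scaled, lo, hi):
    """Invert the scaling to recover physical units (denormalization)."""
    return [s * (hi - lo) + lo for s in scaled]
```

Under this assumption, evaluation values larger than the training maximum map to numbers above 1, which is one reason why the fitting range is worth reporting alongside the scaled results.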

For the LSTM models applied to both rivers, three layers were utilized. The number of units in these layers varied: for the Groba river, the unit range was primarily between 32 and 128 for the first and second layers, and between 8 and 16 for the third layer. Similarly, for the Anllóns river models, the third layer was fine-tuned within a range of up to 16 units. Across all LSTM models, the swish function served as the activation function. GRU models were configured with two layers. These were fine-tuned with a range of 32 to 128 units for the first layer and 4 to 64 for the second layer. A dropout rate of 0.1 was applied to minimize overfitting, and the swish function was used as the activation function.

CNN models for both rivers consisted of four layers. The number of filters ranged from 128 to 256, a range consistent across both rivers. Kernel sizes varied from 2 to 4, and padding was set to ensure the output dimensions matched the input dimensions. Hybrid models, blending features from both CNNs and RNNs, featured two convolutional layers. The number of filters tested ranged from 64 to 256, and kernel sizes varied from 2 to 4. Subsequent layers employed a varying number of units, ranging from 16 to 50.

The MCD methodology was implemented with a neuron dropout probability set at 0.1 and \(T=50\) stochastic forward passes for each trained model. These runs were used to calculate the mean of the predicted time series. Goodness-of-fit metrics were then computed based on this mean, and 95% confidence intervals were applied to provide insights into model uncertainty. The use of MCD serves two primary purposes: first, it addresses inherent model uncertainty; second, it aims to minimize biases in the analyses. The final structures of the applied models are summarized in Tables 4 and 5.
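The aggregation of the stochastic forward passes can be sketched as follows. This is a hedged illustration only: `toy_predict` is a hypothetical stand-in for a trained network evaluated with dropout kept active, and the interval is built as mean ± 1.96 standard deviations, which is a Gaussian assumption that the text does not confirm (empirical percentiles would be an equally plausible choice).

```python
import random
import statistics


def mc_dropout_summary(stochastic_predict, x, n_passes=50, z=1.96):
    """Run n_passes stochastic forward passes for input x and return
    the mean prediction and an approximate 95% interval (mean +/- z*std)."""
    samples = [stochastic_predict(x) for _ in range(n_passes)]
    mean = statistics.fmean(samples)
    std = statistics.stdev(samples)
    return mean, (mean - z * std, mean + z * std)


def toy_predict(x, rng=random.Random(0), drop_p=0.1):
    """Hypothetical one-layer 'network': each weight is dropped with
    probability drop_p and the kept weights are rescaled by 1/(1-drop_p).
    The shared rng default keeps the toy example reproducible."""
    weights = [0.5, 0.3, 0.2]
    kept = [0.0 if rng.random() < drop_p else w / (1.0 - drop_p)
            for w in weights]
    return sum(w * xi for w, xi in zip(kept, x))


mean, (lower, upper) = mc_dropout_summary(toy_predict, [1.0, 2.0, 3.0])
```

Applied at every time step of a forecast series, the same summary would yield the mean hydrograph and its uncertainty band.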

Table 6 Training and cross validation results for the applied DL models in the Groba and Anllóns catchments

Results

Training and validation

A consistent trend of increasing errors was observed as lead times extended. This could be expected and likely stems from the accumulation of uncertainties as forecasts extend further in time. However, no signs of overfitting were observed, as evidenced by the relatively close alignment of training and validation error metrics for each model. For better illustration, the reader is referred to the appendix section, where the corresponding scatter plots for both rivers are presented. Table 6 reports the MSE loss function for predictions made across lead times ranging from 1 to 12 hours.

The difference in model performance across the two catchments arises because the error metrics were calculated after denormalization. Therefore, the larger flow rates in the Anllóns may imply magnified errors, especially evident in the MSE metrics. However, this does not imply that the models were less accurate in the Anllóns river.
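The scale dependence of the MSE can be made explicit with a small worked example. Under min-max scaling, an error in normalized units is multiplied by the flow range when converted back to physical units, so the MSE grows with the square of that range. The flow ranges used below are hypothetical round numbers chosen for illustration, not values from Tables 1 and 2.

```python
def mse(obs, sim):
    """Mean squared error between two equally long series."""
    return sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs)


def denormalized_mse(mse_scaled, q_min, q_max):
    """MSE in physical units implied by an MSE computed on
    min-max scaled data with range [q_min, q_max]."""
    return mse_scaled * (q_max - q_min) ** 2


# Same normalized error, two hypothetical flow ranges (m^3/s).
same_scaled_error = 1e-4
small_basin = denormalized_mse(same_scaled_error, 0.0, 10.0)
large_basin = denormalized_mse(same_scaled_error, 0.0, 100.0)
```

In this sketch the larger range yields an MSE one hundred times greater for an identical normalized error, which is why the MSE values of the two catchments are not directly comparable.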

Goodness-of-fit evaluation

The predictive performance of the DL models was assessed using the NSE coefficient and its variations, \(\mathrm {\overline{WNSE}_{high}}\) with \(p = 1\) and \(\mathrm {\overline{WNSE}_{low}}\) with \(p =-0.5\), for high and low flows, respectively. The evaluations were segmented into short-term, medium-term, and long-term predictions, characterized by lead times of 1 hour, 3 to 6 hours, and 12 hours, respectively. These evaluations utilized average predictions from 50 simulations using the MCD technique.
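For reference, a weighted NSE of this kind can be computed as in the sketch below. It assumes the common formulation in which each squared error is weighted by the observed flow raised to the power \(p\), with the unweighted mean of the observations in the denominator; the exact definition used in the study is given outside this section, so this form should be read as an assumption. The function name `wnse` is hypothetical.

```python
def wnse(obs, sim, p=0.0):
    """Weighted Nash-Sutcliffe efficiency with weights obs**p.

    p = 0 gives the standard NSE, p > 0 emphasises high flows and
    p < 0 emphasises low flows. Observations must be strictly positive
    when p is negative, otherwise the weight is undefined.
    """
    weights = [o ** p for o in obs]
    mean_obs = sum(obs) / len(obs)
    num = sum(w * (o - s) ** 2 for w, o, s in zip(weights, obs, sim))
    den = sum(w * (o - mean_obs) ** 2 for w, o in zip(weights, obs))
    return 1.0 - num / den
```

A perfect forecast yields 1 for any \(p\), while a forecast equal to the mean of the observations yields 0, as in the standard NSE.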

For the 1-hour lead time, the models of the Groba basin exhibited significant predictive capabilities. The GRU model for this river yielded the highest NSE value at 0.969, with \(\mathrm {\overline{WNSE}_{high}}\) and \(\mathrm {\overline{WNSE}_{low}}\) values of 0.933 and 0.977, respectively, indicating robust performance. On the other hand, the CNN-RNN configuration for the same basin presented lower values, obtaining an NSE value of 0.933 and even lower values for high and low flows. In the Anllóns basin, all the models achieved NSE values close to 0.99, indicating strong predictive performance. Additionally, the \(\mathrm {\overline{WNSE}_{high}}\) and \(\mathrm {\overline{WNSE}_{low}}\) coefficients remained consistent across the six models. In both rivers, the GRU model showed slightly higher goodness-of-fit values than its counterparts.

Fig. 5 Deterioration of Goodness-of-Fit Values in the Groba and Anllóns Rivers: An evident decline in fit quality as lead time increases

For predictions with 3- and 6-hour lead times, the predictive performance in the Groba basin generally declined. The GRU model maintained its top performance, achieving an NSE value of 0.897 for a 3-hour lead time. However, at a 6-hour lead time, the CNN model surpassed other models with an NSE value of 0.796. Notably, there was a significant decline in \(\mathrm {\overline{WNSE}_{high}}\) values across all models (Fig. 5). In contrast, in the Anllóns basin, stable performance was observed, with the LSTM model showing slightly higher performance in the three coefficients. The CNN-LSTM configuration exhibited the lowest performance, with an NSE value of 0.960 at a 6-hour lead time.

For predictions with a 12-hour lead time, a more pronounced decline in performance was observed. In the Groba basin, the CNN model recorded an NSE value of 0.696. In the Anllóns basin, the CNN-LSTM registered the best values for NSE and the variations for high and low flows. At a 12-hour lead time, the models show the lowest performance, but still maintain a moderate fit as per the NSE scores. The CNN-RNN model also exhibits the lowest performance for the Anllóns River with an NSE score of 0.887.

The decline in predictive performance across different time horizons showed distinct patterns in the Groba and Anllóns basins (Table 5). The Groba basin experienced a significant decline in the \(\mathrm {\overline{WNSE}_{high}}\) values, which deteriorated from approximately 0.95 at a 1-hour lead time to around 0.2 at a 12-hour lead time. This steep decrease suggests substantial limitations in the ability of the models to capture the hydrological behavior in the Groba basin, particularly for the longer lead times.

In contrast, as illustrated in Fig. 5, the Anllóns basin showed a smaller reduction in performance metrics. Here, \(\mathrm {\overline{WNSE}_{high}}\) values started at nearly 0.99 for 1-hour predictions and declined to about 0.80 for 12-hour lead times. While there is a decline, it is comparatively less severe than in Groba, suggesting that the models applied to the Anllóns basin are more robust in capturing the higher flows over longer lead times.

Additionally, while LSTM and GRU are more effective in low flow predictions in the Groba basin, they perform almost equally well in both high and low flow conditions in the Anllóns basin. This observation is supported by the close alignment between the \(\mathrm {\overline{WNSE}_{high}}\) and \(\mathrm {\overline{WNSE}_{low}}\) values in the Anllóns basin, as opposed to the noticeable differences in the corresponding scores for the Groba basin. This might indicate that the models are less capable of accurately predicting high flow conditions in the Groba basin, a shortcoming that is further analyzed in the “Discussion” section.

The hybrid models, such as CNN-LSTM, CNN-GRU, and CNN-RNN, do not generally outperform the simpler LSTM and GRU architectures. Moreover, the goodness-of-fit values among these models are relatively close, indicating that the choice between them might not be critical for the rivers under study.

Table 7 Goodness-of-Fit Evaluation for Groba and Anllóns Rivers: A comparison across the six DL models and four lead times using the testing set
Table 8 Mean Relative Error and Standard Deviation of Peak Flow Rates Exceeding the 95th Percentile Threshold, Evaluated Across the six DL Models and Lead Times

Analysis of peak flows

This section evaluates the performance of the six DL models in predicting peak flow values that exceed the 95th percentile threshold in the observed series.
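The peak-flow statistics discussed in this section can be reproduced in outline with the sketch below. It assumes that the relative error is defined as (predicted - observed) / observed, expressed as a percentage, so that negative values denote underestimation, and that the 95th percentile is obtained by linear interpolation; both choices, like the function names, are assumptions made for illustration.

```python
import statistics


def percentile(values, q):
    """Percentile by linear interpolation between order statistics,
    with q given in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def peak_relative_errors(obs, sim, q=95.0):
    """Mean and standard deviation (in %) of the relative errors for
    the time steps whose observed flow exceeds the q-th percentile."""
    threshold = percentile(obs, q)
    errors = [100.0 * (s - o) / o
              for o, s in zip(obs, sim) if o > threshold]
    return statistics.fmean(errors), statistics.pstdev(errors)
```

A mean close to zero combined with a large standard deviation indicates errors that cancel on average but vary widely from peak to peak, which is the pattern described below for some models.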

Groba basin

In the Groba River, the models generally yield acceptable predictions of peak flows at shorter lead times (1-hour). However, the accuracy of these predictions deteriorates as the lead time extends.

At a 1-hour lead time, the predictions are consistent with the observed values, with some notable exceptions. The CNN-RNN model shows a low mean relative error of -0.88%, although it also exhibits the highest standard deviation of 18.55%, indicating a high level of uncertainty. Conversely, the GRU model demonstrates a more balanced performance, with a mean relative error of -9.05% and the lowest standard deviation of 15.2%, suggesting slightly higher accuracy and consistency (Tables 7 and 8).

At a 3-hour lead time, the models generally maintain acceptable performance within the flow range between 2 and 4 m\(^3\)/s (Fig. 6). However, their predictive capability diminishes for flows exceeding this range. This behavior remains consistent across all the models, as the dispersion of the points and the distribution of the relative errors remain similar, with mean errors ranging from -12 to -17% and standard deviations around 25% (Table 8). The CNN model shows a moderate level of performance at a 3-hour lead time, with a mean relative error of -12% and a standard deviation of 27.03%. Similarly, the CNN-LSTM model exhibits a mean relative error of -15.22% and a standard deviation of 26.12% at the same lead time, suggesting moderate performance.

Beyond the 6-hour lead time, there is a significant deterioration in predictive accuracy for peak flows. This decline is particularly evident in terms of the \(R^2\) coefficient, which shows near-zero values at a 6-hour lead time for all models, indicating widely dispersed errors (Table 8). At this lead time, the models tend to underestimate some observed peak flows to near-zero values, as depicted in Fig. 6, indicating poor performance for peak storm events. The LSTM model presented the widest relative error distribution.

Finally, at a 12-hour lead time, the predictions are severely underestimated by the models, with several peak flows predicted at near-zero values. Notably, the maximum peak of the observed series is predicted at a value close to 0 (Fig. 6). Additionally, the distribution of the relative errors is wide (Fig. 8). For instance, the CNN-RNN model has the highest mean relative error of -35.71% at a 12-hour lead time, and its standard deviation is 37.13%. Similarly, the LSTM model not only has a high mean relative error of -26.77% at a 12-hour lead time but also a high standard deviation of 40.84%, indicating both inaccuracy and high variability. This suggests that the predictions of the models are consistently poor (Table 8). The reasons behind these mispredictions are explored in depth in the “Discussion” section.

Fig. 6 Comparison of observed vs. predicted peak flows in the Groba basin at various lead times: a) 1-hour, b) 3-hour, c) 6-hour, d) 12-hour

Fig. 7 Comparison of observed vs. forecasted streamflow in the Anllóns basin at various lead times: a) 1-hour, b) 3-hour, c) 6-hour, d) 12-hour

Anllóns basin

In the case of the Anllóns River, the 1-hour predictions are very consistent and have a low standard deviation. The \(R^2\) coefficients are 0.96 for all models. The maximum values of the series are predicted very accurately and are closely aligned with the 1:1 line (Fig. 7). The best-performing model is the GRU model, which has the lowest mean relative error of 0.21% at a 1-hour lead time, with a standard deviation of 3.91%, indicating low uncertainty and good performance. On the other hand, the CNN-GRU model has the highest mean relative error (-4.42%) at a 1-hour lead time, but its standard deviation is 5.14%. Despite having the highest error, its predictive capabilities at this temporal horizon are still remarkable. Something similar happens with the CNN and CNN-RNN models, which have a wide distribution of relative errors, although on average they are around an error of 1% with standard deviations of approximately 5% (Table 8).

The results still show good performance for the 3-hour lead times. The fit to the 1:1 line is very consistent for all the models, with slight variations in the maximum values (Fig. 7). However, some differences are evident in the distribution of the relative errors. The mean of the LSTM model is approximately -5%, while for the CNN-GRU model, it is around 5%. The CNN model presents the least dispersed distribution of errors at this lead time. This is consistent with an underestimation of the peak flows by some models (CNN-RNN), while others present slight overestimations (CNN, CNN-GRU). At this lead time, the errors range from -5.99% (LSTM) to 0.65% (CNN-GRU), with a consistent standard deviation across models between 5% and 6%. The CNN model has the least dispersed performance, with a mean relative error of 3.49% and a standard deviation of 4.86% at a 3-hour lead time, suggesting good performance with low variation. The CNN-LSTM model has the highest standard deviation of 6.34% at a 3-hour lead time, indicating the highest level of uncertainty among the models for this lead time (Table 8).

The results show a slight deterioration in performance for the 6-hour lead times. While the predicted peak flows present a good fit with respect to the 1:1 line, the predictions exhibit more dispersion at these lead times. This is particularly evident for the GRU model, which shows significant overestimation in one of the peak flows, as shown in Fig. 7. The deterioration is also quantified by the reduction of the \(R^2\) coefficient, which decreased to 0.88. This still indicates good performance but is less impressive than at previous lead times. As expected, the distribution of the relative errors is wider than at previous lead times. Nevertheless, it presents a similar distribution between the LSTM, GRU, and CNN models, with a median around 5%. The CNN-LSTM, CNN-GRU, and CNN-RNN models are similar but have a lower median relative error, around -10%. In this case, the errors range from -6.51% (CNN-GRU) to 0.65% (GRU), with deviations between 9% and 10%. The CNN-GRU model has the worst performance, with a mean relative error of -6.51% and a standard deviation of 8.1% at a 6-hour lead time, indicating high uncertainty. The best performance could be attributed to the LSTM model, with a mean relative error of -1.08% and a standard deviation of 9.26% (Table 8).

Table 9 Comparison of NSE and weighted NSE for high and low flow conditions in the Groba and Anllóns rivers, using GRU models trained with and without future rainfall data
Fig. 8 Comparison of predicted versus actual streamflow in the Groba River basin across four forecast intervals: 1, 3, 6, and 12 hours. The upper row of plots represents models trained without future rainfall data, while the lower row includes models trained with future rainfall data. The color gradient in each plot indicates the absolute error magnitude

Regarding the 12-hour lead time, the models showed the poorest performance, which is expected. The peaks between 30 and 60 m\(^3\)/s started to deviate more from the 1:1 line. Additionally, while the maximum peak flows were less accurate than at previous lead times, they maintained acceptable accuracy. It is also notable that the \(R^2\) coefficient reduced to values of 0.65 (LSTM and CNN-GRU), which could be considered acceptable but also confirms the deterioration in the results. In the distribution of the relative errors, similar results were achieved. However, the presence of outliers in the LSTM model indicates some poor performances, which are present because of the underestimations observed in the scatter plots (Fig. 7). Another point to take into account for this lead time is that the consistency of previous lead times started to fade. For instance, the LSTM model presented the lowest mean errors, with values of 4%, but it was the most dispersed of the models, while the CNN model presented the highest mean error (-15%) with the lowest deviation (16.74%).

Fig. 9 Comparison of predicted and actual streamflow in the Groba River basin using GRU models. The top graph shows the performance of the model trained without future rainfall data, while the bottom graph includes future rainfall data in the training process

Fig. 10 Comparison of predicted versus actual streamflow in the Anllóns River basin across four forecast intervals: 1, 3, 6, and 12 hours. The upper row of plots represents models trained without future rainfall data, while the lower row includes models trained with future rainfall data. The color gradient in each plot indicates the absolute error magnitude

Prediction with future rainfall

In this section, we analyze the impact of including future precipitation data in the sequences used for streamflow prediction to determine if this approach yields better results. The GRU model, which showed relatively stable outcomes in the initial phase, was applied across four forecasting horizons: 1, 3, 6, and 12 hours. The hyperparameters and sequences were the same as those optimized during the sequence adjustment phase and the selection of input variables for the corresponding lead times, with the inclusion of rainfall steps matching the duration of the lead times, used as a proxy for the utilization of predicted precipitation data.

Fig. 11 Comparison of predicted and actual streamflow in the Anllóns River basin using GRU models. The top graph shows the performance of the model trained without future rainfall data, while the bottom graph includes future rainfall data in the training process

The inclusion of future rainfall data enhances the model’s performance. For example, in the Groba River basin at a 12-hour lead time, the NSE improved from 0.696 to 0.818 with the inclusion of future rainfall data (Table 9). This improvement is more evident under high flow conditions (\(\mathrm {\overline{WNSE}_{high}}\)), and particularly at longer time intervals. In the Groba River, an increase from 0.24 to 0.811 was observed. Additionally, although the coefficients increase, Fig. 8 illustrates only a slight improvement in the dispersion of points relative to the 1:1 line, together with a better fit of high flows, as shown in the 12-hour hydrograph presented in Fig. 9.

In the case of the Anllóns River, an enhancement in the goodness of fit was also observed after including future rainfall in the input sequences to the models. While the results for prediction horizons of 1 to 3 hours are similar, for 6 and 12 hours, the results show notable improvements. For example, the \(\mathrm {\overline{WNSE}_{high}}\) coefficient increased from 0.811 to 0.919 for a 12-hour lead time (Table 9). Likewise, Figs. 10 and 11 demonstrate that the improvements in fits are more pronounced at lead times of 6 and 12 hours.

Discussion

From the results presented in the “Results” section, it is evident that no single model consistently outperforms the others across all scenarios. This suggests that the choice of model may be highly context-dependent, influenced by the specific characteristics of the river basin being studied. For instance, the GRU model showed the highest \(\mathrm {\overline{NSE}}\) values for short-term predictions in the Groba basin, while in the Anllóns basin, all models performed nearly equally well, with \(\mathrm {\overline{NSE}}\) values close to 0.99. The hybrid models, such as CNN-LSTM, CNN-GRU, and CNN-RNN, did not show a significant improvement over the simpler LSTM and GRU architectures. This suggests that the added complexity of these hybrid models may not necessarily translate into better predictive performance.

The inherent complexity of hydrological processes means that predictions become increasingly challenging as lead times extend. This is particularly evident for the Groba river, where model performance declines sharply with longer lead times, especially under high flow conditions. The steep drop in the \(\mathrm {\overline{WNSE}_{high}}\) values, from approximately 0.95 at a 1-hour lead time to around 0.2 at a 12-hour lead time, is significant. This decline, coupled with forecast values close to the minimum for peak flows, suggests significant limitations in the ability of the models to capture the hydrological behavior of the Groba basin for lead times longer than 1 hour (Fig. 12). In contrast, the Anllóns basin demonstrates more stable performance across varying time horizons. Even at a 12-hour lead time, the \(\mathrm {\overline{NSE}}\) values remain above 0.80, indicating a much better fit (Fig. 13).

Fig. 12 Visualization of the testing set for the Groba river, featuring observed values, mean forecast, and 95% confidence intervals at varying lead times: a) 1-hour, b) 3-hours, c) 6-hours, and d) 12-hours. The event data corresponds to November 24, 2017, at 20:00:00

Fig. 13 Visualization of the testing set for the Anllóns river, featuring observed values, mean forecast, and 95% confidence intervals at varying lead times: a) 1-hour, b) 3-hours, c) 6-hours, and d) 12-hours. The data for the two events are from March 13, 2018, at 21:00:00 and March 15, 2018, at 03:00:00

To understand these differing outcomes, it is crucial to consider that we are performing streamflow forecasts relying only on the relationship between past rainfall and runoff data and future runoff. The DL models use sequences of these past data for their predictions. In these sequences, rainfall data informs the volume of water entering the system, while runoff data indicates the current volume flowing through it. Together, these variables feed the model with information on the state of the basin at a specific time window, setting the stage for subsequent runoff predictions. Key hydrological characteristics such as basin area and slope significantly influence the dependencies between these input and output variables. The time of concentration, understood as the time it takes for water to flow from the most remote point in a drainage basin to its outlet, is a parameter that can synthesize these dependencies (Haan et al. 1994; Giandotti 1933).

In a basin with a smaller area, the time of concentration is shorter. This means that input variables like past rainfall or current runoff are relevant for the hydrological system for a shorter period compared to larger basins. In smaller basins, most of the rainfall volume exits the system within a few hours, rendering this information less useful for predicting future runoff. Similarly, the slope of the basin is a crucial factor. A steeper slope leads to higher flow velocity, which in turn results in shorter times of concentration (de Obras Públicas y Urbanismo 1978; Giandotti 1933).

In the present study, we have examined two basins with distinct characteristics: Groba, with an area of 17.05 km\(^2\), is considerably smaller than Anllóns, which has an area of 438 km\(^2\). Additionally, Groba has a steeper mean slope compared to Anllóns (27.4% vs. 12.1%). These differences result in varying times of concentration for each basin, as detailed in Table 10.
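To illustrate how area and slope enter such estimates, the sketch below implements the commonly cited forms of the Giandotti formula and of the Témez formula associated with the 1978 Spanish guideline. Only the two basin areas come from the text; the channel lengths, mean elevations above the outlet, and channel slopes are hypothetical placeholders, so the outputs are not the values of Table 10.

```python
import math


def tc_giandotti(area_km2, length_km, mean_elev_m):
    """Giandotti (commonly cited form), hours:
    tc = (4*sqrt(A) + 1.5*L) / (0.8*sqrt(H)),
    A in km^2, L = main channel length in km,
    H = mean basin elevation above the outlet in m."""
    return (4.0 * math.sqrt(area_km2) + 1.5 * length_km) / (
        0.8 * math.sqrt(mean_elev_m))


def tc_temez(length_km, channel_slope):
    """Temez (commonly cited form), hours:
    tc = 0.3 * (L / J**0.25)**0.76,
    L in km, J = mean channel slope in m/m."""
    return 0.3 * (length_km / channel_slope ** 0.25) ** 0.76


# Areas from the text; every other input is a HYPOTHETICAL placeholder.
small_steep = {"area_km2": 17.05, "length_km": 8.0,
               "mean_elev_m": 300.0, "channel_slope": 0.05}
large_gentle = {"area_km2": 438.0, "length_km": 50.0,
                "mean_elev_m": 250.0, "channel_slope": 0.01}

for basin in (small_steep, large_gentle):
    g = tc_giandotti(basin["area_km2"], basin["length_km"],
                     basin["mean_elev_m"])
    t = tc_temez(basin["length_km"], basin["channel_slope"])
    print(f"A={basin['area_km2']} km2: Giandotti={g:.1f} h, Temez={t:.1f} h")
```

Whatever the exact inputs, both expressions increase with channel length and area and decrease with relief or slope, which is the qualitative dependence invoked in the discussion.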

In the case of the Groba basin, Fig. 12 reveals that predictions for lead times ranging from 3 to 12 hours become increasingly inaccurate. This trend can be better understood by considering that the concentration time for rainfall in this basin is approximately 2 hours. Beyond this time frame, the influence of rainfall on streamflow predictions starts to wane, becoming negligible for 3 to 12-hour lead time forecasts.

Figure 12 also shows that predictions only begin to show a small volume of water several hours after the event has already taken place. This delayed response leads to incorrect peak flow predictions, often close to minimum values. The root of this error lies in the timing of the model: we are attempting to predict runoff 12 hours in advance, even though the current data within the model becomes irrelevant after a 2-3 hour lead time. Moreover, the data crucial for accurate prediction only enters the system 1-3 hours before the event. Consequently, it becomes challenging for DL models to establish a robust relationship between input and output variables over such extended lead times.

Table 10 Times of concentration for the Groba and Anllóns basins using the methods by de Obras Públicas y Urbanismo (1978) and Giandotti (1933)

On the contrary, in the Anllóns basin the accuracy of the predictions remains more consistent across the different lead times. Although the quality of the results does diminish with longer lead times, the predictions remain reliable up to a 6-hour lead time. For a 12-hour prediction, a delay can be observed in the alignment of the simulated data versus the observed data. This delay can be explained by the same reason mentioned above. However, since the Anllóns basin is larger than the Groba basin, this lack of information does not impact the results as severely as it does for the Groba River.

Even if more historical data from previous time steps were added in order to improve the predictions for longer lead times, that would not necessarily yield better results, because the sequence data only provides information about the state of the hydrological system within a time window shorter than the time of concentration. Therefore, adding more data to these sequences does not translate into more accurate models, as the additional data does not reflect the current state of the system. For example, in the case of the Groba basin, adding information from 15 previous time steps to predict streamflow at a 12-hour lead time does not improve the accuracy of the models. In such a case, the relevant information for accurate predictions is not found in distant past time steps but in those that are close to the event. Consequently, adding more data only makes the models more computationally expensive without improving their accuracy.

An additional experiment was developed in which future rainfall data was included for each lead time. The results showed that by incorporating future rainfall data, the GRU model gains insight into the times of concentration of the basin, enabling it to better understand and predict the flow dynamics over extended lead times. The model learns not only from past and present conditions but also anticipates future states, which is relevant for accurate predictions. The inclusion of future rainfall sequences allows the model to capture the lagged response of the basin to incoming precipitation, which in turn helps in better estimating the peak flow and its timing. As a result, the prediction accuracy improves, and the lead times of the forecasts are extended, making the model more robust and reliable for various hydrological applications.

Moreover, this method shows the potential of combining DL models with various sources of prediction data in our study region. By providing the models with anticipated rainfall from weather models or remote sensing information, they can utilize a wider range of data, boosting their predictive capabilities. This integration can be a significant next step, potentially improving the ability of the models to handle different hydrological scenarios and to adapt to varying data availability.

Conclusions

The current study evaluated the performance of six different DL models for predicting streamflow across various temporal horizons in two basins located in NW Spain, Groba and Anllóns. The findings highlight the intricate nature of hydrological forecasting and offer both key insights and areas for further investigation.

It became evident that no single DL model consistently outperformed the others in all scenarios. This suggests that the best model is context-dependent. Notably, the GRU model showed promising results, particularly for short-term predictions in the Groba basin. On the other hand, the added complexity of hybrid models like CNN-LSTM, CNN-GRU, and CNN-RNN did not necessarily lead to better predictive performance, calling into question their usefulness for this specific application.

The results provided insights into the significant challenges of maintaining predictive accuracy over longer lead times, especially in the Groba basin. This decline in performance is attributed to the relationship between the input variables and the specific hydrological characteristics of the basin. The time of concentration, influenced by factors such as basin area and slope, emerged as a critical parameter affecting the performance of the DL models as the forecast horizon increases.

The present work could serve as an initial approximation to the application of DL models for streamflow forecasting in the study region. Using satellite rainfall estimates instead of direct observations from rain gauges could offer an alternative for streamflow forecasting in areas lacking meteorological stations. Additionally, incorporating rainfall forecasts into DL models presents an opportunity to extend prediction lead times, a vital consideration for smaller basins. The outcomes of this work provide a comparative baseline for these future advancements.