Monthly Streamflow Modeling Based on Self-Organizing Maps and Satellite-Estimated Rainfall Data

Hydrological data provide valuable information for the decision-making process in water resources management, where long and complete time series are always desired. However, it is common to deal with missing data when working on streamflow time series. Rainfall-streamflow modeling is an alternative to overcome such a difficulty. In this paper, self-organizing maps (SOM) were developed to simulate monthly inflows to a reservoir based on satellite-estimated gridded precipitation time series. Three different calibration datasets from Três Marias Reservoir, composed of inflows (targets) and 91 TRMM-estimated rainfall data (inputs), from 1998 to 2019, were used. The results showed that the inflow data homogeneity pattern influenced the rainfall-streamflow modeling. The models generally showed superior performance during the calibration phase, whereas the outcomes varied depending on the data homogeneity pattern and the chosen SOM structure in the testing phase. Regardless of the input data homogeneity, the SOM networks showed excellent results for the rainfall-runoff modeling, presenting Nash–Sutcliffe coefficients greater than 0.90.


Introduction
There are several uses for water resources, such as human supply, irrigation, animal watering, electricity generation, navigation, dilution of effluents, fishing, recreation, and landscaping (Collischonn and Dornelles 2015). Therefore, the effective management of water resources is essential to social and economic development. Moreover, the proper administration may mitigate the impacts of various natural phenomena, such as droughts that directly affect water disposal and the energy sector (Santos et al. 2019). Better water resources management includes understanding and analyzing water-related problems, such as inadequate supplies to meet water demands (e.g., drinking, sanitation, and power needs), flood damages, and water pollution (Loucks and Beek 2017). Thus, learning about the environmental variables that affect streamflow processes, especially rainfall, is a primary activity (Muhammad et al. 2018). The rainfall-streamflow relationship is a complex and non-stationary process, presenting non-linear spatiotemporal characteristics (Lettenmaier and Wood 1993). The mentioned complexity can be associated with the hydrological systems' uncertainties.
Self-organizing maps (SOM) are artificial neural networks (ANN) that cluster input data according to their similarities (Kohonen 1982;Haykin 1999;Farias and Santos 2014). Composed by a set of neurons, SOM networks use unsupervised machine learning techniques to organize data to preserve neighboring relationships. Also known as Kohonen neural networks, they are a cluster analysis and prediction method.
Several studies reported the successful application of SOM for modeling environmental and water resources systems. Garcıá and González (2004), for instance, applied SOM to understand the behavior of input variables in the operation processes of a wastewater treatment plant. Novarini et al. (2019a, b) used the clustering capacity of SOM networks in the modeling stages of optimal pressure management in water supply systems. Voutilainen and Arvola (2017) employed SOM to cluster complex environmental data from a small boreal lake. Yotova et al. (2021) used SOM to water quality assessment on a river catchment scale. Adeloye et al. (2011) used SOM to predict the reference evapotranspiration. Farias and Santos (2014) and Farias et al. (2015) used SOM networks for runoff-erosion modeling. Gao et al. (2021), Mannan et al. (2018), Nourani et al. (2013), and Ismail et al. (2012 employed SOM to identify spatially homogeneous clusters of precipitation. Zhang et al. (2022) also used SOM to cluster homogeneous regions for flash floods in China. Wang and Sun (2022) successfully analyzed the applicability of SOM for the statistical downscaling of regional daily precipitation in China. Li et al. (2020), Gholami et al. (2022), Wu et al. (2021, and Lee et al. (2021) characterized and analyzed groundwater quality based on SOM networks. Lastly, Adeloye and Rustum (2012), Farias et al. (2012Farias et al. ( , 2013, and Da Silva Filho and Farias (2018) are examples of the application of SOM for rainfallstreamflow modeling.
Responsible management of water systems depends on parsimonious models that use accessible and reliable input variables. Therefore, satellite rainfall products may be interesting for filling missing data and hydrological analysis. The lack of continuous and long rainfall time series is common in developing countries. According to Gadelha et al. (2019), the Brazilian rain monitoring network has approximately 11,820 rain gauges, resulting in a density of about one rain gauge per 720 km 2 , lower than that recommended by WMO (1994), i.e., one rain gauge per 575 km 2 .
Studies using SOM networks to simulate streamflow based on gridded satellite-estimated rainfall data and naturalized inflows for calibration are scarce worldwide. In this study, we simulated streamflow using Tropical Rainfall Measuring Mission (TRMM) rainfall for a reservoir in the São Francisco River basin, Brazil.

Study Area
The Upper São Francisco River basin, located in the northeast central part of Brazil, lies between latitudes 18.125°S and 20.875°S and longitudes 43.875°W and 46.625°W (Fig. 1). It is a sub-basin of the São Francisco River, considered the third most relevant Brazilian river basin (Santos and Morais 2013). The Upper São Francisco River basin has about 49,574 km 2 and, for example, is larger than the Netherlands, Denmark, and Switzerland. As for the relief, it is noteworthy that it has a wavy topography, with elevations ranging from 600 to 1600 m, as observed in Fig. 1.
The region has two seasons: a rainy summer (October-March) and a dry winter (April-September). According to Köppen's classification (Alvares et al. 2013), the local climate types of the region vary from Cwa (dry winter and hot summer) to Cwb (dry winter and temperate summer) and Aw (tropical zone with dry winter). The average annual rainfall varies from around 2,000 mm in the central-western part to about 1,000 mm in the north-eastern part (Santos et al. , 2019. About 85% of the annual rainfalls occur in the rainy season, and the average temperature is around 22 °C, with the evaporation around 1,000 mm per year (MMA 2017).
The local biome is the Cerrado, a seasonal ecosystem characterized by dry and wet seasons. Land use and cover include savanna vegetation, agriculture and pasture fields, urban areas, barren land, and forests ). According to Klink and Machado (2005), tall and dense evergreen gallery forests form the vegetation, ranging from closed to open canopy deciduous and semi-deciduous with a maximum height of 15 m.

Streamflow Data
The streamflow records used in this work were the naturalized monthly inflows to Três Marias Reservoir, located in the upper part of the São Francisco River basin (Fig. 1a). Naturalized inflow means the quantity of water that would have entered the reservoir for which there have been no effects caused by diversion, storage, import, export, return flow, or changes in consumptive use. The Brazilian National System Operator (ONS; Operador Nacional do Sistema, available at: http:// www. ons. org. br), responsible for the operation of the Brazilian Interconnected System, provided such naturalized monthly inflow data, composed of 264 values from January 1998 to December 2019 (Fig. 2b). The obtained streamflow time series presented maximum and minimum values of 3,016.0 m 3 /s and 6.8 m 3 /s, respectively. It is worth noting the inhomogeneity of the streamflow time series. According to the Pettitt test (Pettitt 1979), there was an abrupt change in the time series in April 2013. Table 1 shows the main statistics of these data for the first and second hydrological period, i.e., 01/1998-04/2013 and 05/2013-12/2019, and for the entire time series, i.e.,

TRMM Rainfall Data
The TRMM was a joint mission between the National Aeronautics and Space Administration (NASA) and the Japan Aerospace Exploration Agency (JAXA), which provided critical precipitation measurements in the tropical and subtropical regions of the Earth (Plouffe et al. 2015;Teng et al. 2016;Santos et al. 2019). The monthly rainfall data were obtained at each 0.25° from 46.625°W to 43.875°W and 20.875°S to 18.125°S, considering only the grid-cell inside the study area's boundary, in a total of 91 time series (Fig. 1b). Each time series had 264 monthly records, referring to the monthly period between January 1998 and December 2019. Figure 2a also depicts the hyetograph based on TRMM-estimated rainfall data, considering the Thiessen-weighted average over the river basin ( Fig. 1a). At the same time, Table 1 shows the descriptive statistics for the inflow and average rainfall data.

Self-Organizing Maps
A SOM network is a competitive learning ANN model that performs a kind of vector quantization that captures the topology of the input space (Kohonen 1982). The main objective of a SOM is to transform an arbitrary-sized input signal pattern into a discrete one-or-twodimensional map. SOM models cluster the standard input data to represent similar patterns by the same output neurons or one of their neighbors. They consist of two layers: the input and the SOM or output layers. It is important to note that the grouping and mapping performed by the SOM network preserve the original topology of data, keeping essential information during the process. In this sense, each output neuron connects every input unit, and the weights of such links define a prototype vector (i.e., codebook vector) in the input space. Initially, the input data is normalized (i.e., the mean value is zero, and the variance is one). This stage is necessary to permit all the features to move in roughly the same ranges and, possibly, be treated by the SOM in the same way. The weights represent the link strength w ij between input neuron j and output neuron i. For training, the model computes the Euclidean distances ED i between the input vector x j and weights w ij connected to each output neuron j via: where x j is the j-th component of the input vector x; n is the dimension of the input vector x; I ranges from 1 to M, with M being the total number of output neurons of SOM layer; and m j is the function mask, usually assumed as one. Values of m j may be zero to exclude the contribution of the element x j in the calculation of the Euclidean distance, widely used when the input variable has missing values. The batch training algorithm was chosen between the two existing training types (incremental and batch modes) because it is generally much faster than the incremental one. The algorithm presents input vectors to the network using the whole dataset before updating any weight. In this sense, the search for the winning neuron is done for each input vector, and then the weight vector is moved to the average position of all input vectors for which there is a winner (best matching unit -BMU) or a neighbor of a winner.
The winner neuron and its neighbors are selected after the Euclidean distance is computed for all input vectors. In this method, the weights tend to stabilize after several presentations of the dataset. It is worth noting that this network training is unsupervised since there is no target outputs. In this training phase, the output neuron whose weight vector most closely matches the input data vector (i.e., the Euclidean distance is the lowest value) is selected as the winning neuron. In the present study, the Kohonen rule (Beale et al. 2012) updates the weights connected to each winner and neighbor neurons in a particular neighborhood radius.
The network training usually occurs in two distinct phases: ordering and tuning. In the first phase, training is limited by a given number of presentations of the complete dataset. The radius of the neighborhood starts with a given distance that decreases to the unit value. This measure allows the neuron weights to consistently organize themselves in the input space with neuron positions in the dimensional grid. The tuning phase uses the remaining number of presentations established for training. At this stage, the radius of the neighborhood is below unity, meaning that only the winner neuron weight is updated. Weights are expected to be modified relatively evenly in input space during this phase, maintaining the topology defined in the ordering phase (Beale et al. 2012).
Finally, it can be summarized that once the SOM network is trained, it is possible to use the model as a tool for forecasting or calculating variables. For this purpose, (i) the Euclidean distances between the input vectors and weights attached to the output neurons are calculated, disregarding the element j to be supplied, which is done by including a Boolean variable m j , as shown by Eq. (1). The variable m j is used to include (m j = 1) or exclude (m j = 0) the contribution of a given element j from the input vector to the calculation of Euclidean distances; (ii) the winning neuron is determined based on the shortest Euclidean distance; and (iii) the weight of the winning neuron connected to the missing element j of the input vector is used as the predicted value.

Datasets and Models
Three datasets were used, and three models were established for each dataset. The datasets correspond to three different study periods chosen due to the inhomogeneity of the complete streamflow time series, as discussed in Sect. 2.2. Dataset #1 was chosen to comprise the 1998-2019 period, dataset #2 related to the 1998-2014 period, whereas dataset #3 corresponded to the 2014-2019 period. Each of the three datasets was sub-divided into two sub-sets with 70% for calibration and 30% for testing. However, a hydrological time series may not be stationary, which difficulties the calibration of models and their future applications. Therefore, the division of the streamflow time series for calibration (70%) and testing (30%) was carried out in three different ways, as follows: (i) the calibration dataset comprising the first 70% of the time series (A), (ii) the calibration dataset comprising the last 70% of the time series (B), and (iii) the calibration dataset comprising the first and the last

SOM Architecture
The number of neurons M was determined based on García and González (2004) according to the following equation: where N is the total number of samples used in the calibration phase.
(2) M = 5 √ N Fig. 3 Division of the datasets for the rainfall-streamflow models The optimum number of neurons M for each dataset, i.e., input vectors, was calculated as ∼67 for dataset #1, ~56 for dataset #2, and ~36 for dataset #3. A square structure was chosen to represent the SOM grids. Thus, the 9 × 9, 8 × 8, and 6 × 6 grids were chosen with a hexagonal structure to be used in this work. The input data were scaled to improve the training efficiency of the SOM model. The scaling process normalizes the inputs to have a zero mean and unitary standard deviation (Beale et al. 2012). The batch mode algorithm was used during the calibration phase, and the datasets were presented to the SOM structure 3,000 times to ensure consistent learning. For the ordering phase, 100 presentations of the dataset, an initial neighborhood radius equal to three steps, and a learning rate of 0.90 were used. The tuning phase lasted the 2,900 remaining presentations, using a neighborhood distance inferior to one and a learning rate of 0.02. The SOM models were implemented using the Neural Network Toolbox of MATLAB R2020b.

Evaluation Metrics
The estimation accuracy of the models was quantified using the coefficient of determination (R 2 ), Nash-Sutcliffe efficiency (NSE) (Nash and Sutcliffe 1970), percent bias (RBIAS), and Normalized Root Mean Square Error (NRMSE), as shown in Eqs. (3)-(6). Such statistical metrics were chosen because they are widely used in hydrological analyses (Moriasi et al. 2015). In this study, the normalized version of the RMSE was chosen to compare the results from the different models due to the different orders of magnitude of the data in the time series (Viljanen et al. 2018).
(3) where O i are the observed values, C i are the values generated by the SOM models, O is the mean of observed values, C is the mean of the values generated by the SOM models, and s is the sample size. The coefficient of determination R 2 evaluates the model performance, with its values varying from 0 to 1, where the optimal value is 1. The NSE assesses the adjustment between observed and estimated data, ranging from −∞ to 1 with an optimal value equal to 1. RBIAS determines how well the model simulates the average magnitudes for the calculated data, varying from −∞ to ∞ with an optimal value equal to 0. Positive values indicate that the model is generally overestimating the measured data, and negative values indicate the opposite (the model is underestimating the values). NRMSE is the normalized version of the root mean square error, where lower NRMSE values would indicate higher accuracy, i.e., the best value is 0.

Calibration Phase
The performance of the SOM rainfall-streamflow models was evaluated using the four described metrics, i.e., R 2 , NSE, RBIAS, and NRMSE, for both calibration and testing phases. For the calibration phase, the metrics results are shown in Table 3, while the observed and calculated inflows are plotted in Fig. 4. In general, satisfactory results can be seen in this calibration phase, showing excellent correspondences between the maximum and minimum values for the analyzed time series. All models presented R 2 and NSE The best values of the obtained metrics are related to the D1C, D2C, and D3C models, which are the approaches that used the calibration dataset formed by the first and last 35% of the original time series. Dataset #3 also presented excellent results for R 2 and NSE indices with values close to 1 concerning D3C during the calibration phase. Then, it seems that when the data for calibration includes different hydrological patterns, they do not interfere with the results but rather improve the model performance. It is also worth highlighting that SOM performed well even using short time series (i.e., D3A, D3B, and D3C), presenting, in this case, the best metrics results. Dataset #3 has only 56 months for the calibration phase. Figure 5 shows the component planes for all models considering streamflow and rainfall from grid-cell #39. According to Beale et al. (2012) and Farias and Santos (2014), the component planes -which represent the weights associated with each input variable -allow the detection of similarities among variables. Lighter (yellow) and darker (black) colors respectively describe larger and smaller weights. The color gradient on each component plane appoints a direct (parallel gradients) or inverse (antiparallel gradients) correlation between two variables (García and González 2004). In Fig. 5, all models show that higher and lower flows are more directly related to current rainfall P(t) than any other input variables.

Testing Phase
The metrics related to the testing phase are shown in Table 4. Figure 6 shows a graphical comparison between the observed streamflow values and those calculated by the SOM Fig. 4 Hydrographs for the observed and calculated streamflow for the three datasets and the three different models applied during the calibration, i.e., dataset #1 (1998-2019), dataset #2 (1998-2014), and dataset #3 (2014-2019). Three models developed for each dataset: A = initial 70% for calibration; B = final 70% for calibration and C = divided 70% for calibration In general, unlike for the calibration phase, the results of the metrics vary and, therefore, are thoroughly discussed. For dataset #1 (1998-2019), the worst metric values were observed during the calibration using the first 70% of the streamflow time series (D1A). The R 2 and NSE coefficients presented values of 0.126, the RBIAS was 0.507, and the NRMSE was 0.87 (close to 1). As dataset #1 comprises the entire original streamflow time series, which has two different hydrological patterns (638.4 m 3 /s and 269.4 m 3 /s), such a pattern interferes with the simulation results. The performance of D1A, for example, can be classified as not satisfactory, according to Moriasi et al. (2015). However, when analyzing the same dataset #1, considering the last 70% of the time series for calibration (D1B) or considering the first and last 35% of the time series for calibration (D1C), the models' performances can be classified as good in general, with D1C presenting the following values: R 2 and NSE = 0.720, RBIAS = 0.047, and NRMSE = 0.561. Such an outcome shows that, although two patterns exist within the time series, models D1B and D1C could interpret such a difference and provide good simulation results. Even though the evaluation of SOM networks on dataset D1A provided poor metrics, according to Fig. 6a, the calculated series presents, in general, the same order of magnitude and patterns of increases/decreases exhibited by the observed data. In water resources management analysis, where only the magnitude order or the expected behavior of increase/ decrease is necessary, the model can still be effective. For cases that require a more precise value, the model may not be adequate. For dataset #2, which uses data referring to the period of the first hydrological pattern identified in the complete series, i.e., 1998-2013, the results show a particular variation in the derived metrics. The metrics obtained for the model D2A were considered appropriate, with R 2 and NSE equal to 0.689, RBIAS of −0.118, and NRMSE of 0.492. The values of R 2 and NSE are reasonable, as is the RBIAS, which is only slightly greater than 0.10; however, the NRMSE has a relatively high value, indicating a sensible difference between the observed and calculated streamflows.
Regarding the D2B model, the metrics are mostly better than those found for the D2A approach, with R 2 and the NSE showing values of 0.749, which indicate a good correlation and correspondence between observed and calculated flows, respectively. The D2C model can be highlighted with R 2 and NSE equal to 0.805, revealing a high correlation and similarity between calculated and observed values. It is worth noting that the generated RBIAS of −0.060 by model D2C, which is close to 0, corroborates with the previous metrics, indicating that the calculated values, in general, neither underestimated nor significantly overestimated the observed results. In addition, the NRMSE decreased to 0.417.
Finally, for dataset #3, which comprises the period from 2013 to 2019, it is possible to highlight satisfactory results. For the D3A model, the metrics revealed increased values of R 2 and NSE (0.821), showing excellent correlation and equivalence between calculated and observed flows; an RBIAS of −0.165, which is relatively low; and a not too high value of NRMSE (around 0.335). In the D3B model, the R 2 and NSE decreased to 0.757; RBIAS reduced, in absolute terms, to 0.082; and the NRMSE slightly increased to 0.472. The D3C model also presented high-performance results in this testing phase, with R 2 and NSE equal to 0.720, showing a high correlation between observed and calculated values. The RBIAS close to zero (0.047) suggests that, in general terms, the simulated flows by the D3C model presented correspondence to observed values. In addition, the NRMSE showed a value close to 0.50. Figure 7 displays the SOM hit maps for the testing phase. The hit maps are plots of the SOM layer, in which the output neurons reveal the number of input vectors classified by them. Except for model D1A, the networks activated output neurons with even distribution and explicit separation of clusters. This outcome indicates that the methodology used to define the SOM's size was efficient. Moriasi et al. (2015) provide a classification for the values of R 2 , NSE, and RBIAS metrics, widely used in the water resources area. In such ranking, considering a monthly scale, R 2 values are considered "very good" above 0.85, "good" between 0.75 and 0.85, "satisfactory" between 0.60 and 0.75, and "unsatisfactory" when it is less than 0.60. For the NSE, the values are considered "very good" when above 0.80, "good" between 0.70 and 0.80, "satisfactory" between 0.50 and 0.70, and "unsatisfactory" if less than 0.50. Finally, for RBIAS, the absolute values are considered "very good" when below 0.05, "good" between 0.05 and 0.10, "satisfactory" between 0.10 and 0.15, and "unsatisfactory" if greater than 0.15. When comparing the classification provided by Moriasi et al. (2015) with the performance of the SOM networks, it is noteworthy that the results generally concentrated between the ranges of "good" to "very good". Only the D1A model showed "unsatisfactory" values for the three metrics (R2, NSE, and RBIAS), probably due to the inhomogeneity of the time series.

Discussion
For the three studied cases, all subsets that used the first and last 35% of the data for calibration (D1C, D2C, and D3C) presented "satisfactory" to "very good" results of R 2 and NSE for the testing phase, and "very good" outcomes for all metrics considering calibration. Since neural networks learn from examples, they usually present low extrapolation performance. On the other hand, they are well suited for interpolation. The found metrics show high R 2 and NSE values and low RBIAS and NRMSE, confirming the good agreement between observed and simulated streamflows. The outstanding performance of the SOM models using satellite-based rainfalls confirms that this technique is a feasible option for rainfall-streamflow modeling and hydrological analyses, especially in regions with ungauged catchments.

Conclusions
This study demonstrated the efficiency and practicality of using TRMM rainfall products and Self-Organizing Maps for modeling monthly streamflow. The methodology was applied and validated over three datasets of inflows to a reservoir in the São Francisco River basin, Brazil. From the nine models proposed, eight presented statistical metrics that could be ranked from "good" to "very good" in the classification suggested by Moriasi et al. (2015). In general, the SOM networks trained with the first and the last 35% of the dataset (models D1C, D2C, and D3C) derived better outcomes, confirming their abilities to predict missing values, mainly for hydrological regions with precarious instrumentation. The results also showed that predictions were accurate even when using short time series, as observed in models D3A, D3B, and D3C.
The SOM's easiness for analysis and understanding of data (via maps of plane components) and the possibility of using incomplete input data for prediction purposes proved valuable for managing water resources. The presented models can also be tools for handling missing data in poorly monitored catchments, enabling studies and projects to be better developed.
The use of satellite-estimated rainfall data, which are promptly available, showed to be a viable alternative to rain gauge stations, which are especially limited in remote areas. In conclusion, this methodology could be a successful option for other catchments, allowing the hydrological modeling of different variables at several time scales.