Application of Machine Learning and Remote Sensing for Gap-filling Daily Precipitation Data of a Sparsely Gauged Basin in East Africa

Access to the spatiotemporal distribution of precipitation is needed in many hydrological applications. However, gauge records often have spatiotemporal gaps. To mitigate this, we considered three main approaches: (i) using remotely sensed and reanalysis precipitation products; (ii) machine learning-based approaches; and (iii) a gap-filling software package explicitly developed for filling the gaps of daily precipitation records. This study evaluated all approaches over a sparsely gauged basin in East Africa. Among the examined precipitation products, PERSIANN-CDR outperformed the others in terms of root mean squared error (7.3 mm) and correlation coefficient (0.46), while having a large relative bias (-50%) compared to the available in situ precipitation records. PERSIANN-CDR also demonstrated the highest skill in distinguishing rainy and non-rainy days. On the other hand, Random Forest outperformed all other approaches (including PERSIANN-CDR) with the smallest relative bias (-2%), the lowest root mean squared error (6.9 mm), and the highest correlation coefficient (0.53). Ways to fill in spatiotemporal gaps in gauge measurements were explored. Satellite and reanalysis, machine learning, and gap-filling software were investigated. Random Forest performed the best among all methods to fill in gaps.

and evidence (type, amount, quality, and consistency; IPCC 2013). One of the most common problems in precipitation time-series analyses is the presence of gaps of different lengths (Bellido-Jiménez et al. 2021). Gaps arise from erroneous manual data entry, equipment errors during data collection, data loss due to defective storage technologies, and so on (Tannenbaum 2009).
Gap-free time series are required for statistical and trend analysis (Farhangfar et al. 2008; Shen et al. 2015; Li et al. 2019). Gap-filling methods can be used to fill in the missing data. Three categories of gap-filling methods are investigated in this study: (i) machine learning-based methods; (ii) precipitation products; and (iii) daily precipitation gap-filling software. Machine learning-based methods are the most versatile approach due to the availability of powerful algorithms and improving access to more data. These methods can also be calibrated locally based on available records. However, they need a significant number of observations (Soley-Bori 2013). On the other hand, many precipitation products are available globally and can be easily accessed from online data sources, but they cannot be calibrated locally by end users. Finally, some software packages have been developed exclusively for gap-filling of daily precipitation based on the geostatistical and geospatial relationships among adjacent gauges. The inputs of such software are precipitation records and station locations. Machine learning-based imputation models have outperformed other approaches (Bellido-Jiménez et al. 2021), but their ability is often overlooked by the hydrological community (Gao et al. 2018).
The case study of this paper is Tanzania. Climate-related hazards such as droughts and floods are increasing in this country (United Republic of Tanzania 2012). Gap-free precipitation data are crucial for hydrological studies there, considering the steady increase in population and limited water resources, especially for food security purposes. A proper understanding of spatiotemporal variations of precipitation is necessary to ensure sustainable water resources management (Mashingia et al. 2014). Despite the importance of gap-free precipitation time series, only a limited number of in situ observations in Africa are readily available to the Global Telecommunication System (GTS) global data archives (Nicholson et al. 2003). This negatively affects the accuracy of global precipitation products (category ii above). Moreover, the other categories for filling gaps in daily precipitation data (machine learning and software) have not been studied or compared in Tanzania.
This study investigates the performance of the three gap-filling approaches mentioned above. Random Forest (RF) and Fully Connected Deep Neural Network (FCDNN) algorithms are selected as machine learning-based methods. This is motivated by the results of previous hydrological studies (Bellido-Jiménez et al. 2021; Portuguez-Maurtua et al. 2022; Kim and Ryu 2016). Also, well-known precipitation products, including Global Precipitation Climatology Centre (GPCC) V2020, Global Precipitation Climatology Project (GPCP) V1.3, Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks - Climate Data Record (PERSIANN-CDR), European Centre for Medium-Range Weather Forecasts Reanalysis V5 (ERA5), and Integrated Multi-satellitE Retrievals for GPM (IMERG) Final V6 (Table 1), are evaluated over the study area. Finally, the Reconstruction of Daily Data - Precipitation (reddPrec) software (Serrano-Notivoli et al. 2017) is chosen as the representative of gap-filling software due to its acceptable performance in previous studies (Serrano-Notivoli et al. 2018; Navarro et al. 2020; Merino et al. 2021). This software enables obtaining serially complete precipitation datasets, estimating new data at ungauged locations, and/or creating regular grids of daily precipitation based on original data containing missing values or even large data gaps.
In the upcoming sections, the study area, datasets, and methodology of each gap-filling approach are explained. Then, statistical metrics for evaluating the performance of the methods against in situ records are presented. Finally, the best method among the examined ones is determined based on its overall daily performance compared to daily precipitation observations. A spatial analysis of accuracy was also carried out to show how each method performs in different locations of the study area. The approach that best fills the gaps in daily precipitation data is then retained as the key outcome of the present study.

Study Area
In Tanzania, the extent of the ground-based precipitation network is not adequate to capture all the spatial rainfall variability (Mashingia et al. 2014). This country consists of nine main basins (Fig. 1). There are two types of rain gauges: the non-recording type gives only the total rainfall that occurred during a particular time, and the recording type gives hourly rainfall. Based on the World Meteorological Organization (WMO 2008, 2017) guidelines, the minimum density for non-recording rain gauges is between 250 and 900 km² per station (varying according to physiographic properties from mountains to coastal areas). The fifty-eight non-recording rain gauges used in this study are mainly located in the Internal Drainage Basin (IDB; Fig. 1). IDB is the second biggest basin, covering almost 20% of Tanzania (≈ 154,000 km²), so the coverage of each station in this basin is above 2,600 km² per station. The annual evapotranspiration rate over this region is 2,000 mm. The climate of the study area is mainly Tropical Savanna, and the seasons are divided into dry (June to October) and wet (November to May). The average annual precipitation in IDB ranges from 600 to 900 mm, but in the northeastern part (near the border of Kenya) it exceeds 1,000 mm. Almost all the rivers in this region are seasonal and flow from December to July, but they are often dry for the rest of the year. In the central and northeastern parts of IDB, there are volcanoes such as Mt. Hanang, Kilimanjaro (the highest mountain in Africa), and Ngorongoro crater. In the northern to central part of IDB, several large lakes are located, such as Lake Natron, Lake Manyara, and Lake Eyasi (JICA 2008).
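The station density quoted above can be checked with a few lines of arithmetic. The sketch below only uses the figures given in the text; the variable names are illustrative.

```python
# Figures quoted in the text; variable names are illustrative.
BASIN_AREA_KM2 = 154_000      # approximate area of the IDB
N_GAUGES = 58                 # non-recording gauges used in this study
WMO_RANGE_KM2 = (250, 900)    # WMO guideline range, km^2 per station

area_per_gauge = BASIN_AREA_KM2 / N_GAUGES
print(f"{area_per_gauge:.0f} km^2 per station")

# The basin is sparser than even the loosest end of the WMO range.
is_sparser_than_wmo = area_per_gauge > WMO_RANGE_KM2[1]
print(is_sparser_than_wmo)
```

Under these numbers each gauge covers roughly three times the area allowed at the sparse end of the WMO range.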

Datasets
A total of fifty-eight daily precipitation gauge records (Fig. 1) were analyzed. Data were provided by the Ministry of Water in Tanzania. The quality control of data, as shown in the Supplementary Material (SM; Figure SM1b), was performed based on the framework suggested in Wijngaard et al. (2003) and Ghajarnia et al. (2022). Statistical tests were utilized to exclude gauges with low-quality data. Two gauges were excluded (more details on the tests are given in the SM). In addition to in situ observations, five precipitation products were used in this study (Table 1). The current study was conducted for the 2000-2010 period because it has the highest overlap with precipitation products (limited by the availability of satellite products and gauge observations; Figure SM1a).

Fig. 1
Main nine sub-basins of Tanzania: 1-Internal drainage basin (IDB), 2-Lake Nyasa, 3-Lake Rukwa, 4-Lake Tanganyika, 5-Lake Victoria, 6-Pangani, 7-Rufiji, 8-Ruvuma South Coast, and 9-Wami Ruvu, with the location of rain gauges

Three gap-filling approaches were examined in this study: (i) the FCDNN and RF methods as two machine learning techniques; (ii) well-known, globally available precipitation products; and (iii) the reddPrec software developed for gap-filling of daily precipitation. Precipitation products are shown in Table 1. In the following sections, FCDNN, RF, and reddPrec are explained briefly. Table 2 summarizes the daily climate variables used as the inputs/features of the machine learning models. These meteorological inputs are taken from the Modern-Era Retrospective analysis for Research and Applications version 2 (MERRA-2), which has a spatial resolution of about 50 km. Models are trained with 70% of the available daily precipitation records (training set), the hyperparameters are tuned over 15% of the records (validation set), and the remaining data are used for independent evaluation of the machine learning models and other gap-filling approaches (test set).
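A 70/15/15 partition of the records can be sketched as follows. This is a minimal illustration, assuming a simple random shuffle of record indices; the function name, the fixed seed, and the record count of 1,000 are hypothetical and not taken from the study.

```python
import random

def split_indices(n_records, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle record indices and cut them into train/validation/test parts."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)   # seed is an assumption, for repeatability
    n_train = int(round(train_frac * n_records))
    n_val = int(round(val_frac * n_records))
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]       # whatever remains (about 15%)
    return train, val, test

# Hypothetical record count, chosen only to make the proportions easy to read.
train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Keeping the test indices untouched until the final comparison is what makes the evaluation independent of hyperparameter tuning.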

Fully Connected Deep Neural Networks (FCDNN)
FCDNN is a multilayer feed-forward neural network that is the simplest (Moghaddam et al. 2022) and one of the most common neural network forms (Partal and Kişi 2007). Each layer consists of several processing units (neurons). Each neuron is connected to the adjacent layers, with an individual weight assigned to each interlayer link. All inputs into a single neuron are multiplied by their associated weights and summed to form a single output. Each of these outputs is then subject to a nonlinear transformation referred to as the activation function. As a result, FCDNN can be represented as a nested set of functions. It is the superposition of many simple nonlinear functions that enables FCDNN to estimate nonlinear functions. FCDNN is fully connected, with each node connected to every node in the next and previous layers (Gardner and Dorling 1998). The number of layers, the number of nodes in each layer, the loss function, and the learning rate are among the hyperparameters that should be tuned for the FCDNN model in this study.

Random Forest (RF)
Random Forest (RF) was first introduced by Breiman (2001) as a supervised learning algorithm. Random forests are a combination of predictors (i.e., trees) such that each of them depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. Internal estimates monitor error, strength, and correlation, which are used to show the response to increasing the number of inputs used in each split; internal estimates are also used to measure variable importance. The number of trees and the minimum sample split are the hyperparameters tuned for the RF model.
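The idea of combining many randomized trees can be illustrated with a much-reduced stand-in: an ensemble of one-split regression trees (stumps), each fitted on a bootstrap sample and a randomly chosen feature, with predictions averaged. This is not a full Random Forest and not the model used in the study; all names and the toy data are hypothetical, and the minimum-sample-split hyperparameter has no counterpart in stumps.

```python
import random

def fit_stump(X, y, feature):
    """Fit a one-split regression tree (stump) on a single feature."""
    mean_y = sum(y) / len(y)
    best = (float("inf"), None, mean_y, mean_y)   # (sse, threshold, left, right)
    values = sorted(set(row[feature] for row in X))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [t for row, t in zip(X, y) if row[feature] <= thr]
        right = [t for row, t in zip(X, y) if row[feature] > thr]
        m_left, m_right = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - m_left) ** 2 for t in left)
               + sum((t - m_right) ** 2 for t in right))
        if sse < best[0]:
            best = (sse, thr, m_left, m_right)
    return feature, best[1], best[2], best[3]

def fit_forest(X, y, n_trees=25, seed=0):
    """Each stump sees a bootstrap sample and one randomly chosen feature."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        feature = rng.randrange(n_features)           # random feature choice
        forest.append(fit_stump([X[i] for i in rows],
                                [y[i] for i in rows], feature))
    return forest

def predict(forest, row):
    """Average the predictions of all stumps."""
    preds = []
    for feature, thr, m_left, m_right in forest:
        if thr is None or row[feature] <= thr:
            preds.append(m_left)
        else:
            preds.append(m_right)
    return sum(preds) / len(preds)

# Toy data: the target depends only on the first feature.
data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [10.0 if row[0] > 0.5 else 0.0 for row in X]
forest = fit_forest(X, y, n_trees=25)
wet = predict(forest, [0.9, 0.5])
dry = predict(forest, [0.1, 0.5])
print(wet > dry)
```

Here `n_trees` plays the role of the number-of-trees hyperparameter named in the text; a real implementation would grow deeper trees and consider several candidate features at every split.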

Gap-filling by the Reconstruction of Daily Data - Precipitation (reddPrec) Software
We selected the reddPrec software (www.cran.r-project.org/web/packages/reddPrec) because: (i) it applies comprehensive quality control to original daily precipitation datasets and flags suspicious data based on five predefined criteria; and (ii) it fills missing values in the original data series by estimating precipitation values using a number of nearest observations for each day. reddPrec creates daily reference values using all the data recorded at the nearest stations for each targeted day. Multivariate logistic regression is used to compute these reference values based on the nearest neighbors, with geographic and topographic variables as covariates. A threshold parameter sets the maximum distance (in km) within which nearest neighbors are searched (Serrano-Notivoli et al. 2017).

Evaluating the Performance of Gap-filling Approaches
Thirty percent of the available daily precipitation data (15,075 observations) were randomly nulled and split into the validation and test sets. The remaining data were used for training the machine learning methods (FCDNN, RF) and running the reddPrec software. The validation set was used for tuning the hyperparameters of the machine learning models. Gap-filling approaches were then compared over the test set. For consistency, precipitation products were compared with each other using the same validation data set. The best-performing precipitation product was selected based on several evaluation metrics. Finally, all approaches were compared with each other to select the best gap-filling approach in the study area.

Evaluation Metrics
The Pearson correlation coefficient (CC), the relative bias (Rbias), the root mean square error (RMSE), the probability of detection (POD), the false alarm ratio (FAR), and the Heidke skill score (HSS) metrics were used to evaluate the performance of each method. More details on each metric are presented in the Supplementary Materials file.
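The three continuous metrics can be sketched as below. The exact equations are given in the Supplementary Materials and are not reproduced here, so the forms used are the common textbook definitions and should be read as assumptions; in particular, Rbias is taken as the summed difference (estimate minus observation) divided by the summed observations, which matches the statement that a negative Rbias indicates underestimation. The sample values are invented.

```python
import math

def pearson_cc(obs, est):
    """Pearson correlation coefficient between observations and estimates."""
    n = len(obs)
    mean_o, mean_e = sum(obs) / n, sum(est) / n
    cov = sum((o - mean_o) * (e - mean_e) for o, e in zip(obs, est))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_e = sum((e - mean_e) ** 2 for e in est)
    return cov / math.sqrt(var_o * var_e)

def rbias_percent(obs, est):
    """Relative bias in percent; negative means underestimation (assumed form)."""
    return 100.0 * sum(e - o for o, e in zip(obs, est)) / sum(obs)

def rmse(obs, est):
    """Root mean square error, in the units of the data (mm here)."""
    return math.sqrt(sum((e - o) ** 2 for o, e in zip(obs, est)) / len(obs))

# Invented daily values (mm): the estimate is exactly half of each observation.
obs = [0.0, 2.0, 10.0, 4.0]
est = [0.0, 1.0, 5.0, 2.0]
cc_value = pearson_cc(obs, est)
rbias_value = rbias_percent(obs, est)
rmse_value = rmse(obs, est)
print(cc_value, rbias_value, rmse_value)
```

The example shows why the metrics must be read together: an estimate that is perfectly correlated with the observations can still carry a large negative bias.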

Results and Discussion
Among the examined precipitation products, PERSIANN-CDR outperformed the rest in some evaluation metrics (compare Fig. 2d and Figure SM2). PERSIANN-CDR has the lowest RMSE (7.3 mm), the highest correlation coefficient (0.46), and the highest HSS among all products. Although other precipitation products have a smaller Rbias than PERSIANN-CDR (e.g., the Rbias of IMERG is -32% compared to -50% for PERSIANN-CDR), PERSIANN-CDR was selected as the best-performing precipitation product because of its better scores in the other metrics.
Comparison of daily precipitation estimates by other evaluated methods (FCDNN, RF, and reddPrec) against gauge-based observations (Fig. 2) revealed that RF has the lowest RMSE (6.9 mm), the highest correlation coefficient (0.53), and least Rbias (-2%). Therefore, RF as a machine learning-based imputation method can improve evaluation metrics considerably compared to PERSIANN-CDR. This method is less biased than all examined precipitation products (Fig. 2c and Figure SM2). reddPrec has the least skill based on its high RMSE (14.2 mm) and low correlation coefficient (0.19).
Based on the histogram of observations in the validation set (Table SM1), 85% of daily precipitation values are less than 2.3 mm. Also, 90%, 95%, and 99% of precipitation records are less than 9.2, 16.1, and 39.1 mm, respectively. The frequency of precipitation above 51 mm is less than 0.5%. Most data in the scatter plots (Fig. 2) are inclined toward the x-axis (observation) for FCDNN, RF, and PERSIANN-CDR, which explains the negative Rbias values for these methods. Based on the Rbias equation (Eq. S3 in the SM), a negative Rbias means that the model underestimates compared to observations. On the other hand, the reddPrec method has a positive Rbias and the majority of data are inclined toward the y-axis (model estimate); thus, this method overestimates daily precipitation.
The standard deviations of daily precipitation for the observations, FCDNN, reddPrec, RF, and PERSIANN-CDR are 8.1, 2.3, 13.6, 4.1, and 3.8 mm/day, respectively. FCDNN, RF, and PERSIANN-CDR have lower standard deviations than the observations. Estimates by these methods are mainly less than 20 mm/day, while observations show that the frequency of precipitation between 20 and 50 mm/day is considerable (x-axis in Fig. 2). The standard deviation of reddPrec is higher than that of the observations, and the points are more scattered toward the y-axis (Fig. 2b).
The reddPrec software has a parameter ("thres") to search for the nearest stations within a specific distance. If this parameter is set to "NA", the software will search for the 10 nearest observations without a distance limit (more detail at https://cran.r-project.org/web/packages/reddPrec/index.html). In the IDB, 92% of the stations are located 0-400 km from each other (Table SM2). Therefore, we investigated the effect of different "thres" values (i.e., "NA", 50, 100, 200, 300, and 400 km) on the performance of reddPrec. Based on Figure SM3, it was found that in the sparsely gauged IDB, introducing lower values for "thres" improves reddPrec performance slightly. However, this minor improvement in performance compromises the number of gaps filled by reddPrec. When "thres" is defined as "NA", the majority of the validation set is filled by reddPrec (95%). However, at low values of "thres", less than 60% of gaps are filled. Thus, we decided to set "thres" to "NA" (results in Figs. 2, 3 and 4 are based on this assumption).
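The effect of a distance limit on neighbor selection can be illustrated with a small sketch. This is not reddPrec code or its API (reddPrec is an R package); it is a hypothetical Python illustration in which `None` stands in for "NA", distances are great-circle distances, and the station names and coordinates are invented.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def select_neighbours(target, stations, k=10, thres=None):
    """Return up to k nearest stations as (distance, name), optionally within thres km."""
    ranked = sorted(
        (haversine_km(target[0], target[1], lat, lon), name)
        for name, (lat, lon) in stations.items()
    )
    if thres is not None:
        ranked = [(d, name) for d, name in ranked if d <= thres]
    return ranked[:k]

# Invented coordinates (latitude, longitude) for three hypothetical gauges.
stations = {"A": (-4.1, 35.1), "B": (-5.0, 36.0), "C": (-7.0, 38.0)}
target = (-4.0, 35.0)

no_limit = select_neighbours(target, stations, thres=None)
within_50 = select_neighbours(target, stations, thres=50)
print([name for _, name in no_limit], [name for _, name in within_50])
```

With a tight limit, a gauge in a sparse network may be left with too few neighbors to estimate a value, which is consistent with the drop in the fraction of filled gaps reported above.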
Calculation of the categorical metrics (POD, HSS, and FAR) using different precipitation thresholds (0, 0.1, 0.5, 1, 2, and 5 mm/day) for distinguishing between rainy and non-rainy days revealed critical information on the capacity of each method in detecting precipitation events (Fig. 3). Among the imputation methods, RF and FCDNN have the best POD (true hits divided by the sum of true hits and misses as shown in Eq. S4 in SM). However, in low thresholds (0-0.5 mm/day), RF has higher FAR (false alarms divided by the sum of true hits and false alarms as shown in Eq. S5 in SM) compared to other methods. As the threshold increases, the performance of the RF model improves indicating that the RF model has problems capturing low precipitation rates.
The HSS metric, which aggregates the effects of FAR and POD, was investigated to consider both directions in the contingency table (true hits and false alarms as well as true hits and misses). Therefore, HSS is a good score to quantify the trade-off between FAR and POD. At low thresholds (0-0.5 mm/day), the trade-off between the high FAR and high POD of RF has led to the lowest HSS values for this method. Although RF is powerful in correctly detecting rainy events (high POD), it wrongly reports many days as rainy (high FAR). Therefore, the overall strength of RF for distinguishing between rainy and non-rainy days is very low compared to other methods at low thresholds. Increasing the threshold improves RF performance considerably, so this method achieves the highest HSS among all methods for a 5 mm/day threshold.
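The categorical scores can be computed from a 2x2 contingency table, as sketched below. POD and FAR follow the definitions quoted in the text; the HSS expression is the standard textbook form and is an assumption here, since the equation in the Supplementary Materials is not reproduced. Treating a day as rainy when precipitation is strictly greater than the threshold is also an assumption, and the sample values are invented.

```python
def contingency(obs, est, threshold):
    """Count hits, false alarms, misses, and correct negatives.

    A day counts as rainy when precipitation exceeds the threshold (assumed).
    """
    hits = false_alarms = misses = correct_negatives = 0
    for o, e in zip(obs, est):
        obs_rain, est_rain = o > threshold, e > threshold
        if obs_rain and est_rain:
            hits += 1
        elif est_rain:
            false_alarms += 1
        elif obs_rain:
            misses += 1
        else:
            correct_negatives += 1
    return hits, false_alarms, misses, correct_negatives

def pod(a, b, c, d):
    """Probability of detection: hits / (hits + misses)."""
    return a / (a + c)

def far(a, b, c, d):
    """False alarm ratio: false alarms / (hits + false alarms)."""
    return b / (a + b)

def hss(a, b, c, d):
    """Heidke skill score (standard form, assumed); 1 is perfect, 0 is no skill."""
    return 2.0 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d))

# Invented daily values (mm/day); assumes every denominator is non-zero.
obs = [0.0, 0.0, 3.0, 6.0, 0.0, 1.5, 0.0, 8.0]
est = [0.0, 0.6, 2.0, 7.0, 0.0, 0.2, 0.0, 4.0]
table = contingency(obs, est, threshold=0.5)
print(table, pod(*table), far(*table), hss(*table))
```

Repeating the calculation for each threshold (0, 0.1, 0.5, 1, 2, and 5 mm/day) gives the kind of curves compared across methods in the text.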
The reddPrec method has consistently low POD and high FAR. Therefore, this method has the lowest HSS among others. On the other hand, FCDNN and PERSIANN-CDR have consistently high POD and low FAR (specifically in thresholds less than 2 mm/day). Thus, these methods have the highest HSS. However, in higher thresholds, the performance of FCDNN decreases since its FAR increases considerably compared to little improvement in POD. Based on the HSS value, the PERSIANN-CDR performs best when the threshold is set to 1 mm/day (the highest HSS is 0.64).
The spatial distribution of error for stations is displayed in Fig. 4. The majority of stations (62%) filled by the RF method (Fig. 4c) have an absolute error of less than 0.5 mm/day. After RF, reddPrec (Fig. 4b) has the highest number of stations with low error (-0.5 to 0.5 mm/day, shown by green points). However, for reddPrec, overestimation in other stations (big red points in Fig. 4b) has affected overall accuracy (Fig. 2b). FCDNN and PERSIANN-CDR mainly underestimate daily precipitation, although the performance of PERSIANN-CDR is better (Fig. 4a and d). It is noteworthy that the mean error for all stations is very close to zero for RF as the most accurate model.
All examined methods in this research have limitations. Although reddPrec has the least accuracy among all other methods, the spatial distribution of error in the stations reveals that this method is more successful in filling the gaps of data in many locations compared to FCDNN and PERSIANN-CDR. The "thres" parameter is very important in reddPrec as it determines which adjacent stations can be used for gap-filling. IDB is a sparsely gauged basin, so the "thres" had to be set to "NA" to fill the highest ratio of gaps. This will introduce more uncertainty/error into the reddPrec method because distant stations are allowed to be utilized in the gap-filling process. Even though the "thres" was set to "NA", 5% of gaps in the test set were not filled by reddPrec. High error values in some stations may be attributed to what is described above.
The study area is a data-deprived region with many gaps in the observations. In other words, we do not have continuous time series at the gauges; hence, options for the selection of machine learning algorithms are limited. Many algorithms such as convolutional neural networks could not be used because of the lack of data, so we have included only a fully connected neural network in our models. Both machine learning-based methods use gridded meteorological parameters (Table 2, from MERRA-2). All precipitation products are similarly gridded data. Therefore, values of the nearest cell to each rain gauge were used to fill the gaps. Attributing the values of the whole cell to a point will introduce uncertainties/errors in estimations, especially in a low-resolution dataset, as the cell values represent the entire cell, not the gauge location. Finally, it should be noted that most of the daily precipitation records are less than 1 mm/day although the range of daily precipitation reaches 100 mm/day. The high concentration of low values (less than 1 mm) in the records used for training RF and FCDNN led to low variance in their predictions. Thus, FCDNN and RF could not estimate precipitation higher than 10 and 20 mm/day, respectively. This can also be attributed to the limited number of samples representing higher precipitation rates in the training set, inaccuracy of the high precipitation rates in the reference set, and/or a lack of relevant information in the features to represent extreme precipitation rates.

Conclusions
This study investigated the performance of three precipitation gap-filling approaches over a sparsely gauged region in Tanzania: (i) precipitation products including GPCC V2020, GPCP V1.3, PERSIANN-CDR, ERA5, and IMERG Final V6; (ii) machine learning-based imputation approaches, namely Fully Connected Deep Neural Network and Random Forest; and (iii) daily precipitation gap-filling software, namely reddPrec. Based on available data in the study area, 2000-2010 was selected as the study period because of the highest overlap between in situ records and satellite/reanalysis data. To evaluate the performance of each approach, we utilized 30% of the available rain gauge records (15,075 observations) for validation and testing, and the rest of the data was used for training the machine learning-based methods (category ii) and running the reddPrec software (category iii). Evaluation of precipitation products (category i) against the test set revealed that PERSIANN-CDR has the best performance compared to the other examined precipitation products (the lowest RMSE and the highest correlation coefficient). Also, PERSIANN-CDR has the best performance in detecting rainy days based on HSS among all examined gap-filling methods. However, other precipitation products are less biased than PERSIANN-CDR (based on Rbias). This study showed that machine learning-based gap-filling methods trained with meteorological data (from MERRA-2) have better overall performance than the other methods. Random Forest has a lower bias than PERSIANN-CDR and is the best-performing method in the study area. RF also has the lowest RMSE, highest correlation coefficient, and lowest bias among all examined methods/precipitation products. The main difference between the trained Random Forest model and global precipitation products (e.g., PERSIANN-CDR) is the utilization of a higher number of in situ records in the training process.
The accuracy of global precipitation products suffers from the lack of in situ data in the calibration process, especially in developing countries. In these countries, the contribution of in situ records to international data centers (e.g., the Global Telecommunication System) is very low. The Global Telecommunication System, for example, is a source of calibration for many precipitation products. Consequently, the accuracy of precipitation products (e.g., GPCC, used for bias adjustment in PERSIANN-CDR, GPCP, and IMERG) is negatively affected by the lack of in situ data.

Funding This work is supported by the University of Oulu, Finland. Open Access funding provided by University of Oulu including Oulu University Hospital.