1 Introduction

Precipitation is probably the most important component of the hydrological cycle (Eltahir 1998; Oki and Kanae 2006; Saemian et al. 2021; Sun et al. 2018; Ghajarnia et al. 2022). In addition to being the main source of renewable water, precipitation is also critical for the socioeconomic development of nations, especially African countries that depend on rain-fed agriculture (Dinku et al. 2007; Awange et al. 2016). In recent years, due to climate change, most African regions have experienced high precipitation variability that has led to frequent droughts and floods. A recent study (Alahacoon et al. 2021) has shown that the majority of African countries have experienced significant variation in long-term (1983–2020) precipitation. Therefore, reliable and consistent precipitation estimates for water resources monitoring are vital in Africa (Awange et al. 2016). Nonetheless, studies evaluating precipitation products (with reference to precipitation gauges) in Africa are limited by the lack of in situ observations (Awange et al. 2016; Romilly and Gebremichael 2011; Owusu et al. 2019; Echeta et al. 2022; Logah et al. 2021; Pérez-Alarcón and Fernández-Alvarez 2022).

Accurate estimation of precipitation is challenging (Adhikari et al. 2020; Foufoula-Georgiou et al. 2020; Akbari et al. 2022) due to different sources of uncertainty associated with different estimation methods and high spatiotemporal variability in complex topographical regions (Beck et al. 2019; Adhikari and Behrangi 2021; Akbari et al. 2020). Due to the localized nature of precipitation patterns, especially for extreme events, precipitation gauges provide the most accurate precipitation measurements (Sun et al. 2018). However, they are geographically sparse, especially in remote areas (Huffman and Bolvin 2013; Akbari et al. 2019), and suffer from sampling, measurement, and undercatch errors (Rasmussen et al. 2012; Ehsani and Behrangi 2022). Remote sensing (RS) precipitation products mitigate some of the shortcomings of the gauges by incorporating observations from thermal infrared, passive microwave, and radar instruments (Huffman and Bolvin 2013). However, remote sensing products may have significant biases due to systematic and random errors in their retrieval algorithms (Ehsani et al. 2021), inadequate spatial and temporal sampling (Sun et al. 2018), relatively poor performance over snow and ice surfaces (Ferraro et al. 2013; Rahimi et al. 2022), and relatively short data records (Sadeghi et al. 2019). Reanalysis precipitation products benefit from data assimilation systems that incorporate available observations (in situ and remotely sensed data) into numerical models (Morales-Moraga et al. 2019). Although reanalysis products enable an extended temporal estimation (e.g., over 40 years) of precipitation (Morales-Moraga et al. 2019), their reliability depends on observational constraints, which can vary significantly over space and time (Dee et al. 2016; Rahimi et al. 2021). Ground-based precipitation radars provide near real-time coverage with a high spatiotemporal resolution (Sokol et al. 2021). However, they are only available in a few countries and have limitations such as the interference of the Earth’s curvature with the beam at long distances (Sebastianelli et al. 2013) and the high cost of the equipment (Sokol et al. 2021). All precipitation products, including RS, radar, and reanalysis products, depend on gauge records for high-quality retrievals and bias adjustment (Sun et al. 2018).

Accurate precipitation estimation is crucial for climate studies, trend analysis, water resources management, hydrological forecasting, and so on (Jiang et al. 2012; Liu et al. 2017). However, precipitation observations are qualified by the IPCC as of medium confidence (IPCC 2013). The confidence metric provides a qualitative synthesis of the IPCC expert team’s judgment about the validity of a finding based on the level of agreement and evidence (type, amount, quality, and consistency; IPCC 2013). One of the most common problems in precipitation time-series analyses is the presence of gaps with different lengths (Bellido-jiménez et al. 2021). Gaps are due to erroneous manual data entry, equipment errors during the data collection, data loss due to defective storage technologies, and so on (Tannenbaum 2009).

Gap-free time series are required for statistical and trend analysis (Farhangfar et al. 2008; Shen et al. 2015; Li et al. 2019). Gap-filling methods can be used to fill in the missing data. Three categories of gap-filling methods are investigated in this study: (i) machine learning-based methods; (ii) precipitation products; and (iii) daily precipitation gap-filling software. Machine learning-based methods are the most versatile approach due to the availability of powerful algorithms and improving access to data. These methods can also be calibrated locally based on available records. However, they need a large number of observations (Soley-Bori 2013). Precipitation products, on the other hand, are available globally and can be easily accessed from online data sources, but they cannot be calibrated locally by end users. Finally, some software packages have been developed exclusively for gap-filling daily precipitation based on the geostatistical and geospatial relationships among adjacent gauges. The inputs of such software are precipitation records and station locations. Machine learning-based imputation models have outperformed other approaches (Bellido-jiménez et al. 2021), but their ability is often overlooked by the hydrological community (Gao et al. 2018).

The case study of this paper is Tanzania. Climate-related hazards such as droughts and floods are increasing in this country (United Republic of Tanzania 2012). Gap-free precipitation data are crucial for hydrological studies there, especially for food security purposes, considering the steady increase in population and the limited water resources. A proper understanding of the spatiotemporal variations of precipitation is necessary to ensure sustainable water resources management (Mashingia et al. 2014). Despite the importance of gap-free precipitation time series, only a limited number of in situ observations in Africa are readily available to the Global Telecommunication System (GTS) global data archives (Nicholson et al. 2003). This negatively affects the accuracy of global precipitation products (gap-filling category ii above). Moreover, the other two categories for gap-filling daily precipitation data (machine learning and software) have not been studied or compared in Tanzania.

This study investigates the performance of the three gap-filling approaches mentioned above. Random Forest (RF) and Fully Connected Deep Neural Network (FCDNN) algorithms are selected as machine learning-based methods. This choice is motivated by the results of previous hydrological studies (Bellido-jiménez et al. 2021; Portuguez-Maurtua et al. 2022; Kim and Ryu 2016). Also, well-known precipitation products, including Global Precipitation Climatology Centre (GPCC) V2020, Global Precipitation Climatology Project (GPCP) V1.3, Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks - Climate Data Record (PERSIANN-CDR), European Centre for Medium-Range Weather Forecasts Reanalysis V5 (ERA5), and Integrated Multi-satellitE Retrievals for GPM (IMERG) Final V6 (Table 1), are evaluated over the study area. Finally, the Reconstruction of Daily Data - Precipitation (reddPrec) software (Serrano-Notivoli et al. 2017) is chosen as the representative of gap-filling software due to its acceptable performance in previous studies (Serrano-Notivoli et al. 2017, 2018; Navarro et al. 2020; Merino et al. 2021). This software enables obtaining serially complete precipitation datasets, estimating new data at ungauged locations, and/or creating regular grids of daily precipitation based on original data containing missing values or even large data gaps. In the upcoming sections, the study area, datasets, and methodology of each gap-filling approach are explained. Then, statistical metrics for evaluating the performance of the methods against in situ records are presented. Finally, the best method among those examined is determined based on its overall performance against daily precipitation observations. A spatial analysis of accuracy is also carried out to show how each method performs in different locations of the study area. The approach that best fills the gaps in daily precipitation data is then retained as the key outcome of the present study.

Table 1 Evaluated precipitation products in this study over the IDB

2 Study Area, Methodology, and Datasets

2.1 Study Area

In Tanzania, the extent of the ground-based precipitation network is not adequate to capture the full spatial variability of rainfall (Mashingia et al. 2014). The country consists of nine main basins (Fig. 1). There are two types of rain gauges: the non-recording type gives only the total rainfall that occurred during a particular period, and the recording type gives hourly rainfall. Based on the World Meteorological Organization (WMO 2008, 2017) guidelines, the minimum density for non-recording rain gauges is between 250 and 900 km² per station (varying according to physiographic properties from mountains to coastal areas). The fifty-eight non-recording rain gauges used in this study are mainly located in the Internal Drainage Basin (IDB; Fig. 1). The IDB is the second-largest basin, covering almost 20% of Tanzania (≈ 154,000 km²), so each station in this basin covers more than 2,600 km², well above the WMO range. The annual evapotranspiration rate over this region is 2,000 mm. The climate of the study area is mainly Tropical Savanna, and the seasons are divided into dry (June to October) and wet (November to May). The average annual precipitation in the IDB ranges from 600 to 900 mm, but the northeastern part (near the border with Kenya) receives more than 1,000 mm. Almost all the rivers in this region are seasonal: they flow from December to July and are often dry for the rest of the year. In the central and northeastern parts of the IDB, there are volcanoes such as Mt. Hanang, Kilimanjaro (the highest mountain in Africa), and Ngorongoro crater. Several large lakes, such as Lake Natron, Lake Manyara, and Lake Eyasi, are located in the northern to central part of the IDB (JICA 2008).
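The density comparison above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in this section; the variable names are illustrative, and the "gauges needed" figure is a back-of-the-envelope estimate based on the loosest WMO bound, not a value reported by the study.

```python
# Figures quoted in the text above.
basin_area_km2 = 154_000          # approximate area of the IDB
n_gauges = 58                     # non-recording gauges used in the study
wmo_loosest_km2_per_gauge = 900   # upper end of the 250-900 km² WMO range

# Average area represented by a single gauge in the IDB.
area_per_gauge = basin_area_km2 / n_gauges

# How far the network is from the loosest WMO bound.
shortfall_factor = area_per_gauge / wmo_loosest_km2_per_gauge

# Ceiling division: gauges required to reach 900 km² per station.
gauges_needed = -(-basin_area_km2 // wmo_loosest_km2_per_gauge)

print(f"{area_per_gauge:.0f} km² per gauge")
print(f"{shortfall_factor:.1f} times the loosest WMO bound")
print(f"about {gauges_needed} gauges needed for 900 km² per station")
```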

2.2 Datasets

Records from a total of fifty-eight daily precipitation gauges (Fig. 1) were analyzed. The data were provided by the Ministry of Water in Tanzania. Quality control of the data, as shown in the Supplementary Material (SM; Figure SM1b), was performed based on the framework suggested in Wijngaard et al. (2003) and Ghajarnia et al. (2022). Statistical tests were used to exclude gauges with low-quality data, and two gauges were excluded (more details on the tests are given in the SM). In addition to in situ observations, five precipitation products were used in this study (Table 1). The study was conducted for the 2000–2010 period because it has the highest overlap between gauge observations and precipitation products (limited by the availability of satellite products and gauge observations; Figure SM1a).

Fig. 1 Main nine sub-basins of Tanzania: 1-Internal drainage basin (IDB), 2-Lake Nyasa, 3-Lake Rukwa, 4-Lake Tanganyika, 5-Lake Victoria, 6-Pangani, 7-Rufiji, 8-Ruvuma South Coast, and 9-Wami Ruvu, with the location of rain gauges

3 Methodology

Three gap-filling approaches were examined in this study: (i) the FCDNN and RF methods as two machine learning techniques; (ii) well-known precipitation products available globally; and (iii) the reddPrec software developed for gap-filling of daily precipitation. The precipitation products are shown in Table 1. In the following sections, FCDNN, RF, and reddPrec are explained briefly.

3.1 Gap-filling by Machine Learning Algorithms

Table 2 summarizes the daily climate variables used as the inputs/features of the machine learning models. Models are trained with 70% of the available daily precipitation records (training set), while the hyperparameters are tuned over 15% of the records (validation set), and the remaining data is used for independent evaluation of the machine learning models and other gap-filling approaches (test set). These meteorological inputs are taken from the Modern-Era Retrospective analysis for Research and Applications version 2 (MERRA-2) which has about 50 km spatial resolution.

Table 2 Meteorological parameters utilized as inputs of the gap-filling methods by machine learning

3.1.1 Fully Connected Deep Neural Networks (FCDNN)

FCDNN is a multilayer feed-forward neural network that is the simplest (Moghaddam et al. 2022) and one of the most common forms of neural network (Partal and Kişi 2007). Each layer consists of several processing units (neurons). Each neuron is connected to the adjacent layers, with an individual weight assigned to each interlayer link. All inputs into a single neuron are multiplied by their associated weights and summed to form a single output. Each of these outputs is then subject to a nonlinear transformation referred to as the activation function. As a result, FCDNN can be represented as a nested set of functions. It is the superposition of many simple nonlinear functions that enables FCDNN to estimate non-linear functions. FCDNN is fully connected, with each node connected to every node in the next and previous layers (Gardner and Dorling 1998). The number of layers, the number of nodes in each layer, the loss function, and the learning rate are among the hyperparameters tuned for the FCDNN model in this study.
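The nested-function view of a fully connected network can be made concrete with a short forward pass. The following is a minimal sketch in plain Python, not the model used in this study: the layer sizes, the ReLU activation, the bias terms, the random initial weights, and the four input values are all assumptions chosen for illustration, and no training step is shown.

```python
import random

def relu(z):
    """Nonlinear activation applied to each neuron's weighted sum."""
    return max(0.0, z)

def dense(inputs, weights, biases, activation):
    """One fully connected layer; weights[j][i] links input i to neuron j."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(features, layers):
    """Compose the layers: ReLU hidden layers followed by a linear output."""
    hidden = features
    for weights, biases in layers[:-1]:
        hidden = dense(hidden, weights, biases, relu)
    weights, biases = layers[-1]
    return dense(hidden, weights, biases, lambda z: z)[0]

rng = random.Random(0)
layer_sizes = [4, 8, 8, 1]  # hypothetical: 4 inputs, two hidden layers, 1 output
layers = [init_layer(a, b, rng) for a, b in zip(layer_sizes, layer_sizes[1:])]

# Four made-up, already-scaled meteorological features for one day.
estimate = forward([0.2, -1.0, 0.5, 0.3], layers)
print(estimate)
```

In practice the weights would be fitted by minimizing the chosen loss function on the training set, with the number of layers, nodes, and the learning rate selected on the validation set.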

3.1.2 Random Forest (RF)

Random Forest (RF) was first introduced by Breiman (2001) as a supervised learning algorithm. A random forest is a combination of predictors (i.e., trees) such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. Internal estimates are used to measure variable importance and to monitor error, strength, and correlation, which show the response to increasing the number of inputs used for splitting. The number of trees and the minimum sample split are the hyperparameters tuned for the RF model.
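The idea of combining many randomized trees can be illustrated with a deliberately simplified stand-in: bootstrap aggregation of one-split regression trees, each restricted to one randomly chosen feature. This is not a full random forest and not the model tuned in this study; the toy data, feature meanings, and function names are hypothetical.

```python
import random

def fit_stump(rows, targets, feature):
    """Fit the best single split on one feature by minimizing squared error."""
    best = None
    values = sorted({r[feature] for r in rows})
    for lo, hi in zip(values, values[1:]):
        cut = (lo + hi) / 2
        left = [t for r, t in zip(rows, targets) if r[feature] <= cut]
        right = [t for r, t in zip(rows, targets) if r[feature] > cut]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, cut, lm, rm)
    if best is None:  # feature is constant in this bootstrap sample
        mean = sum(targets) / len(targets)
        return (feature, float("inf"), mean, mean)
    return (feature, best[1], best[2], best[3])

def fit_forest(rows, targets, n_trees, rng):
    """Each tree sees a bootstrap sample and one randomly chosen feature."""
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feature = rng.randrange(n_features)
        forest.append(
            fit_stump([rows[i] for i in idx], [targets[i] for i in idx], feature)
        )
    return forest

def predict(forest, row):
    """Average the predictions of all trees."""
    preds = [lm if row[f] <= cut else rm for f, cut, lm, rm in forest]
    return sum(preds) / len(preds)

# Toy data: [relative humidity (0-1), air temperature (°C)] -> rainfall (mm/day).
rows = [[0.2, 30], [0.3, 29], [0.4, 28], [0.7, 24], [0.8, 23], [0.9, 22]]
targets = [0.0, 0.0, 0.5, 5.0, 8.0, 12.0]

forest = fit_forest(rows, targets, n_trees=50, rng=random.Random(0))
wet_day, dry_day = [0.95, 21], [0.1, 31]
print(predict(forest, wet_day), predict(forest, dry_day))
```

Because every prediction is an average of training-set means, the output can never exceed the largest rainfall value seen during training, which is one reason tree ensembles tend to underestimate extremes.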

3.2 Gap-filling by the Reconstruction of Daily Data – Precipitation (reddPrec) Software

We selected the reddPrec software (www.cran.r-project.org/web/packages/reddPrec) because: (i) it applies comprehensive quality control to the original daily precipitation datasets and flags suspicious data based on five predefined criteria; and (ii) it fills missing values in the original data series by estimating precipitation values from a number of nearest observations for each day. reddPrec creates daily reference values using all the data recorded at the nearest stations for each targeted day. Multivariate logistic regression is used to compute these reference values based on the nearest neighbors, with geographic and topographic variables as covariates. A threshold parameter sets the maximum distance (in km) within which the nearest neighbors are searched (Serrano-Notivoli et al. 2017).
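The role of the distance threshold in the neighbor search can be sketched independently of the package. The code below is not the reddPrec implementation or its API; it is a minimal Python illustration in which the function names, the great-circle distance, the default of ten neighbors, and the station coordinates are assumptions.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    radius_km = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def nearest_neighbours(target, stations, k=10, thres_km=None):
    """Return up to k station ids closest to target.

    thres_km=None imitates an unlimited search radius; otherwise stations
    farther than thres_km are dropped, so fewer than k may be returned.
    """
    ranked = sorted(
        (haversine_km(target[0], target[1], lat, lon), sid)
        for sid, (lat, lon) in stations.items()
    )
    return [sid for dist, sid in ranked[:k] if thres_km is None or dist <= thres_km]

# Hypothetical station coordinates (latitude, longitude in degrees).
stations = {"A": (-3.4, 35.8), "B": (-4.0, 35.0), "C": (-6.0, 34.5)}
target = (-3.5, 36.0)

print(nearest_neighbours(target, stations))                # no distance limit
print(nearest_neighbours(target, stations, thres_km=100))  # 100 km limit
```

A tighter threshold keeps only nearby, presumably more representative, stations, but in a sparse network it can leave a target day with too few neighbors to estimate a value at all.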

3.3 Evaluating the Performance of Gap-filling Approaches

Thirty percent of the available daily precipitation data (15,075 observations) were randomly nulled and split into the validation and test sets. The remaining data were used for training the machine learning methods (FCDNN and RF) and running the reddPrec software. The validation set was used for tuning the hyperparameters of the machine learning models, and the gap-filling approaches were then compared over the test set. For consistency, the precipitation products were compared with each other using the same validation set, and the best-performing product was selected based on several evaluation metrics. Finally, all approaches were compared with each other to select the best gap-filling approach in the study area.
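The random hold-out described above can be sketched as follows. This is an illustration only: the total of 50,250 records is inferred from 15,075 being 30% of the data rather than stated in the text, the equal split between validation and test follows the 15%/15% proportions given in Section 3.1, and the seed and function name are arbitrary.

```python
import random

def split_indices(n_records, rng, holdout_percent=30):
    """Randomly assign record indices to training, validation, and test sets."""
    indices = list(range(n_records))
    rng.shuffle(indices)
    n_holdout = n_records * holdout_percent // 100  # records to be nulled
    n_val = n_holdout // 2                          # half for hyperparameter tuning
    val = indices[:n_val]
    test = indices[n_val:n_holdout]                 # rest for final comparison
    train = indices[n_holdout:]
    return train, val, test

n_records = 50_250  # inferred: 15,075 / 0.30
train, val, test = split_indices(n_records, random.Random(42))

print(len(train), len(val), len(test))
```

The observed values at the validation and test indices are hidden from every gap-filling method and only used afterwards as the reference for the evaluation metrics.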

3.3.1 Evaluation Metrics

The Pearson correlation coefficient (CC), the relative bias (Rbias), the root mean square error (RMSE), the probability of detection (POD), the false alarm ratio (FAR), and the Heidke skill score (HSS) metrics were used to evaluate the performance of each method. More details on each metric are presented in the Supplementary Materials file.
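For readers without access to the supplementary file, the three continuous metrics can be sketched as below. The formulas are the standard textbook definitions and are assumed, not confirmed, to match those in the SM; in particular, Rbias is written here as the summed error divided by the summed observations, so that negative values indicate underestimation. The sample values are made up.

```python
import math

def pearson_cc(obs, est):
    """Pearson correlation coefficient between observations and estimates."""
    mo, me = sum(obs) / len(obs), sum(est) / len(est)
    cov = sum((o - mo) * (e - me) for o, e in zip(obs, est))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_e = sum((e - me) ** 2 for e in est)
    return cov / math.sqrt(var_o * var_e)

def rbias_percent(obs, est):
    """Relative bias in percent; negative means the method underestimates."""
    return 100.0 * sum(e - o for o, e in zip(obs, est)) / sum(obs)

def rmse(obs, est):
    """Root mean square error, in the units of the data (mm/day here)."""
    return math.sqrt(sum((e - o) ** 2 for o, e in zip(obs, est)) / len(obs))

# Made-up daily values (mm/day) for four held-out days.
observed = [0.0, 2.0, 4.0, 10.0]
estimated = [0.0, 1.0, 3.0, 6.0]

print(pearson_cc(observed, estimated))
print(rbias_percent(observed, estimated))
print(rmse(observed, estimated))
```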

4 Results and Discussion

Among the examined precipitation products, PERSIANN-CDR outperformed the rest in some evaluation metrics (compare Fig. 2d and Figure SM2). PERSIANN-CDR has the lowest RMSE (7.3 mm), the highest correlation coefficient (0.46), and the highest HSS among all products. Although other precipitation products have a smaller Rbias in magnitude than PERSIANN-CDR (e.g., the Rbias of IMERG is -32% compared with -50% for PERSIANN-CDR), PERSIANN-CDR was selected as the best-performing precipitation product because of its better scores in the other metrics.

Comparison of daily precipitation estimates by the other evaluated methods (FCDNN, RF, and reddPrec) against gauge-based observations (Fig. 2) revealed that RF has the lowest RMSE (6.9 mm), the highest correlation coefficient (0.53), and the smallest Rbias (-2%). Therefore, RF as a machine learning-based imputation method can improve the evaluation metrics considerably compared to PERSIANN-CDR. This method is less biased than all examined precipitation products (Fig. 2c and Figure SM2). reddPrec has the least skill, with a high RMSE (14.2 mm) and a low correlation coefficient (0.19).

Based on the histogram of observations in the validation set (Table SM1), 85% of daily precipitation values are less than 2.3 mm. Also, 90%, 95%, and 99% of precipitation records are less than 9.2, 16.1, and 39.1 mm, respectively. The frequency of precipitation above 51 mm is less than 0.5%. Most data in the scatter plots (Fig. 2) are inclined toward the x-axis (observation) for FCDNN, RF, and PERSIANN-CDR which explains the negative values for Rbias for these methods. Based on the Rbias equation (Eq. S3 in the SM), negative Rbias means that the model underestimates compared to observations. On the other hand, the reddPrec method has a positive Rbias and the majority of data are inclined toward the y-axis (model estimate); thus, this method is overestimating daily precipitation.

The standard deviations of daily precipitation for the observations, FCDNN, reddPrec, RF, and PERSIANN-CDR are 8.1, 2.3, 13.6, 4.1, and 3.8 mm/day, respectively. FCDNN, RF, and PERSIANN-CDR have lower standard deviations than the observations. Estimates by these methods are mainly less than 20 mm/day, while the observations show that the frequency of precipitation between 20 and 50 mm/day is considerable (x-axis in Fig. 2). The standard deviation of reddPrec is higher than that of the observations, and its points are more scattered toward the y-axis (Fig. 2b).

Fig. 2 Scatter plots of the observed daily precipitation (mm/day) and estimated values by (a) FCDNN, (b) reddPrec, (c) RF, and (d) PERSIANN-CDR (the color bar shows the density of points in the plot)

The reddPrec software has a parameter (“thres”) to search nearest stations within a specific distance. If this parameter is set to “NA”, the software will search 10 nearest observations without a distance limit (more detail in https://cran.r-project.org/web/packages/reddPrec/index.html). In the IDB, 92% of the stations are located 0-400 km from each other (Table SM2). Therefore, we investigated the effect of different “thres” values (i.e., “NA”, 50, 100, 200, 300, and 400 km) on the performance of the reddPrec. Based on Figure SM3, it was found that in the sparsely gauged IDB, introducing lower values for “thres” will improve reddPrec performance slightly. However, this minor improvement in the performance will compromise the number of gaps filled by reddPrec. When “thres” is defined as “NA”, the majority of the validation set is filled by reddPrec (95%). However, in low values of “thres”, less than 60% of gaps are filled. Thus, we decided to set “thres” to “NA” (results in Figs. 2, 3 and 4 are based on this assumption).

Fig. 3 Performance of different methods at different thresholds in terms of: (a) HSS, (b) POD, and (c) FAR

Fig. 4 Spatial distribution of mean error for the stations in the study area for different models: (a) FCDNN, (b) reddPrec, (c) RF, (d) PERSIANN-CDR, and (e) boxplots for different models based on the mean of error for each station

Calculation of the categorical metrics (POD, HSS, and FAR) using different precipitation thresholds (0, 0.1, 0.5, 1, 2, and 5 mm/day) for distinguishing between rainy and non-rainy days revealed critical information on the capacity of each method in detecting precipitation events (Fig. 3). Among the imputation methods, RF and FCDNN have the best POD (true hits divided by the sum of true hits and misses as shown in Eq. S4 in SM). However, in low thresholds (0-0.5 mm/day), RF has higher FAR (false alarms divided by the sum of true hits and false alarms as shown in Eq. S5 in SM) compared to other methods. As the threshold increases, the performance of the RF model improves indicating that the RF model has problems capturing low precipitation rates.
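The threshold-dependent categorical scores can be sketched from a 2×2 contingency table. POD and FAR follow the definitions given in the paragraph above; the HSS expression is the standard two-category Heidke formula and is assumed to match the one in the SM. Treating a day as rainy when precipitation is strictly greater than the threshold is also an assumption, and the sample values are made up.

```python
def contingency(obs, est, threshold):
    """Count hits, misses, false alarms, and correct negatives."""
    hits = misses = false_alarms = correct_negatives = 0
    for o, e in zip(obs, est):
        obs_rain, est_rain = o > threshold, e > threshold
        if obs_rain and est_rain:
            hits += 1
        elif obs_rain:
            misses += 1
        elif est_rain:
            false_alarms += 1
        else:
            correct_negatives += 1
    return hits, misses, false_alarms, correct_negatives

def pod(h, m, f, c):
    """Probability of detection: hits / (hits + misses)."""
    return h / (h + m) if h + m else float("nan")

def far(h, m, f, c):
    """False alarm ratio: false alarms / (hits + false alarms)."""
    return f / (h + f) if h + f else float("nan")

def hss(h, m, f, c):
    """Heidke skill score: accuracy relative to random chance."""
    denom = (h + m) * (m + c) + (h + f) * (f + c)
    return 2 * (h * c - f * m) / denom if denom else float("nan")

# Made-up daily values (mm/day) for eight held-out days.
observed = [0.0, 0.0, 0.3, 1.2, 6.0, 0.0, 2.5, 0.0]
estimated = [0.2, 0.0, 0.0, 0.8, 4.0, 0.0, 3.0, 0.1]

for threshold in (0, 0.1, 0.5, 1, 2, 5):
    table = contingency(observed, estimated, threshold)
    print(threshold, pod(*table), far(*table), hss(*table))
```

In this toy series the two small spurious estimates (0.2 and 0.1 mm/day) count as false alarms at the 0 mm/day threshold but disappear at 1 mm/day, which mirrors how a method with many light false detections can score better as the threshold rises.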

The HSS metric, aggregating the effect of FAR and POD was investigated to consider the effect of both directions in the contingency table (true hits and false alarms as well as true hits and misses). Therefore, HSS is a good score to quantify the trade-off between FAR and POD. In the low thresholds (0-0.5 mm/day), the trade-off between the high value of FAR and POD for RF has led to the lowest values of HSS for this method. Although this method is powerful in correctly detecting rainy events (high POD), RF (high FAR) reports many days wrongly as rainy. Therefore, the overall strength of RF for distinguishing between rainy and non-rainy days is very low compared to other methods in low thresholds. Increasing the threshold increases RF performance considerably, so this method achieves the highest HSS for a 5 mm/day threshold among all methods.

The reddPrec method has consistently low POD and high FAR. Therefore, this method has the lowest HSS among others. On the other hand, FCDNN and PERSIANN-CDR have consistently high POD and low FAR (specifically in thresholds less than 2 mm/day). Thus, these methods have the highest HSS. However, in higher thresholds, the performance of FCDNN decreases since its FAR increases considerably compared to little improvement in POD. Based on the HSS value, the PERSIANN-CDR performs best when the threshold is set to 1 mm/day (the highest HSS is 0.64).

The spatial distribution of error for the stations is displayed in Fig. 4. The majority of stations (62%) filled by the RF method (Fig. 4c) have an absolute error of less than 0.5 mm/day. After RF, reddPrec (Fig. 4b) has the highest number of stations with low error (-0.5 to 0.5 mm/day, shown by green points). However, for reddPrec, overestimation at other stations (big red points in Fig. 4b) has affected the overall accuracy (Fig. 2b). FCDNN and PERSIANN-CDR mainly underestimate daily precipitation, although the performance of PERSIANN-CDR is better (Fig. 4a and d). It is noteworthy that the mean error for all stations is very close to zero for RF, the most accurate model.

All examined methods in this research have limitations. Although reddPrec has the least accuracy among all other methods, the spatial distribution of error in the stations reveals that this method is more successful in filling the gaps of data in many locations compared to FCDNN and PERSIANN-CDR. The “thres” parameter is very important in reddPrec as it determines which adjacent stations can be used for gap-filling. IDB is a sparsely gauged basin, so the “thres” had to be set to “NA” to fill the highest ratio of gaps. This will introduce more uncertainty/error into the reddPrec method because distant stations are allowed to be utilized in the gap-filling process. Even though the “thres” was set to “NA”, 5% of gaps in the test set were not filled by reddPrec. High error values in some stations may be attributed to what is described above.

The study area is a data-deprived region with many gaps in the observations. In other words, the gauges do not provide continuous time series; hence, the options for selecting machine learning algorithms are limited. Many algorithms, such as convolutional neural networks, could not be used because of the lack of data, so we included only a fully connected neural network in our models. Both machine learning-based methods use gridded meteorological parameters (Table 2, from MERRA-2). All precipitation products are similarly gridded data. Therefore, values of the nearest cell to each rain gauge were used to fill the gaps. Attributing the value of a whole cell to a point introduces uncertainties/errors in the estimates, especially for a low-resolution dataset, as the cell values represent the entire cell, not the gauge location. Finally, it should be noted that most of the daily precipitation records are less than 1 mm/day, although the range of daily precipitation reaches 100 mm/day. The high concentration of low values (less than 1 mm) in the records used for training RF and FCDNN led to low variance in their predictions. Thus, FCDNN and RF could not estimate precipitation higher than 10 and 20 mm/day, respectively. This can also be attributed to the limited number of samples representing higher precipitation rates in the training set, inaccuracy of the high precipitation rates in the reference set, and/or a lack of relevant information in the features to represent extreme precipitation rates.
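The nearest-cell lookup mentioned above can be sketched for a regular latitude/longitude grid. Everything specific in the example is an assumption made for illustration: the grid origin, the 0.5° by 0.625° spacing, the gauge coordinates, the precipitation values, and the function names are hypothetical and do not describe the actual processing chain of the study.

```python
def nearest_cell_index(coord, first_centre, step, n_cells):
    """Index of the grid cell whose centre is closest to coord (regular grid)."""
    idx = round((coord - first_centre) / step)
    return min(max(idx, 0), n_cells - 1)  # clamp to the grid extent

def fill_gaps(gauge_series, cell_series):
    """Replace missing gauge values (None) with the nearest-cell value."""
    return [c if g is None else g for g, c in zip(gauge_series, cell_series)]

# Hypothetical grid: cell centres start at 12°S, 29.375°E.
lat0, dlat, n_lat = -12.0, 0.5, 24
lon0, dlon, n_lon = 29.375, 0.625, 18

gauge_lat, gauge_lon = -3.37, 35.8  # hypothetical gauge location
i = nearest_cell_index(gauge_lat, lat0, dlat, n_lat)
j = nearest_cell_index(gauge_lon, lon0, dlon, n_lon)
cell_centre = (lat0 + i * dlat, lon0 + j * dlon)

# Made-up daily series (mm/day): gauge record with gaps, and the cell (i, j) series.
gauge = [0.0, None, 4.2, None, 0.0]
cell = [0.1, 1.5, 3.0, 0.0, 0.2]
print(i, j, cell_centre, fill_gaps(gauge, cell))
```

With cells this coarse the selected centre can lie tens of kilometres from the gauge, which is the representativeness error discussed in the paragraph above.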

5 Conclusions

This study investigated the performance of three precipitation gap-filling approaches over a sparsely gauged region in Tanzania: (i) precipitation products including GPCC V2020, GPCP V1.3, PERSIANN-CDR, ERA5, and IMERG Final V6; (ii) machine learning-based imputation approaches, namely Fully Connected Deep Neural Network and Random Forest; and (iii) daily precipitation gap-filling software, namely reddPrec. Based on the available data in the study area, 2000–2010 was selected as the study period because of the highest overlap between in situ records and satellite/reanalysis data. To evaluate the performance of each approach, we used 30% of the available rain gauge records (15,075 observations) for validation and testing, and the rest of the data was used for training the machine learning-based methods (category ii) and running the reddPrec software (category iii). Evaluation of the precipitation products (category i) against the test set revealed that PERSIANN-CDR has the best performance among the examined precipitation products (the lowest RMSE and the highest correlation coefficient). PERSIANN-CDR also has the best performance in detecting rainy days, based on HSS, among all examined gap-filling methods. However, other precipitation products are less biased than PERSIANN-CDR (based on Rbias). This study showed that machine learning-based gap-filling methods trained with meteorological data (from MERRA-2) perform better overall than the other methods. Random Forest has a lower bias than PERSIANN-CDR and is the best-performing method in the study area. RF also has the lowest RMSE, the highest correlation coefficient, and the lowest bias among all examined methods and precipitation products. The main difference between the trained Random Forest model and global precipitation products (e.g., PERSIANN-CDR) is the use of a larger number of in situ records in the training process. The accuracy of global precipitation products suffers from the lack of in situ data in the calibration process, especially in developing countries. In these countries, the contribution of in situ records to international data centers (e.g., the Global Telecommunication System) is very low. The Global Telecommunication System, as an example, is a source of calibration data for many precipitation products. Consequently, the accuracy of precipitation products (e.g., GPCC, which is used for bias adjustment in PERSIANN-CDR, GPCP, and IMERG) is negatively affected by the lack of in situ data.