1 Introduction

Rainfall plays a significant role in the water cycle and is an important climate variable (Trenberth et al. 2003). It is closely linked to atmospheric circulation in weather and climate research (Kidd and Huffman 2011). An accurate and reliable estimation of rainfall is crucial for many scientific studies and practical applications including managing water resources, weather and climate forecasting, assessing and mitigating flood and drought risks, hydrology, optimizing crop water use and irrigation scheduling, disease outbreak modeling and forecasting, etc. (Latif et al. 2023; Sokol et al. 2021; Kaune et al. 2019; Liu et al. 2016; Johansson et al. 2016; Moore et al. 2012; Younis et al. 2008). Rain gauges and recording sensors are useful for measuring rainfall values at a specific point location, but they are not available in many sparsely populated or mountainous areas and can be affected by sampling errors and missing data (Ringard et al. 2015; Deng et al. 2018). To provide global and spatial coverage of rainfall estimates at regular intervals, gridded rainfall products have been developed. These products overcome the limitations of sparse ground stations by providing accurate rainfall data and filling in the data gaps, even in remote regions, thus significantly enhancing the input data for several applications (Henn et al. 2018). Standardized rainfall products are crucial for providing consistent, reliable, and comparable rainfall data across different regions and timescales. Various collaborative efforts between international organizations and space agencies, such as NASA, JAXA, NOAA, and ECMWF, are driving the development of robust rainfall products. These products are beneficial for researchers, policymakers, and the general public. Currently, several global rainfall products are available, and new ones are continuously being developed to ensure accurate rainfall estimates. 
These products are typically presented as gridded datasets that capture spatial variability and compensate for the irregular distribution and limited spatial coverage of point-based ground station observations (Sun et al. 2018). The gridded products fall into different categories, including gauge-based interpolations, satellite-based estimates, combinations of gauge and satellite data, and reanalysis systems using various techniques (Sun et al. 2018).

Several gridded rainfall data products with higher spatial and temporal resolution have been developed and used in numerous studies. However, these products are not entirely consistent in accuracy, and their performance varies spatially. The variation stems from differences in data sources (e.g., gauge observations, reanalysis products, or satellite retrievals), location, topography, season, and the hydro-climatic characteristics of a particular area (particularly the precipitation type, convective or advective, and the underlying mechanisms of rainfall generation), as well as from differences in product processing algorithms and techniques (Tapiador et al. 2012; Toté et al. 2015; Derin and Yilmaz 2014; Gebremichael et al. 2014; Hu et al. 2014).

In the Nile Basin (Ethiopia), Bayissa et al. (2017) compared five high-resolution remote sensing-derived rainfall products with ground observation data and found that the CHIRPS dataset performed better than the others, while PERSIANN showed the weakest performance under the same conditions. Dinku et al. (2010) evaluated seven different satellite rainfall estimates over Colombia and reported that CMORPH and GSMaP-MVK+ performed best among the products evaluated. Prakash et al. (2014) found that TRMM 3B42-V7 was the most reliable product for monsoon studies in the South Asia region compared to CMORPH and PERSIANN, as indicated by its high correlation, low RMSE, and small underestimation. Similarly, in the western Black Sea region of Turkey, TRMM outperformed CMORPH at different temporal scales (Derin and Yilmaz 2014). In contrast, the TRMM 3B42-V7 product showed the poorest performance in comparison to CMORPH, GSMaP, and PERSIANN at daily resolution across central Asia (Guo et al. 2015).

These studies indicate that no single rainfall product performs accurately across all climatic zones of the world. Therefore, it is important to validate and compare the performance of different rainfall products against ground measurements for a specific area and temporal scale before selecting and using them in any subsequent application or decision making (Ouma et al. 2012; Feidas 2009). Furthermore, assessing the gridded rainfall products at regional and local scales supports the development of interpolation methods, satellite-based algorithms and sensors for improved estimation of rainfall.

Although several rainfall products are readily available in the public domain, most existing studies have compared and evaluated only a few of these products, particularly for the monsoon climate. This limited approach can make it challenging to select the most effective rainfall product for a particular area. Moreover, advanced techniques have led to the development of new rainfall products that have so far not been widely used in evaluations. Thus, choosing only a few products for comparison may hinder the ability to identify the best-performing publicly available rainfall product. In addition, monsoon climates are characterized by seasonal variations in rainfall, with heavy and frequent rain events, which make accurate rainfall estimation essential for several applications, such as agriculture, water resource management, and disaster preparedness. However, there is a lack of literature that provides a comprehensive evaluation of various rainfall products for monsoon climates. Furthermore, in most previous studies, the comparison and evaluation of different rainfall products rely on a few common statistical methods. A comprehensive analysis should instead employ advanced statistical analysis that considers both continuous and categorical metrics together with a combined overall ranking method (composite score).

To address this research gap, our study evaluated and compared 23 publicly available rainfall products at different temporal scales for a monsoon-dominated urban climate, using a case study of Patna in India. The novelty of our approach also lies in the use of several statistical methods, namely thirteen continuous metrics, ten categorical metrics and a composite score, for extensive analysis. Furthermore, there has been no prior in-depth analysis of the performance of different rainfall products in Patna. Thus, our research is a first attempt to fill this gap.

2 Study area

Patna is the capital of Bihar state and the second-largest city in Eastern India. It experiences a typical urban monsoon climate that plays a substantial role in shaping the cultural landscape and influencing the lives of its inhabitants (Fig. 1). Patna’s monsoon season lasts from late June to late September and brings more than 90% of the annual rainfall. It is shaped by geographical factors such as low elevation, flat topography, and its location south of the river Ganga, making it highly vulnerable to floods (Mirza 2011). Given the high population density and inadequate drainage system in the city, extreme rainfall events can have drastic repercussions. Therefore, it is imperative to have accurate monitoring and understanding of the best-performing rainfall products, particularly during the monsoon season, to facilitate better planning and disaster risk management (Kelley et al. 2020).

Fig. 1

The location of the study area and the rainfall gauge stations. * Indian Council of Agricultural Research (ICAR) and Patna Municipal Corporation (PMC)

3 Data

3.1 Observed rain gauge data

The main objective of this study is to assess and compare the accuracy of major global gridded rainfall products. Therefore, the accuracy of the ground-based (gauge) rainfall measurements that are used as a validation dataset is crucial in this investigation.

Our study focused on a single weather station at Patna airport (Fig. 1) with daily recorded rainfall data from 2000 to 2015. This station, located at 53 m above mean sea level, has a long time series of quality-controlled data. Since the study region is a plain, no major topographic effect on rainfall distribution is expected. Another station, the ICAR rainfall station, also had quality-controlled data, but we did not use it for validation because of its short observation record (2013–2018) and its proximity to the Patna airport rainfall station. We used the ICAR rainfall station only to check the quality of the observed rainfall values recorded at the Patna airport station. This helped us to identify and check potential outliers resulting from heavy rainfall events (more than 50 mm per day) and missing values.

3.2 Global gridded rainfall products

In the following, we briefly describe 23 gridded rainfall products (listed in Table 1) that were evaluated and compared for their performance based on the ground-measured rainfall data in this study. These selected rainfall products have been categorized into three groups based on the data source and development algorithm. Group I: Gauge-based interpolated rainfall products; Group II: Remote sensing derived rainfall products with gauge-based adjustments; and Group III: Rainfall reanalysis products.

Table 1 Selected Global Rainfall Products

While a comprehensive overview of the various algorithms and techniques employed in generating gridded rainfall datasets is beyond the scope of this study, readers may refer to the referenced literature for more detailed explanations.

3.2.1 Gauge-based interpolated rainfall products

The Global Climate Observing System (GCOS), a branch of the World Meteorological Organization (WMO), was founded in 1992 to provide rainfall data, freely available in interpolated gridded format (Kidd et al. 2017). The gauge-based interpolated rainfall products used in this study are as follows (please refer to Table 1 for details).

  1. India Meteorological Department (IMD) gridded rainfall data

  2. Climate Research Unit (CRU) version 4.04, a global rainfall product

  3. Global Precipitation Climatology Centre (GPCC) version 2018

  4. Asian Precipitation - Highly-Resolved Observational Data Integration Towards Evaluation of Extreme Events (APHRODITE) 2

3.2.2 Remote sensing based rainfall estimates

Global and continuous measurements of atmospheric parameters are only made possible by satellite systems, which include sensors capable of consistent and worldwide rainfall estimation (Prigent 2010; Michaelides et al. 2009). Furthermore, several advanced and improved gridded rainfall estimates have been developed by combining satellite-derived rainfall data with gauge measurements using various techniques (Belay et al. 2022).

  5. Climate Prediction Center MORPHing (CMORPH)

  6. Climate Prediction Center Unified Precipitation Product (CPC-UPP)

  7. Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS)

  8. Global Precipitation Climatology Project (GPCP), Version 2.3

  9. Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks-Cloud Classification System (PERSIANN-CCS). The PERSIANN-CCS algorithm utilizes more information from the infrared cloud images than other PERSIANN products (Nguyen et al. 2018).

  10. PERSIANN-Climate Data Record (PERSIANN-CDR)

  11. PERSIANN Dynamic Infrared–Rain Rate (PDIR), designed to replace the PERSIANN-CCS dataset

  12. Tropical Rainfall Measuring Mission (TRMM), a widely used rainfall dataset over the tropics. The TRMM satellite was decommissioned on April 8, 2015 and succeeded by the Global Precipitation Measurement (GPM) Core Observatory satellite, which was launched in early 2014 (Liu et al. 2012).

  13. Integrated Multi-Satellite Retrievals for GPM (IMERG), version 6

  14. Asian Integrated Multi-satellitE Retrievals for Global Precipitation Measurement (AIMERG)

  15. Global Satellite Mapping of Precipitation (GSMaP)

  16. Multi-Source Weighted-Ensemble Precipitation (MSWEP), a global rainfall dataset

3.2.3 Reanalysis-based rainfall estimates

Reanalysis systems combine data from various sources, including models and observations, to produce a comprehensive and uniform estimate of a system's state. This estimate is generated on a spatially and temporally consistent grid, with a hierarchical approach used to calculate surface precipitation (Lakew et al. 2020; Bosilovich et al. 2008).

  17. Climate Forecast System Reanalysis (CFSR)

  18. Modern-Era Retrospective Analysis for Research and Applications, Version 2 (MERRA-2)

  19. European Centre for Medium-Range Weather Forecasts Reanalysis v5 (ERA5). ERA5 supplants the previous ERA-Interim reanalysis, which ceased production on August 31, 2019 (Bell et al. 2021).

  20. European Centre for Medium-Range Weather Forecasts Reanalysis v5 Agriculture (ERA5Ag)

  21. European Centre for Medium-Range Weather Forecasts Reanalysis v5 Land (ERA5-Land)

  22. Rainfall Estimates on a Gridded Network (REGEN)

  23. TerraClimate

Table 1 lists the selected global rainfall products and summarizes, for each, the spatial resolution, the time period considered in our study, the general features, the data source and the reference containing a detailed description.

4 Methodology

Our aim was to utilize the publicly available global datasets to their fullest potential by considering 23 major global rainfall products. The products were selected for their long time series, their range of spatial and temporal resolutions, and their free availability in the public domain.

We performed a rigorous and comprehensive analysis using point-to-grid and grid-to-grid evaluation approaches. Three sets of validation metrics were applied: continuous statistics, categorical statistics and composite evaluation metrics. We considered four temporal resolutions: daily, weekly, monthly and annual. These represent standard time steps for rainfall data analysis and reflect the different temporal resolutions that applications such as hydrological modelling and water management require, depending on their purpose (e.g. daily data for irrigation scheduling; annual sums for long-term analyses of climate change impact on the water cycle). Continuous and composite evaluation metrics were applied to rainfall data at all four temporal resolutions, whereas categorical statistics were applied to daily rainfall data only. Furthermore, we plotted correlation (r), root mean square error (RMSE), and standard deviation (SD) together in a Taylor diagram (Taylor 2001) for better understanding and visualization of the validation results. Finally, we compared the rainfall intensity distribution of the different rainfall products against the observed data using the probability distribution function (PDF) and categorized the rainfall intensity using IMD guidelines.

The methodology utilized in this research is illustrated in Fig. 2, which displays the various methodological approaches considered in a sequential step-by-step manner. The details follow in the respective sections.

Fig. 2

Methodology flow diagram

4.1 Validation datasets

Twenty of the rainfall products listed in Table 1 were available at daily time steps and were therefore evaluated at all four temporal scales (daily, weekly, monthly and annual). The remaining three products (CRU, GPCP and TERRA) were available at monthly time steps and were therefore evaluated only at monthly and annual time scales.

The observed rainfall data for the period 2000–2015 (except 2014) at Patna airport station (validation dataset) have only a few data gaps (less than 2%). The quality of validation data is important in comparing and evaluating gridded rainfall products. Hence, we did not fill the data gaps using information from neighboring stations. Where the validation data had a gap, the corresponding values from the global gridded rainfall products were set to NA and excluded from the comparison. Furthermore, the year 2014, which had many missing values, was not considered in the point-to-grid evaluation.
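The gap handling described above can be sketched as a pairwise filter. This is only an illustration under stated assumptions: the function name `pair_valid`, the use of `None` or NaN to mark a missing gauge value, and the sample numbers are all hypothetical and are not taken from the study's processing code.

```python
import math

def pair_valid(observed, product):
    """Return (gauge, product) pairs for time steps where the gauge value exists.

    A gauge value counts as missing if it is None or NaN (an assumption).
    Product values at those time steps are dropped, not gap-filled.
    """
    pairs = []
    for obs, est in zip(observed, product):
        if obs is None or (isinstance(obs, float) and math.isnan(obs)):
            continue  # gap in the validation data: exclude this time step
        pairs.append((obs, est))
    return pairs

# Hypothetical daily values (mm/day); two gauge gaps are excluded.
obs = [0.0, 12.4, None, 3.1, float("nan")]
est = [0.2, 10.0, 5.5, 2.0, 1.0]
valid = pair_valid(obs, est)
```

Filtering the pairs once, before any metric is computed, keeps every statistic based on exactly the same set of time steps.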

Data availability was the main criterion for defining our validation time period. Different gridded rainfall products cover different time durations of rainfall estimates. However, most of the rainfall products have records from 2000 to 2015. Hence, in order to maintain consistency in the time frame for comparison, we used data from 2000 to 2015 for evaluation and inter-comparison of selected gridded rainfall products.

4.2 Comparison approach based on reference data at different spatial scales

There is no single best evaluation method. Thus, we considered both point-to-grid and grid-to-grid comparison methods due to the pros and cons related to each method.

4.2.1 Point-to-Grid comparison

Rain gauges are the primary source of rainfall data. They remain the most accurate and reliable means of rainfall measurement and are considered an ideal validation dataset. Thus, in our study we used a point-to-grid comparison approach similar to previous ones (Ayehu et al. 2018; Dembélé and Zwart 2016; Tan et al. 2015). The grid values of each gridded rainfall product corresponding to the Patna airport station (point location) were extracted and compared pair-wise with the rain gauge values.
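One common way to obtain the grid value for a station is to pick the cell whose center is nearest to the station. The sketch below assumes a regular latitude/longitude grid described by its cell-center coordinates. The helper names, the toy grid and the approximate station coordinates are illustrative assumptions, and the study may have used a different extraction rule.

```python
def nearest_index(centers, value):
    """Index of the cell center closest to `value`."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - value))

def extract_point(grid, lats, lons, lat, lon):
    """Value of the grid cell nearest to (lat, lon).

    `grid[i][j]` is assumed to hold the value at (lats[i], lons[j]).
    """
    i = nearest_index(lats, lat)
    j = nearest_index(lons, lon)
    return grid[i][j]

# Toy 0.25-degree grid; the station coordinates are rough, illustrative values.
lats = [25.25, 25.50, 25.75]
lons = [84.75, 85.00, 85.25]
grid = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0],
        [7.0, 8.0, 9.0]]
value = extract_point(grid, lats, lons, 25.59, 85.09)
```

Repeating the extraction for every time step of a product yields the series that is paired with the gauge record.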

4.2.2 Grid-to-Grid comparison

For grid-to-grid comparison, we did not resample the original grid resolution of the globally derived rainfall products to match the IMD gridded reference data, because the grid resolution is an intrinsic property of a data product and an important factor in its performance. Changing the grid resolution could alter the original product performance during the evaluation. Thus, we did not bring the different products to a common resolution. Additionally, some satellite products perform well because of their smaller grid size, which must be considered an inherent property of the product during evaluation (Cohen et al. 2012).

4.3 Evaluation methods: statistics metrics

4.3.1 Continuous quantitative statistical metrics (Please see the details in Table S2)

  1. Pearson product-moment correlation coefficient (r)

  2. Coefficient of Determination (R2)

  3. bR2

  4. Root Mean Square Error (RMSE)

  5. Mean Error (ME)

  6. Mean Absolute Error (MAE)

  7. Mean Square Error (MSE)

  8. Ratio of RMSE to the Standard Deviation of the Observations (RSR)

  9. Percent Bias (PB)

  10. Index of Agreement (D)

  11. Modified Index of Agreement (MD)

  12. Nash Sutcliffe efficiency (NSE)

  13. Kling-Gupta efficiency (KGE)

Continuous statistical metrics were applied to rainfall data at all four temporal scales (daily, weekly, monthly and annual).

In general, when comparing estimates to observations, a satisfactory agreement for PBIAS and r values is considered to be within ± 25% and above 0.5, respectively (Moriasi et al. 2007). The closer the RMSE and MAE values are to zero, the more accurate the rainfall product estimates are in comparison to the ground measurements.
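A few of the continuous metrics listed above can be sketched in plain Python. The formulas below are commonly used textbook forms and are assumptions here, since the study's exact definitions are given in Table S2: percent bias is taken as 100 × Σ(estimate − observation) / Σ(observation), so that negative values indicate underestimation, and KGE follows the 2009 formulation based on correlation, variability ratio and mean ratio. The function name and sample values are hypothetical.

```python
import math

def continuous_metrics(obs, est):
    """Compute r, RMSE, MAE, ME, PBIAS, NSE and KGE for paired series."""
    n = len(obs)
    mean_obs = sum(obs) / n
    mean_est = sum(est) / n
    errors = [e - o for o, e in zip(obs, est)]        # estimate minus observation
    sq_err = sum(d * d for d in errors)
    ss_obs = sum((o - mean_obs) ** 2 for o in obs)    # spread of observations
    ss_est = sum((e - mean_est) ** 2 for e in est)    # spread of estimates
    cov = sum((o - mean_obs) * (e - mean_est) for o, e in zip(obs, est))
    r = cov / math.sqrt(ss_obs * ss_est)
    alpha = math.sqrt(ss_est / ss_obs)                # variability ratio
    beta = mean_est / mean_obs                        # mean (bias) ratio
    return {
        "r": r,
        "RMSE": math.sqrt(sq_err / n),
        "MAE": sum(abs(d) for d in errors) / n,
        "ME": sum(errors) / n,
        "PBIAS": 100.0 * sum(errors) / sum(obs),
        "NSE": 1.0 - sq_err / ss_obs,
        "KGE": 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2),
    }

# Hypothetical paired daily values (mm/day).
metrics = continuous_metrics([0.0, 5.0, 20.0, 2.0], [0.0, 4.0, 15.0, 1.0])
```

The sketch assumes that both series vary and that the observed total is non-zero. A constant series or an all-dry record would need explicit handling.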

4.3.2 Categorical statistical metrics

The categorical statistical metrics were used to evaluate the rainfall detection ability of the gridded (derived) rainfall products. Each day is represented by a binary variable that indicates yes (1) or no (0) for a rainfall event based on a threshold value. In this study, the daily rainfall threshold was set to 0.5 mm/day: any rainfall amount above 0.5 mm/day is considered a rainy day (1), whereas an amount below 0.5 mm/day is recorded as a non-rainy day (0). The categorical statistical metrics were applied only to the daily rainfall datasets.

To compare forecasts (gridded rainfall product) with occurrences (observed rainfall), we used a contingency table showing the frequency of yes and no forecasts and occurrences. There are four possible outcomes: (i) Hit: event forecast to occur, and did occur; (ii) Miss: event forecast not to occur, but did occur; (iii) False Alarm: event forecast to occur, but did not occur; (iv) Correct Negative: event forecast not to occur, and did not occur. Based on the contingency table, we considered the ten categorical statistical metrics briefly discussed below (please refer to Table S3 for details); WMO (2017) provides detailed information.

  1. Probability of Detection (POD), also known as the “hit rate”, measures the ratio of the number of rain events correctly detected by the derived rainfall product to the total number of observed rainfall events.

  2. False Alarm Ratio (FAR) is the ratio of the number of rainfall events falsely detected by the derived rainfall product to the total number of rainfall events detected by that product.

  3. Probability of False Detection (POFD), also known as the “False Alarm Rate”, indicates the fraction of the observed "no" events that were incorrectly forecast as "yes" by the derived rainfall product.

  4. Critical Success Index (CSI), also known as the Threat Score, is a function of POD and FAR and provides a more balanced score.

  5. Heidke Skill Score (HSS), also known as Cohen's kappa, is a skill score based on Accuracy, in which Accuracy is corrected for the number of correct forecasts that would be expected by chance.

  6. True Skill Statistic (TSS), also known as the Hanssen and Kuipers discriminant or Peirce's skill score, considers all the elements of the contingency table and estimates how well the forecast separated the "yes" events from the "no" events.

  7. Equitable Threat Score (ETS), also known as the “Gilbert skill score”, is based on the CSI, corrected for the number of hits that would be expected by chance.

  8. Accuracy (ACC), also known as Frequency Correct, measures the proportion of correct decisions out of all decisions made.

  9. Bias score (BIAS), also known as Frequency Bias, indicates how the forecast frequency of "yes" events compares to the observed frequency of "yes" events.

  10. Area Under Curve (AUC) summarizes the overall performance.

Taken together, these categorical metrics enable a comprehensive comparison of the detection ability of the rainfall products.
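The contingency table and several of the scores above can be sketched as follows. The formulas are the standard contingency-table definitions (for example, FAR as false alarms divided by all events detected by the product) and are stated here as assumptions, since the study's definitions are in Table S3. The function names and sample values are hypothetical, and a day is treated as rainy only when the amount is strictly above the 0.5 mm/day threshold.

```python
def contingency(obs, est, threshold=0.5):
    """Count hits, misses, false alarms and correct negatives."""
    hits = misses = false_alarms = correct_neg = 0
    for o, e in zip(obs, est):
        obs_rain, est_rain = o > threshold, e > threshold
        if obs_rain and est_rain:
            hits += 1
        elif obs_rain:
            misses += 1
        elif est_rain:
            false_alarms += 1
        else:
            correct_neg += 1
    return hits, misses, false_alarms, correct_neg

def categorical_metrics(h, m, fa, cn):
    """Scores from the 2x2 table; assumes no denominator is zero."""
    n = h + m + fa + cn
    hits_random = (h + m) * (h + fa) / n   # hits expected by chance
    pod = h / (h + m)
    pofd = fa / (fa + cn)
    return {
        "POD": pod,
        "FAR": fa / (h + fa),
        "POFD": pofd,
        "CSI": h / (h + m + fa),
        "ETS": (h - hits_random) / (h + m + fa - hits_random),
        "TSS": pod - pofd,
        "HSS": 2 * (h * cn - fa * m) / ((h + m) * (m + cn) + (h + fa) * (fa + cn)),
        "ACC": (h + cn) / n,
        "BIAS": (h + fa) / (h + m),
    }

# Hypothetical daily values (mm/day).
obs = [0.0, 1.2, 0.3, 10.0, 0.0, 4.0]
est = [0.6, 2.0, 0.0, 0.0, 0.0, 3.0]
table = contingency(obs, est)
scores = categorical_metrics(*table)
```

AUC is omitted because it requires sweeping the threshold rather than a single table.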

It is crucial to focus on specific statistics depending on the intended use of rainfall products (Toté et al. 2015). For flood forecasting and hydrology, it is essential to avoid products that underestimate rainfall events and amounts, as indicated by ME < 0 and low POD. In contrast, for drought monitoring, it is vital to avoid products that overestimate rainfall, as indicated by ME > 0 and high FAR. For general purposes, products with high r, R2, bR2, NSE, KGE, MD, D, POD, CSI, HSS, TSS, ETS, ACC, Precision, Recall, F1 score and AUC, a BIAS close to 1, ME and PBIAS close to zero, and low RMSE, MAE, MSE, RSR, FAR and POFD should be prioritized (Toté et al. 2015).

4.3.3 Composite evaluation metrics

No single statistic can capture the overall performance of a derived rainfall product, because each statistical method has its own limitations. A number of statistical methods are therefore needed to provide a comprehensive analysis of the performance of different rainfall products. To derive an overall composite score for the continuous statistics, we used the “Overall Ranking” approach, referred to here as the composite score. In this approach, each rainfall product was ranked on each metric according to its performance. The best-performing product received a rank of one, the second-best a rank of two, and so on, down to the lowest-performing product. The ranks of each product were then summed across all metrics to give its composite score. The product with the lowest composite score was considered the best performer overall, and the product with the highest score the worst. No weights were assigned to the evaluation metrics, which could be regarded as a potential limitation, since it is debatable whether all statistical methods in a combined score should receive equal weight or whether some should be emphasized over others. Because our study assesses the overall performance of rainfall products without a single application in mind, weighting the metrics is not appropriate here. For studies focused on specific applications such as flood or drought monitoring, prioritizing certain metrics is justified. For instance, in drought monitoring, metrics related to low rainfall detection should be given more weight. Conversely, in flood forecasting, metrics for high-intensity rainfall accuracy such as bias and root mean square error should be prioritized (Camici et al. 2020).
Note that the composite score is derived from continuous statistical metrics only.
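The rank-sum procedure can be sketched as follows. The product names A, B and C, the metric values, and the choice to rank percent bias by its absolute value are illustrative assumptions, not results or rules from the study. Ties are not treated specially in this sketch.

```python
def composite_scores(table, higher_is_better):
    """Sum per-metric ranks (1 = best) for each product, with equal weights.

    table: {product: {metric: value}}
    higher_is_better: {metric: True if larger values mean better performance}
    """
    products = list(table)
    scores = {p: 0 for p in products}
    for metric, higher in higher_is_better.items():
        # Best product first: descending for "higher is better", else ascending.
        ordered = sorted(products, key=lambda p: table[p][metric], reverse=higher)
        for rank, product in enumerate(ordered, start=1):
            scores[product] += rank
    return scores

# Made-up values for three hypothetical products.
table = {
    "A": {"r": 0.8, "RMSE": 5.0, "absPBIAS": 10.0},
    "B": {"r": 0.6, "RMSE": 4.0, "absPBIAS": 20.0},
    "C": {"r": 0.7, "RMSE": 6.0, "absPBIAS": 5.0},
}
directions = {"r": True, "RMSE": False, "absPBIAS": False}
scores = composite_scores(table, directions)
best = min(scores, key=scores.get)   # lowest rank sum = best overall
```

Because every metric contributes one rank per product, the lowest possible composite score equals the number of metrics, reached only by a product that ranks first on all of them.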

4.4 Probability distribution function (PDF)

The PDF was computed to evaluate the frequency of different rainfall intensities based on daily rainfall data (Li et al. 2013). We classified the daily rainfall intensity into nine distinct levels based on the IMD classification (Barde et al. 2020). The nine levels are no rain (0.0 mm/day), trace rain (0.01 to 0.04 mm/day), very light rain (0.1 to 2.4 mm/day), light rain (2.5 to 7.5 mm/day), moderate rain (7.6 to 35.5 mm/day), rather heavy rain (35.6 to 64.4 mm/day), heavy rain (64.5 to 124.4 mm/day), very heavy rain (124.5 to 244.4 mm/day), and extremely heavy rain (≥ 244.5 mm/day).
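The classification and the resulting intensity frequencies can be sketched as a lookup on the upper bounds of the classes listed above. One assumption is needed because the published class limits are stated to one or two decimals while gridded estimates are continuous: in this sketch, a value that falls between two classes (for example 2.45 mm/day) is assigned to the higher class. The function names and sample values are hypothetical.

```python
from bisect import bisect_left
from collections import Counter

# Upper bounds (mm/day) of the first eight classes; anything above the last
# bound is "extremely heavy rain".
UPPER_BOUNDS = [0.0, 0.04, 2.4, 7.5, 35.5, 64.4, 124.4, 244.4]
LABELS = [
    "no rain", "trace rain", "very light rain", "light rain", "moderate rain",
    "rather heavy rain", "heavy rain", "very heavy rain", "extremely heavy rain",
]

def classify(rain_mm):
    """Return the intensity class for one daily rainfall amount."""
    return LABELS[bisect_left(UPPER_BOUNDS, rain_mm)]

def intensity_frequencies(daily):
    """Relative frequency of each class over a daily series."""
    counts = Counter(classify(v) for v in daily)
    return {label: counts[label] / len(daily) for label in LABELS}

# Hypothetical daily values (mm/day).
freq = intensity_frequencies([0.0, 0.0, 1.5, 5.0, 80.0])
```

Computing the same frequencies for the gauge record and for each product allows the class-by-class comparison that the intensity analysis relies on.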

5 Results and discussion

5.1 Analysis of observed rainfall data at Patna airport station

The average annual observed rainfall (2000–2015) at Patna airport station is 1,000 mm, with a maximum annual value of 1,941 mm recorded in 2008 and a minimum value of 741 mm recorded in 2009 (Fig. S1a).

The daily box plot shows outliers, indicating unusually high rainfall recorded on a single day in August 2014 and August 2002 (Fig. S1b). These outlier values were verified with a neighboring ICAR rainfall station, which confirmed receiving rainfall to the same extent. This validation process ensured the quality of the measured rainfall data at Patna airport, which was used as a reference dataset for evaluating various gridded rainfall products.

The rainfall in Patna shows a seasonal pattern with maximum rainfall occurring during the monsoon season (June to October) whereas November to May is generally considered as a dry season with low rainfall (Fig. 3a). The mean monthly maximum rainfall occurred in August, whereas the highest standard deviation was reported in July.

Fig. 3

a Monthly rainfall distribution plot for the Patna airport rainfall station (mm/month) b Monthly rainfall boxplot

The monthly box plot analysis (Fig. 3b) shows that high levels of rainfall are observed in June, July, August, and September. Specifically, the median rainfall in July is approximately 250 mm, with the interquartile range (IQR) extending from 170 to 370 mm. The whiskers in July range from roughly 50 mm to just over 600 mm, indicating significant variability and occasional extreme rainfall events. In August, the median rainfall is slightly higher than in July, with an IQR extending from approximately 180 mm to 280 mm. June also shows substantial rainfall with a median of around 100 mm, an IQR from about 75 mm to 275 mm, and whiskers extending from roughly 50 mm to 500 mm. In contrast, the months from January to May and from October to December show significantly lower rainfall levels. During these months, the median rainfall generally stays below 50 mm. November, December, January, February, and March exhibit the lowest rainfall, with medians close to zero, reflecting the dry season. April and May begin to show increased rainfall, with median values reaching around 25 mm and 50 mm, respectively.

5.2 Continuous statistical analysis at daily time steps

5.2.1 Point-to-Grid comparison

Our findings indicate that the different rainfall products estimate the daily rainfall values with notable variations in accuracy. Table S4 and Fig. 4a reveal that the correlation coefficient (r) ranges from 0.22 (CPCUPP) to 0.79 (AIMERG), indicating that some of the products are highly correlated with the observed (station) data, while others are only weakly correlated. Twelve of the 20 rainfall products showed negative PBIAS, indicating underestimation of observed rainfall, and eight products showed positive PBIAS, indicating overestimation. KGE, which jointly accounts for correlation, bias, and variability, identified the IMD GRID as the top performer with a value of 0.46, whereas CPCUPP was the worst performer with the lowest value of 0.17.

Fig. 4

Taylor diagram at daily time steps

With the best r, R2, bR2, MAE, RSR, D, MD and NSE, AIMERG outperforms the other products on the majority of evaluation metrics. However, in the overall ranking (composite score) it ranked third (Table 2a). The top five rainfall products were MERRA2, IMD Grid, AIMERG, APHRODITE and MSWEP, whereas PCCS, CPCUPP, CFSR, TRMM and CMORPH were the least-performing products at daily time steps.

Table 2 Overall Ranking for Rainfall Products in Estimating Daily Rainfall

5.2.2 Grid-to-Grid comparison

As in the point-to-grid comparison, the different rainfall products showed different levels of accuracy in estimating the daily rainfall values in the grid-to-grid comparison (Table S5). AIMERG again performed best in terms of correlation (r = 0.72, Fig. 4b), R2 (0.51) and D (0.76), but performed poorly for NSE (-0.63) and had the highest mean error (2.21) and PBIAS (97) (Table S5). KGE indicated APHRODITE (0.49) as the best performer, closely followed by MERRA2 (0.46) and MSWEP (0.45). PBIAS was lowest for GSMaP (15.8), and NSE was low and negative for almost all the products, indicating the limitation of the rainfall products in simulating daily rainfall.

Based on the overall ranking (Table 2b), the top five rainfall products in the grid-to-grid comparison were APHRODITE, MERRA2, REGEN, MSWEP and GSMaP. Interestingly, AIMERG did not reach the top five, although it was the third-best performer in the point-to-grid comparison. On the other hand, PCCS, CFSR, TRMM, CPCUPP and P-CDR were the least-performing products at daily time steps, which is in line with the results of the point-to-grid comparison. Note that APHRODITE is an IDW-interpolated product based on rain gauge data, which makes it similar to our reference IMD gridded dataset. Thus, the performance of APHRODITE may be biased towards the reference dataset and favored during the grid-to-grid evaluation.

Ultimately, MERRA2 can be considered the single outstanding product in the overall ranking for daily rainfall estimates: it ranked first in the point-to-grid evaluation and second in the grid-to-grid evaluation. We therefore conclude that the MERRA2 dataset can be used in hydrological or hydraulic models of the study area, which features a monsoon climate, where daily rainfall data are required to simulate the water balance components and to support flood and drought monitoring.

5.3 Continuous statistical analysis at weekly time steps

The daily rainfall data were accumulated to weekly total rainfall and the performance was evaluated at weekly time steps. Similar to the comparison at daily time scale, the performance measures at weekly time steps showed striking differences between the rainfall products.
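The accumulation step can be sketched as follows. How the study delimited weeks and treated gaps is not stated in this section, so the sketch assumes non-overlapping 7-day blocks counted from the first day, drops an incomplete final block, and marks a week as missing (`None`) if any of its days is missing. The function name and sample values are hypothetical.

```python
def weekly_totals(daily):
    """Sum daily rainfall into consecutive, non-overlapping 7-day totals.

    A week containing any missing day (None) is returned as None, and a
    trailing block shorter than 7 days is dropped (both are assumptions).
    """
    totals = []
    for start in range(0, len(daily) - 6, 7):
        week = daily[start:start + 7]
        totals.append(None if any(v is None for v in week) else sum(week))
    return totals

# Hypothetical series: one complete week, one week with a gap, one leftover day.
daily = [1.0] * 7 + [0.0, 2.0, None, 0.0, 0.0, 0.0, 0.0] + [5.0]
weekly = weekly_totals(daily)
```

Applying the same function to the gauge record and to each product keeps the weekly pairs aligned.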

5.3.1 Point-to-Grid comparison

Across the evaluation metrics, the rainfall products performed better at weekly than at daily time steps. All the products showed medium to strong correlations with the ground-measured data (r ranged from 0.66 to 0.91, Fig. 5a). The majority of the products underestimated rainfall, whereas some products, such as AIMERG, ERA5, P-CDR and TRMM, overestimated it. The NSE and KGE of all the products were positive and higher than in the daily evaluation.

Fig. 5 Taylor diagram at weekly time steps

Referring to Table S6, MSWEP was the best performer in terms of r (0.9), R2 (0.84), MAE (0.92), MSE (5.37), RMSE (2.32), PBIAS (-6.6), NSE (0.83), MD (0.87) and KGE (0.84), closely followed by MERRA2. Although AIMERG showed a strong correlation with the ground-measured data (r = 0.9), it performed comparatively poorly in terms of RMSE, MAE, MSE, NSE and KGE.
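To illustrate how a product can correlate strongly with observations and still score poorly on NSE and KGE, the sketch below computes r, PBIAS, NSE and KGE for a hypothetical series. It assumes the commonly used definitions (PBIAS as 100 times the summed error over the summed observations, and KGE in its original form combining correlation, variability ratio and mean ratio); the exact formulas applied in the study are not restated in this section, and the sample values are invented for illustration.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson_r(obs, sim):
    """Pearson correlation coefficient between observed and estimated series."""
    mo, ms = mean(obs), mean(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_s = sum((s - ms) ** 2 for s in sim)
    return cov / math.sqrt(var_o * var_s)


def pbias(obs, sim):
    """Percent bias; positive values mean overestimation (assumed convention)."""
    return 100.0 * sum(s - o for o, s in zip(obs, sim)) / sum(obs)


def nse(obs, sim):
    """Nash-Sutcliffe efficiency; 1 is perfect, below 0 is worse than the observed mean."""
    mo = mean(obs)
    err = sum((s - o) ** 2 for o, s in zip(obs, sim))
    return 1.0 - err / sum((o - mo) ** 2 for o in obs)


def kge(obs, sim):
    """Kling-Gupta efficiency from correlation, variability ratio and mean ratio."""
    mo, ms = mean(obs), mean(sim)
    sd_o = math.sqrt(mean([(o - mo) ** 2 for o in obs]))
    sd_s = math.sqrt(mean([(s - ms) ** 2 for s in sim]))
    r = pearson_r(obs, sim)
    alpha = sd_s / sd_o
    beta = ms / mo
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


# Hypothetical series: the estimate is exactly twice the observation
obs = [0.0, 2.0, 4.0, 6.0]
sim = [2.0 * o for o in obs]
print(pearson_r(obs, sim), pbias(obs, sim), nse(obs, sim), kge(obs, sim))
```

In this invented case the correlation is perfect, yet PBIAS is 100% and both NSE and KGE are negative, because correlation is insensitive to a systematic overestimation while the other measures penalize it.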

Based on the overall ranking (Table 3a), the top five rainfall products were MSWEP, MERRA2, GPCC, REGEN and APHRODITE. MSWEP, which ranked 5th at the daily time step, performed best at weekly time steps with the lowest composite score of only 23. The second-best performer was MERRA2 (the best performer at the daily time step), which shows its consistently good performance. The worst performers were CFSR and the PERSIANN group of products, with composite scores of up to 219. AIMERG, the third-best performer at the daily time step, slipped to the 15th position, whereas TRMM ranked 16th and thus performed consistently poorly (17th at the daily time step).
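One way to obtain such a composite score is to rank the products for each metric and sum the ranks with equal weights, so that the lowest total is best. The sketch below assumes this rank-sum scheme and ignores ties; the product names A, B and C, their values, and the helper `composite_scores` are hypothetical and do not reproduce the study's tables.

```python
def composite_scores(table, badness):
    """Sum per-metric ranks (1 = best) for each product with equal weights.

    table:   {product: {metric: value}}
    badness: {metric: function mapping a value to a number where smaller is better}
    """
    scores = {product: 0 for product in table}
    for metric, to_badness in badness.items():
        ordered = sorted(table, key=lambda p: to_badness(table[p][metric]))
        for rank, product in enumerate(ordered, start=1):
            scores[product] += rank
    return scores


# Hypothetical products and metric values
table = {
    "A": {"r": 0.90, "RMSE": 2.3, "PBIAS": -6.0},
    "B": {"r": 0.85, "RMSE": 2.9, "PBIAS": 3.0},
    "C": {"r": 0.70, "RMSE": 4.1, "PBIAS": 40.0},
}
badness = {
    "r": lambda v: -v,   # higher correlation is better
    "RMSE": lambda v: v, # lower error is better
    "PBIAS": abs,        # closer to zero is better
}
scores = composite_scores(table, badness)
ranking = sorted(scores, key=scores.get)
print(scores, ranking)
```

Because every metric contributes one rank, a product that is mediocre on many metrics can rank above one that excels on a few and fails on others.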

Table 3 Overall Ranking for Rainfall Products in Estimating Weekly Rainfall Amount

5.3.2 Grid-to-Grid comparison

APHRODITE, which ranked 5th in the point-to-grid comparison, was the best performer in the grid-to-grid comparison. The second-best performer was MSWEP (r = 0.83 (Fig. 5b), R2 = 0.7, ME = 0.4, MSE = 8.67, RMSE = 2.94, NSE = 0.64 and KGE = 0.75), followed by MERRA2 (Table S7).

In the overall comparison (Table 3b), MSWEP can be considered the best performer at weekly time steps, closely followed by MERRA2. Both datasets performed well in the point-to-grid and grid-to-grid comparisons.

5.4 Continuous statistical analysis at monthly time steps

5.4.1 Point-to-Grid comparison

Unlike at daily and weekly time steps, most of the rainfall products exhibited very good performance at monthly time steps. The correlation ranged from 0.79 (TERRA) to 0.97 (MSWEP), showing strong agreement with the measured data (Fig. 6a). The NSE of all the products was above 0.6, indicating closeness to the measured values, except for AIMERG (NSE = 0.32) and CFSR (NSE = 0.57) (Table S8).

Fig. 6 Taylor diagram at monthly time steps

The top five performers in the overall ranking were MSWEP, MERRA2, IMERG, GPCC and REGEN (Table 4a). MSWEP and MERRA2, the best performers at daily and weekly time steps, maintained their strong performance at monthly time steps as well. IMERG, however, jumped from 7th and 10th at weekly and daily time steps, respectively, to 3rd at monthly time steps.

Table 4 Overall Ranking for Rainfall Products in Estimating Monthly Rainfall Amount

5.4.2 Grid-to-Grid Comparison

In the grid-to-grid comparison, the evaluation results differed strikingly between the two reference datasets (Table S9, Fig. 6b). The top five performers were APHRODITE, CPCUPP, MSWEP, MERRA2 and REGEN (Table 4b). It is interesting to note that CPCUPP, a low performer at shorter time scales (ranked 12th at weekly and 16th at daily time steps), ranked 2nd at monthly time steps. MSWEP and MERRA2 remained among the top five products.

Considering both the reference datasets, the best overall rainfall product at monthly time steps was again MSWEP followed by MERRA2 (Table 4).

5.5 Continuous statistical analysis at annual time steps

5.5.1 Point-to-Grid Comparison

In the overall ranking (Table 5a), the top five rainfall products at annual time steps were IMERG, MSWEP, MERRA2, CHIRPS and REGEN. IMERG outperformed the other products with the lowest ME (-7.8), MAE (86), RMSE (121.1), RSR (0.38) and PBIAS (-0.8) and the highest NSE (0.85), KGE (0.78), VE (0.92), MD (0.78) and D (0.95) (Table S10). Interestingly, MSWEP and MERRA2 again maintained their good performance at annual time steps (Fig. 7a), while CHIRPS and REGEN improved their ranking to the top five.

Table 5 Overall Ranking for Rainfall Products in Estimating Annual Rainfall Amount
Fig. 7 Taylor diagram at annual time steps

Twelve rainfall products underestimated rainfall, with negative PBIAS and ME (Table S10). IMD-GRID showed the largest underestimation of the measured rainfall, with a PBIAS of -23.4 and an ME of -235.4. In contrast, eleven rainfall products overestimated rainfall, with positive PBIAS and ME. AIMERG was the worst performer, with the highest PBIAS (55) and ME (573.3). Furthermore, the worst performers in terms of NSE were AIMERG (-2.11), CFSR (-0.34) and IMD-GRID (-0.16).

5.5.2 Grid-to-Grid Comparison

IMD-GRID serves as the reference dataset in the grid-to-grid comparison, although it did not perform well in the point-to-grid comparison, where it was in fact the second-worst performer. This was reflected in the grid-to-grid comparison: the worst performers in the point-to-grid comparison improved their ranking simply because the reference dataset was changed to IMD-GRID (Table S11, Fig. 7b).

The top five performers in the overall ranking were CPCUPP, CMORPH, GSMAP, MSWEP, and REGEN. MERRA2 slipped to number 6 (Table 5b). However, both MSWEP and MERRA2 still maintained good performance with low composite scores.

5.6 Rain detection ability assessment

5.6.1 Point-to-Grid Comparison

IMD-GRID demonstrated the lowest FAR (0.32), followed by GPCC (0.38) and AIMERG (0.40) (Table 6). In contrast, the PERSIANN products performed poorly, with the highest FAR values for PDIR (0.56), PCDS (0.55) and PCDR (0.54). IMD-GRID thus emerged as a compelling option, indicating its superior ability to minimize false rainfall detections. In terms of POD, ERA5AG achieved the highest value (0.90), followed closely by ERA5 (0.88) and ERAL (0.88). TRMM, CHIRPS, GSMAP and CMORPH had the lowest POD values (0.51, 0.53, 0.59 and 0.59, respectively), indicating that these products have a comparatively lower ability to correctly detect rainfall events. CSI, which is a function of POD and FAR and provides a more balanced score, showed that AIMERG performed best with the highest score of 0.57, followed by MERRA2 (0.55), MSWEP (0.52) and GPCC (0.52), whereas TRMM and CHIRPS performed worst with a score of 0.38.
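The relationship between these three scores can be sketched from a 2x2 contingency table of hits, misses and false alarms. The example below uses the standard definitions of POD, FAR and CSI; the 0.1 mm/day rain/no-rain threshold is an assumption for illustration, and the sample series is hypothetical.

```python
def contingency(obs, est, threshold=0.1):
    """Count hits, misses, false alarms and correct negatives.

    Assumption: a day counts as rainy when rainfall >= threshold (mm/day).
    """
    hits = misses = false_alarms = correct_negatives = 0
    for o, e in zip(obs, est):
        obs_wet, est_wet = o >= threshold, e >= threshold
        if obs_wet and est_wet:
            hits += 1
        elif obs_wet:
            misses += 1
        elif est_wet:
            false_alarms += 1
        else:
            correct_negatives += 1
    return hits, misses, false_alarms, correct_negatives


def pod(hits, misses):
    """Probability of detection: share of observed rain days that were detected."""
    return hits / (hits + misses)


def far(hits, false_alarms):
    """False alarm ratio: share of estimated rain days with no observed rain."""
    return false_alarms / (hits + false_alarms)


def csi(hits, misses, false_alarms):
    """Critical success index: hits over all observed or estimated rain days."""
    return hits / (hits + misses + false_alarms)


# Hypothetical daily rainfall (mm/day) at a gauge and from a gridded product
obs = [0.0, 5.0, 0.0, 12.0, 0.0, 3.0, 0.0, 0.0]
est = [0.0, 4.0, 2.0, 0.0, 0.0, 6.0, 1.0, 0.0]
h, m, f, c = contingency(obs, est)
print(pod(h, m), far(h, f), csi(h, m, f))
```

With these definitions, 1/CSI = 1/POD + 1/(1 - FAR) - 1, which is why a product needs both a high POD and a low FAR to reach a high CSI.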

Table 6 Summary of results of the categorical evaluation metrics for point-to-grid comparison

In terms of BIAS, IMERG (-0.01) and TRMM (0.02) had the smallest negative and positive bias, respectively, whereas IMD-GRID (-0.23) and AIMERG (0.55) showed the maximum underestimation and overestimation, respectively, of rainfall less than 0.5 mm. Furthermore, AIMERG and MERRA2 stood out with the highest TSS of 0.72, followed by ERA5AG (0.66), demonstrating the ability of these products to balance correct detections against false alarms. In contrast, TRMM (0.42) exhibited the lowest TSS, followed by CHIRPS (0.43) and CPCUPP (0.47). The highest Accuracy (0.87) was found for AIMERG and IMD-GRID, though in general all rainfall products showed good results except for the PERSIANN products PDIR, PCDS and PCDR, which had the lowest Accuracy of 0.78. Similarly, AIMERG, MERRA2, IMD-GRID, GPCC and MSWEP consistently outperformed the other rainfall products in TSS, HSS and ETS, whereas CHIRPS, CPCUPP, TRMM and PERSIANN performed worst on these metrics. In terms of AUC, PDIR (0.87) was the top performer, followed by AIMERG (0.86) and MERRA2 (0.85). Conversely, CHIRPS (0.71) achieved the lowest AUC, followed by PCCS (0.72) and CPCUPP (0.73). Note that the PERSIANN products, which were low performers in the continuous statistics, performed well in terms of rainfall detection ability.

Overall, AIMERG, followed by MERRA2, ERA5 and ERA5AG, consistently demonstrated superior performance across the majority of categorical metrics, confirming their ability to accurately detect rainfall events in the study area.

5.6.2 Grid-to-Grid Comparison

The grid-to-grid comparison (Table 7) showed that AIMERG (0.40) had the lowest FAR, followed by GPCC (0.42), MSWEP (0.43) and CMORPH (0.43). On the other hand, the PERSIANN products again performed relatively poorly, with higher FAR values. In terms of POD, ERA5AG had the highest value (0.89), closely followed by ERA5 and ERAL (both 0.88). Conversely, TRMM, CHIRPS, GSMAP and CMORPH had the lowest POD values (0.52, 0.55, 0.60 and 0.61, respectively), which shows that these products have a relatively lower ability to accurately detect rainfall events. The balanced score CSI revealed that AIMERG (0.54) performed best, followed by MERRA2 (0.49). Considering BIAS, all the rainfall products showed a positive bias. AIMERG exhibited the highest positive bias (0.97), followed by P-CDR (0.49), CFSR (0.47) and PCCS (0.46), while REGEN, GSMAP and CPCUPP showed the least positive bias (0.16). The best performer in the metrics TSS, HSS and ETS was AIMERG (0.7, 0.62 and 0.45, respectively), followed by MERRA2 (0.67, 0.55 and 0.38), whereas the lowest performers were again TRMM and CHIRPS. Accuracy ranged from 0.87 (best score, for AIMERG) to 0.77 (lowest score, for the PERSIANN products PDIR, PCDS and PCDR). Additionally, AIMERG had the best AUC score of 0.85, followed by MERRA2 (0.84), while TRMM (0.71) and CHIRPS (0.72) had the lowest scores.

Table 7 Summary of results of the categorical evaluation metrics for grid-to-grid comparison

Similar to the results of the point-to-grid comparison, AIMERG, followed by MERRA2, ERA5 and ERA5AG, consistently showed superior performance in several categorical metrics in accurately detecting rainfall events in the study area.

5.7 Rainfall Intensity Assessment: Probability distribution function (PDF) of rainfall intensity

We computed the PDF to investigate the frequency distribution of daily rainfall in the different rainfall products (Fig. 8). The x axis represents rainfall intensity in millimeters per day, while the y axis represents probability density in percent. Each rainfall product is distinguished by a unique color, whereas the dashed black line represents the observed rainfall data.

Fig. 8 Probability distribution function (PDF) representing different rainfall intensities for 20 rainfall products with reference observed rainfall data

The number of rainfall events (≥ 0.1 mm/day) accounted only for 22.2% of all rainfall observations from 2000 to 2015. Moreover, there was no rainfall recorded in either the trace (0.01 to 0.04 mm/day) or the extremely heavy rainfall (≥ 244.4 mm/day) categories.
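The frequency comparison by intensity category can be sketched as the share of days falling into each class, computed separately for the observations and for a product, and then differenced. In the sketch below the class edges are placeholders chosen for illustration only (apart from the 0.1 mm/day rain threshold mentioned above); they are not the classification used in the study, and the sample series are hypothetical.

```python
from bisect import bisect_right


def category_frequencies(rain_mm, edges, labels):
    """Return the percentage of days in each intensity class.

    edges:  ascending lower bounds of every class after the first
    labels: one label per class, so len(labels) == len(edges) + 1
    A value equal to an edge falls into the class that starts at that edge.
    """
    counts = [0] * len(labels)
    for value in rain_mm:
        counts[bisect_right(edges, value)] += 1
    total = len(rain_mm)
    return {label: 100.0 * n / total for label, n in zip(labels, counts)}


# Placeholder class edges in mm/day (assumed for illustration)
edges = [0.1, 2.5, 15.0]
labels = ["no rain", "very light", "light", "heavier"]

# Hypothetical ten-day series for a gauge and a gridded product
obs = [0.0, 0.0, 0.0, 0.0, 1.0, 0.1, 3.0, 20.0, 0.0, 0.0]
est = [0.0, 0.0, 0.5, 0.0, 1.0, 0.3, 3.0, 20.0, 0.0, 0.2]

obs_freq = category_frequencies(obs, edges, labels)
est_freq = category_frequencies(est, edges, labels)
deviation = {label: est_freq[label] - obs_freq[label] for label in labels}
print(deviation)
```

In this invented case the product reports too few dry days and too many very light rain days, the same pattern of deviation that is expressed in percentage points in the comparison that follows.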

Overall, all rainfall products exhibited rainfall frequency characteristics similar to those of the observed rain gauge data (Patna airport station reference data), with a certain degree of deviation (error). Both CHIRPS and TRMM tended to overestimate the occurrence of no rain, by 1.99% and 3.06%, respectively, and to underestimate the occurrence of very light rain, by 6.51% and 4.2%, respectively. The remaining rainfall products slightly underestimated the occurrence of no rain and overestimated the occurrence of very light rain. In the no rain category, the products with the smallest underestimation were GSMAP (2.8%), PCCS (3.2%) and IMD-GRID (3.5%), while the products with the smallest overestimation in the very light rainfall category were PCCS (1.04%), GSMAP (2.4%) and CMORPH (3.7%). In the light rainfall category, all the rainfall products overestimated the observed frequency, with IMD-GRID (0.11%), PCCS (0.32%) and GSMAP (0.39%) overestimating the least. Similarly, in the moderate rainfall category, all the rainfall products overestimated the observed frequency, with IMD-GRID and CMORPH overestimating the least (0.34%), followed by GSMAP (0.71%). AIMERG and PCCS overestimated rather heavy rainfall by 0.46% and 0.34%, respectively, while the rest of the rainfall products underestimated it, with IMERG (0.21%) and GSMAP (0.23%) showing the smallest underestimation. In the heavy and very heavy rainfall categories, AIMERG showed the smallest overestimation compared to the observed rainfall at Patna airport station, with deviations of only 0.43% and 0.16%, respectively. The products with the smallest underestimation of heavy rainfall were PCCS (0.14%), CMORPH (0.16%) and IMERG (0.18%). In the very heavy rainfall category, CMORPH, CFSR and TRMM showed no deviation from the observed data.

Overall, IMD-GRID, GSMAP, PCCS, AIMERG, IMERG and CMORPH performed well in estimating rainfall frequency across the rainfall intensity categories.

South Asia is the most prominent monsoon region globally, and precise precipitation datasets are crucial for understanding the monsoon phenomenon. Several gridded rainfall products have been reported to show greater inter-annual variability in monsoon season rainfall than the actual observations (Sun et al. 2018; Shah and Mishra 2016). The performance of rainfall products can differ between geographical locations even when they share the same climatic features, as in the Indian monsoon regions. In general, extreme rainfall frequency is overestimated in TRMM products over India, except in northern India and the Western Ghats (Rana et al. 2015; Shah and Mishra 2016). This is in line with our findings, where TRMM overestimated the rainfall amount at all four temporal resolutions (Tables S1 to S8). Prakash et al. (2014) evaluated the performance of three multi-satellite products in Indian monsoon conditions using the IMD gridded rainfall dataset as a reference and found that TRMM performed better than CMORPH and PERSIANN. In our study, however, TRMM performed comparatively worse than CMORPH and PERSIANN (Table 2). It should be noted that Prakash et al. (2014) compared only three rainfall products, all of which performed poorly in our overall ranking of 23 rainfall products. Chen et al. (2020) found that in a monsoon-climate watershed in China, CMORPH BLD performed best, followed by MSWEP, CMORPH CRT and TRMM, whereas we found that CMORPH performed poorly compared to MSWEP at all temporal scales, reflecting that rainfall products perform differently in different regions. Compared to TMPA estimates, Prakash et al. (2018) recorded a notable improvement in rainfall detection over India with IMERG and GSMaP and stressed that IMERG has the added advantage of a smaller bias in rainfall compared to TMPA and GSMaP. Similarly, we found a smaller bias for IMERG (-0.8) than for GSMaP (-12.2) and TRMM (1.8).

In contrast to most earlier studies, which compared only a few rainfall products, we analyzed a wide range of major rainfall datasets. This enabled us to identify several well-performing rainfall products for monsoon climate, such as MERRA2, MSWEP and AIMERG (an improved version of IMERG specifically developed for the Asian region). Notably, these three rainfall products were not sufficiently considered in previous studies. Limiting the selection of rainfall products for comparison limits the ability to choose the best one.

Previous studies reported different rankings for the performance of rainfall products at different temporal scales. We noted increased performance of rainfall products from daily to weekly and from weekly to monthly resolution, corroborating earlier findings (Wedajo et al. 2021; Baez-Villanueva et al. 2018; Dembélé and Zwart 2016). The improvement with increasing time steps might be due to the counterbalancing of over- and underestimates when rainfall is accumulated from smaller to larger time scales. However, this did not hold true for the annual resolution, where the performance of all the tested rainfall products decreased, consistent with Wedajo et al. (2021) and Dembélé and Zwart (2016). One reason might be the smaller number of rainfall observations at annual time steps in the evaluation process: with so few data points, a single disagreement with the observed data leads to a marked decline in the statistical metrics of a rainfall product.

6 Conclusion

Our findings reveal an explicit scale and time dependence in rainfall product performance for point-to-grid and grid-to-grid comparisons. For instance, in point-to-grid comparisons, MERRA2 performed best at the daily time scale, whereas MSWEP, the 5th best performer at the daily scale, emerged as the top performer at weekly and monthly scales and ranked second at the annual scale after IMERG, whose top annual ranking indicates its suitability for capturing long-term trends. In grid-to-grid comparisons, APHRODITE consistently performed best except at the annual scale, suggesting that it is well matched to gridded reference data. AIMERG, which performed dismally across the time scales in the point-to-grid comparison except for daily (ranked 3rd), performed best in several categorical metrics, showing a superior ability to detect rainfall events accurately but a poor ability to estimate rainfall amounts. In contrast, MERRA2 and MSWEP, the top performers in continuous metrics and in the overall ranking (composite score), also performed well in several categorical metrics, showing their high potential for both accurate estimation of rainfall amounts and detection of rainfall events.

Previous studies that evaluated the performance of rainfall products in monsoon climate were limited in scope, as they considered only a few rainfall products and shorter periods (Wedajo et al. 2021; Prakash et al. 2018; Bajracharya et al. 2010; Chanyatham and Kirtsaeng 2011; Chu et al. 2011). Our study goes further by providing a comprehensive and longitudinal assessment of a diverse range of freely available rainfall products. This allowed us to uncover new insights by considering the performance of more (including the latest) rainfall products and their potential applications. For instance, MERRA2 and MSWEP, the top performers in our case, were not even considered in most previous studies. Our evaluation framework thus helps to select the most suitable rainfall product for a specific region, thereby improving the accuracy and reliability of rainfall estimation, which in turn enhances its potential impact on decision-making and risk management. Although our study focuses on a single location, Patna (a monsoon-dominated urban landscape), this rigorous analysis of rainfall products offers a valuable foundation for understanding their performance in real-world settings, and our approach can be applied to other areas worldwide. Consequently, it paves the way for more informed rainfall product selection in further applications. We recommend that future studies consider more locations with rainfall validation datasets to increase confidence and accuracy in selecting the best rainfall product. Combining several statistics into one composite score seems convincing, but it has its own limitations, as we gave equal weights to all statistical indicators.

Furthermore, because our study is intended to assess the overall performance of rainfall products across the evaluation metrics without considering a single application, assigning weights to metrics and judging which statistical measure is more important than another is not applicable here. Additionally, since no single statistical indicator is perfect, we provided the results for several individual statistical indicators so that readers can choose the measure that best fits their specific area of interest and application.