1 Introduction

Precipitation is among the meteorological phenomena with the greatest impact on weather-dependent human activities (Torres-López et al. 2022), water resources (Anik et al. 2023), and agricultural water scarcity (Liu et al. 2023), among others. Weather forecasts, and precipitation forecasts in particular, are inherently uncertain: the uncertainty stems from the model physics and from its initial and boundary conditions, so every weather forecast carries some degree of uncertainty (Lorenz 1963). For some applications, only forecasts accompanied by an uncertainty estimate are valuable. Although computationally highly demanding, the best method for estimating the reliability of individual forecasts is to perform a set of numerical weather simulations (Scher and Messori 2018). Artificial Intelligence (AI), and specifically Machine Learning (ML), can be used to manage the uncertainty arising from input data; notably, ML demands significant computational power mainly during the training phase, which makes it a potential alternative for computing the uncertainty of weather forecasts.

In this regard, diverse approaches have been implemented. Several works apply ML techniques to forecasting-related fields, particularly wind prediction. Irrgang et al. (2020) predict the associated forecast uncertainty using a supervised learning approach with a recurrent neural network trained and tested on data from 2012 to 2017. Kosovic et al. (2020) aim to measure the uncertainty of wind forecasts obtained through Numerical Weather Prediction (NWP) models. Finally, Wang et al. (2019) estimate the uncertainty associated with temperature, relative humidity, and wind speed at each weather station in a dataset comprising 3-year forecasts from 10 weather stations in the Beijing region (China). Other works, such as (Hafeez et al. 2020; Bogner et al. 2019; Yang et al. 2020), focus on energy-related predictions.

Precipitation forecasting is essential, and some works propose ML techniques to obtain precipitation predictions for specific locations: convolutional neural networks to predict rainfall in a flood-prone area of Iran (Afshari Nia et al. 2023), UltraBoost, Stochastic Gradient Descent, and Cost-Sensitive Forest classifiers for flood prevention in Romania (Costache et al. 2022), and the evaluation of ML models to forecast irrigation water requirements (Mokhtar et al. 2023). Parviz et al. (2023) introduced improved hybrid models combining Support Vector Regression (SVR) and Group Method of Data Handling (GMDH) to represent the nonlinear component of precipitation at two weather stations in humid and semi-arid climates of Iran.

Finally, a variety of ML methods have been successfully applied to weather prediction, both as classification and regression problems; their advantages in this area have been demonstrated by reviewing state-of-the-art ML concepts, their applicability to meteorological data, and their relevant statistical properties (Schultz et al. 2021). In (Castillo-Botón et al. 2022), shallow ML classification and regression algorithms are used to forecast orographic fog on the A-8 motorway in Spain. Some authors even consider whether current numerical weather models and data assimilation systems could be completely replaced by deep learning approaches (Schultz et al. 2021). Deep learning methods based on convolutional neural networks trained on past weather forecasts can also indicate whether predictability differs from usual (Scher and Messori 2018).

This work focuses on precipitation, a meteorological risk for society, since severe rainfall causes flooding and ruins crops. Knowing the precipitation at a specific location in advance makes it possible to mitigate its effects. Precipitation forecasting is crucial but challenging in numerical weather models because it involves multiple physical parameterizations, including longwave and shortwave radiation, convection, mixed-phase microphysics, turbulence, and planetary boundary layer processes (Tapiador et al. 2019). Therefore, an uncertainty index for precipitation forecasts may help decision-makers choose preventive actions. However, such an index cannot be computed in real time directly, since ground-truth data are required to measure the error of a forecast. Consequently, we aim to obtain an uncertainty index for Weather Research and Forecasting (WRF) precipitation forecasts using an ML-based prediction model instead of ground-truth data to estimate the forecast error.

We present an 11-year dataset, from 2008 to 2018, containing WRF forecasts and ground-truth precipitation data for the Ebro basin (Spain), available online under the name “Assessment of uncertainty in weather forecasts”. We use it to train our ML model, which allows us to calculate in real time the uncertainty index associated with precipitation forecasts without needing ground truth. We also compare the results obtained with ground-truth data against those of our model.

The rest of the document is organized as follows: Section 2 describes the experiments carried out and the materials and methods used to evaluate our proposal. Then, results are shown in Section 3 and discussed in Section 4. Finally, Section 5 summarizes the conclusions and future lines of research.

2 Materials and Methods

As mentioned above, we aim to obtain an ML-based uncertainty index for WRF precipitation forecasts. The sections below describe the computation of the uncertainty index, the data gathering, the fitting of the classification models used to calculate the uncertainty in real time, and the evaluation method.

2.1 Computation of Uncertainty

Calculating uncertainty in weather predictions is crucial as it provides a quantitative measure of the reliability and accuracy of the forecasts, enabling decision-makers to assess potential risks and plan more effectively in response to varying climatic events.

The error of precipitation forecasts from a prediction model can be evaluated by applying a cost function such as the Root Mean Square Error (RMSE) (Wang et al. 2019). RMSE, a metric commonly used in regression tasks, measures the error between two datasets: the values predicted by a model and the actual ground-truth data. It is particularly effective in penalizing larger errors. RMSE is calculated as shown in Eq. (1), where m is the number of cells in the grid into which the study area is discretized, \(x_i\) is the precipitation value predicted by the WRF model for cell i, and \(y_i\) is the actual precipitation value in cell i.

$$\begin{aligned} RMSE = \sqrt{\frac{1}{m} \sum ^m_{i = 1} (x_i - y_i)^2} \end{aligned}$$
(1)

Calculating the precipitation error requires ground-truth data collected through appropriate instrumentation. Unfortunately, this makes it impossible to calculate the error in real time, which prevents the error from being used as an uncertainty value. Our research intends to replace the \(y_i\) values, corresponding to the actual precipitation mentioned above, with values provided by a prediction model built from real-world data using supervised ML techniques. Thus, our uncertainty index (\(\mathcal {U}\)) is computed as shown in Eq. (2), where \(p_i\) is the value predicted by our ML-based model in cell i.

$$\begin{aligned} \mathcal {U} = \sqrt{\frac{1}{m} \sum ^m_{i = 1} (x_i - p_i)^2} \end{aligned}$$
(2)
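As a minimal illustration of Eqs. (1) and (2), the NumPy sketch below computes both quantities for a toy grid of four cells; the array names and values are hypothetical stand-ins for the WRF forecast, the ground truth, and the ML-based model output.

```python
import numpy as np

def rmse(x: np.ndarray, y: np.ndarray) -> float:
    """Root Mean Square Error over the m grid cells (Eq. 1)."""
    return float(np.sqrt(np.mean((x - y) ** 2)))

wrf  = np.array([0.0, 1.2, 3.4, 0.0])   # x_i: WRF precipitation forecast (mm)
obs  = np.array([0.0, 0.8, 4.1, 0.2])   # y_i: ground-truth precipitation (mm)
pred = np.array([0.0, 1.0, 3.9, 0.0])   # p_i: ML-based model output

error = rmse(wrf, obs)         # actual forecast error: needs ground truth
uncertainty = rmse(wrf, pred)  # uncertainty index U (Eq. 2): real time
```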

Therefore, one of our objectives is to obtain an ML-based model. Fitting a supervised ML-based model requires gathering a dataset containing features –also known as predictors– and target variables –also known as labels. Predictors come from postprocessing the weather forecasts generated by the WRF model. Labels –“rain” or “no rain” in our research– come from ground-truth data in the study area.

2.2 Study Area

The Ebro Valley, in the northeast of Spain (see Fig. 1), is one of the European regions with the highest number of summer convective storms, which cause intense and heavy rain and hail precipitation (García-Ortega et al. 2014). With an area of 80,000 \(km^2\), it is the largest hydrographic basin in Spain. Besides, it presents significant heterogeneity in its geology, topography, and climate. The average annual rainfall varies from 2100 mm in the Pyrenees to 350 mm in the arid areas, and the elevation ranges from sea level on the Mediterranean coast to 3372 m in the Pyrenees. The topography varies from one basin to another (Samper et al. 2007).

The Automatic Hydrological Information System (SAIH) of the Ebro basin operates a network of 367 meteorological stations collecting ground-truth data such as temperature (\(^oC\)) and rainfall (mm), among others. SAIH Ebro depends on the Ebro Hydrographic Confederation (CHEbro), the body in charge of managing, regulating, and maintaining the waters and irrigation systems of the Ebro basin.

Fig. 1 Hydrographic Demarcation of the Ebro River Basin

2.3 Data Gathering

To construct our ML-based model, it is essential to compile a labeled dataset derived from postprocessing the weather forecasts generated by the WRF model. The WRF forecasts and their postprocessing were carried out on Caléndula, the supercomputer of the Supercomputing Center of Castile and León (SCAYLE) in León (Spain), one of the 17 supercomputers that make up the Spanish Supercomputing Network (RES). The datasetFootnote 1 is available online.

2.3.1 Predictors

In the ML context, predictors are the input data mapped to a label through an empirical relationship. In our research, predictors correspond to meteorological variables computed by the WRF-ARW model (Skamarock et al. 2005), which carries out the complete workflow to assimilate observations into the model and runs in High-Performance Computing (HPC) environments.

The WRF model requires input data to set its initial and boundary conditions (IC and BC). We gather these data from the National Centers for Environmental Prediction (NCEP) operational Global Forecast System (GFS) (National Centers for Environmental Prediction, National Weather Service, NOAA, U.S. Department of Commerce 2000). The NCEP operational GFS analysis and forecast grids are provided on a 0.25° by 0.25° global latitude-longitude grid, including analysis and forecast time steps at a three-hour interval from 0 to 240 hours and a 12-hour interval from 240 to 384 hours.

After getting the input data for the IC and BC, we run the WRF model to produce 24-hour forecasts from January 2008 to December 2018 with two nested domains of \(9\times 9\) and \(3\times 3\) km resolution, as shown in Fig. 2. Several physics schemes are available in the WRF model. Physics parameterization schemes describe sub-grid processes in numerical simulation models, such as the in-cloud microphysical processes responsible for precipitation (Tapiador et al. 2019). According to previous results in the study area (Merino et al. 2022), we selected the Goddard Cumulus Ensemble one-moment bulk microphysics scheme (Tao et al. 1989, 2009). For cumulus, the Grell-Devenyi ensemble scheme (Grell and Dévényi 2002) was used in the outer domain, while convection in the inner domain was explicitly resolved. Other schemes selected were the Dudhia scheme (Dudhia 1989) for shortwave radiation, the Rapid Radiative Transfer Model (Mlawer et al. 1997) for longwave radiation, and the Noah Land Surface Model (Chen and Dudhia 2001).
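For orientation, the sketch below expresses this physics configuration as a Python mapping of WRF namelist.input options. The numeric codes follow WRF-ARW v3 conventions; they are version-dependent assumptions on our part, not values taken from the paper, and should be verified against the user’s guide of the release actually used.

```python
# Hypothetical mapping of the physics schemes named above to WRF-ARW
# namelist.input options. The numeric codes are version-dependent
# assumptions (WRF-ARW v3 conventions) and must be checked against
# the corresponding user's guide.
physics_options = {
    "mp_physics": 7,          # Goddard Cumulus Ensemble microphysics
    "cu_physics": 93,         # Grell-Devenyi ensemble cumulus (outer domain)
    "ra_sw_physics": 1,       # Dudhia shortwave radiation
    "ra_lw_physics": 1,       # RRTM longwave radiation
    "sf_surface_physics": 2,  # Noah Land Surface Model
}
# cu_physics would be 0 for the inner domain, where convection is
# explicitly resolved.
```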

Fig. 2 Forecast domains

The WRF outputs are in NetCDF format –intended to store multi-dimensional scientific data– and gather the values of up to 100 meteorological variables. We extract variables such as temperature and mixing ratio at three pressure levels (500 hPa, 700 hPa, and 850 hPa); the whole list is shown in Table 1. These are our predictors. Since these variables are not cumulative, we extract them at a 3-hour interval from 0 to 21 h. Considering seven variables at three pressure levels eight times daily, we obtain 168 values for each grid point in the study area, each point being identified by its latitude, longitude, and height above sea level. We store our predictors in a separate NetCDF file for postprocessing once the labels are added to the dataset.

Table 1 Predictors
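A minimal sketch of assembling the 168-value feature vector for one grid point (7 variables × 3 pressure levels × 8 three-hourly steps) is shown below. It assumes post-processed NetCDF files whose variable names (e.g. "temperature", "mixing_ratio") and dimension names ("level", "south_north", "west_east") are hypothetical stand-ins for the actual post-processing output.

```python
import numpy as np
import xarray as xr

LEVELS = [500, 700, 850]                      # pressure levels (hPa)
VARIABLES = ["temperature", "mixing_ratio"]   # ...plus the rest of Table 1

# Hypothetical post-processed file holding pressure-level fields.
ds = xr.open_dataset("wrf_postprocessed_20081126.nc")

def features_for_point(j: int, i: int) -> np.ndarray:
    """Stack every variable/level/time-step value for grid cell (j, i)."""
    values = [
        ds[var].sel(level=lev).isel(south_north=j, west_east=i).values
        for var in VARIABLES                  # 7 variables in the full set
        for lev in LEVELS                     # 3 pressure levels
    ]                                         # each entry: 8 three-hourly steps
    return np.concatenate(values)             # 7 x 3 x 8 = 168 features
```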

2.3.2 Labels

Once we gather our predictors, we must add the label –“rain” or “no rain”– to every sample in our dataset. Labels are determined from the ground-truth data collected by the rain gauges of the SAIH Ebro network, ensuring that our classifications accurately reflect the actual precipitation events recorded in the region. As mentioned, SAIH Ebro manages a network of 367 meteorological stations that collect temperature (\(^oC\)) and rainfall (mm) values. Specifically, we gather rainfall values from 2008 to 2018 with their corresponding latitude and longitude coordinates. We apply the “rain” label when the accumulated precipitation exceeds 0 mm and “no rain” otherwise.
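The labelling rule is a simple threshold; a minimal sketch follows, where the DataFrame layout and column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical gauge records: station coordinates plus the accumulated
# precipitation reported for one day.
gauges = pd.DataFrame({
    "lat": [41.65, 42.10, 41.98],
    "lon": [-0.88, 0.52, -1.45],
    "accum_precip_mm": [0.0, 3.7, 0.2],
})

# A sample is labelled "rain" when accumulated precipitation exceeds 0 mm.
gauges["label"] = (gauges["accum_precip_mm"] > 0).map(
    {True: "rain", False: "no rain"}
)
```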

2.3.3 Post-Processing

Finally, some data curation is necessary to obtain the dataset. First, to set the suitable class for each sample, the ground-truth data must be interpolated to the inner domain shown in Fig. 2; we use the interpolation method described in (Merino et al. 2021). Next, the interpolated dataset is filtered using rigorous quality control functions that identify and remove suspect data. Lastly, reconstruction techniques for gap filling (Serrano-Notivoli et al. 2017) prevent missing data from affecting the experiments. These steps ensure that the dataset is not only comprehensive but also maintains high accuracy and reliability for subsequent analysis. For example, Fig. 3a shows the precipitation on November 26th, 2008, from SAIH Ebro ground-truth data, and Fig. 3b shows the WRF precipitation forecast for the same date.

Fig. 3 Precipitation estimate on November 26th, 2008

2.4 Model Fitting

We aim to build a prediction model whose inputs are meteorological variables obtained from WRF forecasts and whose output \(p_i\) is a precipitation presence indicator for a specific grid point i in the study area.

We use Model Evaluator (MoEv), a wrapper for the Scikit-Learn library (Pedregosa et al. 2011), to build our prediction models. MoEv has been successfully used in different research areas, such as the detection of jamming attacks on real-time location systems (Guerrero-Higueras et al. 2018) or malicious-network-traffic detection (Campazas-Vega et al. 2020), among others.

We randomly split the dataset into a training set (67%) for fitting the prediction models and a test set (33%) for assessing their generalization.

Then, we apply 10-fold cross-validation to fit the prediction models. Since we need to predict a class –“rain” or “no rain”–, classification algorithms are more suitable than regression or clustering algorithms. However, since data matters more than algorithms for complex problems (Halevy et al. 2009), we evaluate a range of classification, clustering, and regression algorithms to select the most accurate one for this problem.

Specifically, we evaluate: Adaptive Boosting (AB) (Freund and Schapire 1997), Decision Tree (DT) (Safavian and Landgrebe 1991), DT-based Bagging (DT-B) (Breiman 1996), Linear Discriminant Analysis (LDA) (Balakrishnama and Ganapathiraju 1998), Logistic Regression (LR) (Zhu et al. 1997), Quadratic Discriminant Analysis (QDA) (Hastie et al. 2009), Random Forest (RF) (Breiman 2001), and Stochastic Gradient Descent (SGD) (Bottou 2012). We selected these models for their proven effectiveness in handling the complex, non-linear relationships inherent in meteorological data. Models like Random Forest and Decision Trees are robust to outliers and capable of capturing intricate patterns in data, while Logistic Regression, despite its simplicity, provides a strong baseline for comparison. The diversity of these models, ranging from ensemble methods to linear classifiers, allows for a comprehensive evaluation of different algorithmic approaches to predicting precipitation events, ensuring the selection of the most effective model for our dataset and study objectives.
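Since MoEv wraps Scikit-Learn, the pipeline can be sketched with plain Scikit-Learn calls, as below: a 67/33 random split followed by 10-fold cross-validation over the eight classifiers. The synthetic data stand in for the real predictors and labels; hyperparameters are defaults, not the ones used in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; the real dataset has 19,885,973 samples and 168 features.
X, y = make_classification(n_samples=1000, n_features=168, random_state=42)

models = {
    "AB": AdaBoostClassifier(),
    "DT": DecisionTreeClassifier(),
    "DT-B": BaggingClassifier(),  # bags decision trees by default
    "LDA": LinearDiscriminantAnalysis(),
    "LR": LogisticRegression(max_iter=1000),
    "QDA": QuadraticDiscriminantAnalysis(),
    "RF": RandomForestClassifier(),
    "SGD": SGDClassifier(),
}

# 67/33 random split, then 10-fold cross-validation on the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=10)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```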

2.5 Evaluation

To evaluate our proposal, we first need to assess the performance of our prediction models to identify the most accurate one. Therefore, we calculate well-known Key Performance Indicators (KPIs). First, the models’ performance is measured by their accuracy score, as shown in Eq. (3), where \(T_P\) is the number of true positives, \(T_N\) the number of true negatives, \(F_P\) the number of false positives, and \(F_N\) the number of false negatives.

$$\begin{aligned} Accuracy = \frac{T_P+T_N}{T_P+F_P+T_N+F_N} \end{aligned}$$
(3)

Besides, we consider the following KPIs obtained through the confusion matrix: Precision (\(\mathcal {P}\)), Recall (\(\mathcal {R}\)), and F\(_1\)-score (\(\mathcal {F}_1\)). \(\mathcal {P}\), \(\mathcal {R}\), and \(\mathcal {F}_1\) (Sokolova and Lapalme 2009) are computed as shown in Eqs. (4), (5), and (6). The \(\mathcal {P}\) score is the ratio between the number of correct positive predictions and the total number of positive predictions. The \(\mathcal {R}\) score is the rate of positive cases correctly identified by the algorithm. The \(\mathcal {F}_1\) score relates to both \(\mathcal {P}\) and \(\mathcal {R}\), since it is their harmonic mean (Hossin and Sulaiman 2015).

$$\begin{aligned} \mathcal {P} = \frac{T_P}{T_P+F_P} \end{aligned}$$
(4)
$$\begin{aligned} \mathcal {R} = \frac{T_P}{T_P+F_N} \end{aligned}$$
(5)
$$\begin{aligned} \mathcal {F}_1 = 2 \frac{\mathcal {P} \times \mathcal {R}}{\mathcal {P} + \mathcal {R}} \end{aligned}$$
(6)

The chosen KPIs –Accuracy, \(\mathcal {P}\), \(\mathcal {R}\), and \(\mathcal {F}_1\)– are particularly relevant for our study as they provide a holistic assessment of the model’s performance. Accuracy measures overall correctness, while precision and recall evaluate the model’s ability to correctly predict rain events and avoid false alarms. We require a high \(\mathcal {R}\) score, but only if the \(\mathcal {P}\) score is also high enough to ensure that there are not too many false positives. Thus, the F\(_1\)-score is crucial, as it balances precision and recall, which is especially important in imbalanced datasets.
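Continuing the sketch from Section 2.4, these KPIs map directly onto Scikit-Learn metrics, as shown below, treating “rain” (class 1) as the positive class.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Fit one of the candidate models from the previous sketch and compute
# the KPIs on the held-out test set.
model = models["RF"].fit(X_train, y_train)
y_pred = model.predict(X_test)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
kpis = {
    "accuracy": accuracy_score(y_test, y_pred),    # Eq. (3)
    "precision": precision_score(y_test, y_pred),  # Eq. (4)
    "recall": recall_score(y_test, y_pred),        # Eq. (5)
    "f1": f1_score(y_test, y_pred),                # Eq. (6)
}
```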

Moreover, to evaluate the tradeoff between the \(T_P\) and \(F_P\) rates, we compute the Receiver Operating Characteristic (ROC) curve. Besides, we calculate the Area Under the Curve (AUC), which assesses the model’s ability to distinguish between the two classes (“rain” and “no rain”), ensuring that our model reliably predicts precipitation events, which is critical for effective weather forecasting.
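A minimal sketch of the ROC curve and AUC computation follows, again assuming a fitted classifier that exposes predict_proba (as RF and most of the evaluated Scikit-Learn models do).

```python
from sklearn.metrics import auc, roc_curve

# Class-membership probabilities for the positive ("rain") class.
proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, proba)
roc_auc = auc(fpr, tpr)
```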

Finally, once we have selected the most accurate prediction model, we can compute \(\mathcal {U}\) as described in Section 2.1. Then, we carry out a statistical comparison between \(\mathcal {U}\) and the actual RMSE for each WRF forecast from January 2008 to December 2018. This analysis allows us to measure the performance of \(\mathcal {U}\).
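The comparison can be sketched as below, reusing the rmse helper from Section 2.1; daily_forecasts is a hypothetical iterable yielding, per forecast day, the WRF precipitation grid, the ground-truth grid, and the ML-based model output.

```python
import pandas as pd

# Per-day RMSE (against ground truth) versus the index U (against the
# ML-based model output); `rmse` is the helper from Section 2.1 and
# `daily_forecasts` is a hypothetical stand-in for the 2008-2018 runs.
records = [
    {"RMSE": rmse(wrf, obs), "U": rmse(wrf, pred)}
    for wrf, obs, pred in daily_forecasts
]
stats = pd.DataFrame(records).describe()  # the statistics behind Fig. 6
print(stats)
```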

3 Results

The data gathering proposed in Section 2.3 yields a 39 GB dataset of weather forecasts obtained by the WRF model from January 2008 to December 2018 in the study area. This dataset allows for fitting the prediction model we require to compute our uncertainty index \(\mathcal {U}\) (see Section 2.1). The dataset contains 19,885,973 samples, each corresponding to a grid point –identified by its latitude, longitude, and height– on a specific date. Each sample has the 168 features described in Section 2.3.1 and the label described in Section 2.3.2. The dataset is available online.

Fig. 4 Confusion matrices of the AB-, DT-, DT-B-, LDA-, LR-, QDA-, RF-, and SGD-based prediction models

Figure 4 displays the confusion matrices for the proposed prediction models, which are essential for computing the accuracy, precision, recall, and F\(_1\)-score values listed in Table 2. Besides, Fig. 5 shows the evaluated models’ ROC curves and AUC values.

Fig. 5 ROC and AUC

Table 2 Accuracy, precision (\(\mathcal {P}\)), recall (\(\mathcal {R}\)), and F\(_1\)-score (\(\mathcal {F}_1\))

After computing the proposed KPIs, we can select the best prediction model to calculate our uncertainty index \(\mathcal {U}\). Next, to evaluate \(\mathcal {U}\), we compare it with the RMSE on every weather forecast from January 2008 to December 2018. Figure 6 presents the descriptive statistics obtained by computing the RMSE and \(\mathcal {U}\) using different prediction models.

4 Discussion

The primary goal of this study was to establish the reliability of weather forecasts without the need for real-time data. This was achieved by focusing on three main aspects: first, developing an ML-based model as an alternative to ground truth data; then, creating a comprehensive and large dataset for training the model; and finally, comparing the results from our model with those obtained using ground truth data to assess our model’s reliability.

To address the challenge of evaluating the uncertainty of weather forecasts obtained by the WRF model without real-time ground truth data, we sought to find an ML-based prediction model as a substitute. The reliability of this model was paramount, hence the computation of \(\mathcal {P}\), \(\mathcal {R}\), \(\mathcal {F}_1\), ROC curves, and AUC to evaluate and select the most effective model. Previous studies have employed methods such as convolutional neural networks (Afshari Nia et al. 2023), UltraBoost and Cost-Sensitive Forest classifiers (Costache et al. 2022), and tree-based algorithms (Yang et al. 2020), often applied in different work areas or with configurations unsuitable for WRF predictions.

In terms of data gathering, unlike other works (Wang et al. 2019; Ahmad et al. 2016) that used datasets not specific to our selected area or limited in size, our approach involved creating a tailored dataset with predictors specifically chosen for our study area, as detailed in Table 1. Each sample in this dataset was labelled as “rain” or “no rain” using ground-truth data from SAIH Ebro, as explained in Section 2.3.2.

Fig. 6 RMSE for Ground Truth data and \(\mathcal {U}\) with the best four prediction models

The confusion matrices for each model studied, shown in Fig. 4, reveal that the number of true positives is above 4 million for most models. Notably, the three classifiers with the highest accuracy –RF, DT-B, and DT– exhibit similar true-negative rates and low failure rates (from less than 10% to 18%). The gap between the best and worst classifiers in our study is evident in their false-positive and false-negative rates.

Table 2 presents the accuracy scores and the \(\mathcal {P}\), \(\mathcal {R}\), and \(\mathcal {F}_1\) scores for all tested classification models. While classifiers like LDA, LR, or SGD obtain satisfactory \(\mathcal {R}\) scores, their \(\mathcal {P}\) scores barely exceed 0.70. In contrast, models such as DT or DT-B show strong performance in both KPIs. The RF classifier emerges as the most effective, with the highest \(\mathcal {P}\), \(\mathcal {R}\), and \(\mathcal {F}_1\) scores, making it the best classifier according to our results.

The evaluated models’ ROC curves and AUC values, shown in Fig. 5, further confirm the superior performance of the RF and DT-B classifiers. Both DT and LDA models exhibit an AUC of 0.88, demonstrating their effectiveness in class prediction. The ROC curve plots the true positive rate (TPR, y-axis) against the false positive rate (FPR, x-axis), providing a clear visual representation of model performance.

Finally, Fig. 6 displays a box-plot representation of the RMSE (Eq. 1) calculated against the ground truth and of the \(\mathcal {U}\) uncertainty index computed for the four best prediction models. The models with an accuracy higher than 80% –particularly DT, DT-B, and RF– present distributions akin to the ground truth, with LDA as the fourth most accurate model. The results confirm that the models with the highest accuracy rates exhibit a \(\mathcal {U}\) index similar to the RMSE computed with ground truth, verifying the effectiveness of our approach.

5 Conclusions

This study underscores the vital role of accurate weather forecasting in sectors like agriculture and water resource management. To enhance forecast precision, we utilized classification models and compiled an extensive dataset of precipitation data. The dataset, exceeding 40 GB in CSV format and divided into two 20 GB subsets, is publicly available under the Creative Commons Attribution 4.0 International license, providing a valuable resource for the scientific community.

Our methodology involved constructing a dataset from historical forecasts and ground truth data from the SAIH Ebro pluviometer network. This comprehensive dataset, encompassing precipitation data over 4,017 days and 9,594 points from domain 2 (refer to Fig. 2), was meticulously curated to remove missing values, resulting in 19,885,973 viable samples. We developed an RMSE-based index utilizing an ML-based prediction model instead of traditional ground-truth data to quantify forecast uncertainty. Various classification models were constructed using MoEv, a versatile wrapper for the Scikit-Learn library.

The experimentation demonstrated the efficacy of supervised learning algorithms in predicting the uncertainty of weather forecasts, fulfilling the primary goal of this research. Among the tested models, the RF classifier emerged as the most proficient in detecting precipitation, as evidenced by multiple KPIs. Renowned for its high generalization capability, RF’s ensemble approach, which integrates numerous decision trees with probability thresholds, proved well suited to handling extensive data and many variables. Moreover, the decision-tree-based algorithms, namely DT and DT-B, exhibited superior performance compared to the other evaluated models.

A notable breakthrough of this research is the assembly of a comprehensive dataset for the Ebro River basin region, spanning 11 years of WRF model forecasts. This dataset effectively supports the construction of models to estimate the uncertainty associated with a prediction. Additionally, we introduced a method to assess forecast uncertainty by calculating an uncertainty index \(\mathcal {U}\). Our findings reveal that \(\mathcal {U}\) closely aligns with values derived from ground truth, bolstering confidence in the forecast’s reliability.

Looking beyond precipitation, future research might explore ML-based models’ applicability to other meteorological variables, such as wind, hail, or snow, which significantly influence water resource variability and quality. Furthermore, contrasting our proposed models’ performance and computational efficiency against neural network-based models could offer valuable insights into enhancing the reliability and speed of meteorological predictions.