Machine learning models to complete rainfall time series databases affected by missing or anomalous data

In recent years, artificial intelligence in geosciences is spreading more and more, thanks to the availability of a large amount of data. In particular, the development of automatic raingauges networks allows to get rainfall data and makes these techniques effective, even if the performance of artificial intelligence models is a consequence of the coherency and quality of the input data. In this work, we intended to provide machine learning models capable of predicting rainfall data starting from the values of the nearest raingauges at one historic time point. Moreover, we investigated the influence of the anomalous input data on the prediction of rainfall data. We pursued these goals by applying machine learning models based on Linear Regression, LSTM and CNN architectures to several raingauges in Tuscany (central Italy). More than 75% of the cases show an R2 higher than 0.65 and a MAE lower than 4 mm. As expected, we emphasized a strong influence of the input data on the prediction capacity of the models. We quantified the model inaccuracy using the Pearson's correlation. Measurement anomalies in time series cause major errors in deep learning models. These anomalous data may be due to several factors such as temporary malfunctions of raingauges or weather conditions. We showed that, in both cases, the data-driven model features could highlight these situations, allowing a better management of the raingauges network and rainfall databases.


Introduction
Climate change is one of the most relevant issues for humanity in the Anthropocene era (Malhi et al. 2020).The average temperature on the mainland in the years 2006-2015 was 1.53 °C higher than that of the years 1850-1900(IPCC 2019)).It is now well known that higher temperature is causing severe changes in precipitation regimes as well, with increasingly extreme events (Hardwick Jones et al. 2010;Myhre et al. 2019;Tramblay et al. 2020;Luppichini et al. 2023b).It is also causing an alteration of the beginning and end of growing seasons, causing a general decrease in the regional crop yields and freshwater availability (Minoli et al. 2022).The biodiversity is further stressed and tree mortality increases (IPCC 2019).Understanding and modelling the past, present, and future climate are of fundamental importance to the issue of climate change and variability.Effective climate models represent one of our primary tools for projecting and adapting to climate change (Schmidt 2011).
Moreover, climate change has direct repercussion on the hydrogeological systems and groundwater resources, and the hydrogeological models are consequently part of the climate models (Amanambu et al. 2020;Li et al. 2022).Rainfall data, and precipitation in general, their variability, intensity, and duration, have paramount importance for the hydrogeological models (Sattari et al. 2017).Independently from the used model, several problems can still affect the input rainfall datasets.These issues often concern missing or incorrect information that can lead models to misleading results.It is worth noting that the geographical distribution of raingauges is generally not uniform (e.g., for some areas there is a deficiency or lack of raingauges).Furthermore, the completeness of the time series is not always guaranteed due to, for example, non-continuous operation of raingauges during the monitoring period (Lebay and Le 2020).
Many physically based methods simplify the natural system features to predict its behaviour (Antonetti and Zappa 2018).However, the natural systems are inherently heterogeneous (Marçais and de Dreuzy 2017) and the physically based methods may show inherent limitations in reproducing natural phenomena.In recent years, the use of artificial intelligence (AI) and graphical processing units (GPUs) have enabled remarkable advances in machine learning (and especially in deep learning) applications such as techniques based on multilayer artificial neural networks (ANNs).Deep learning models have been successfully applied in many forecasting situations, including time series forecasting (Zheng et al. 2019;Yi et al. 2019;Fawaz et al. 2020;Nigro et al. 2022).Time series typically have chaotic and noisy problems and deep learning approaches are the most effective techniques for solving them (Livieris et al. 2020).Several authors use the rainfall dataset to create deep learning models available to replicate run-off processes (van Loon and Williams 1976; Marçais and de Dreuzy 2017; Kratzert et al. 2018;Boulmaiz et al. 2020;Sit et al. 2020;Tien Bui et al. 2020;Chattopadhyay et al. 2020;Luppichini et al. 2022aLuppichini et al. , 2023a)).Long short-term memory (LSTM) and convolutional neural networks (CNNs) are two of the most popular, efficient, and used deep learning techniques (Zheng et al. 2019;Yi et al. 2019;Fawaz et al. 2020).In the last period, some works combined LSTM and CNN models for time series prediction (Kimura et al. 2019;Baek et al. 2020;Van et al. 2020;Xu et al. 2020).The benefits of the combined CNN-LSTM models are a consequence of the characteristic of LSTM of acquiring efficiently the information of sequence patterns, thanks to their peculiar architecture.The CNN layers filter out the noise in the input data to extract the most significant features needed for the final prediction model.Furthermore, standard CNN can identify spatial autocorrelation between data but is usually not suitable for a correct analysis of a complex temporal dependence over long times (Bengio et al. 2013;Livieris et al. 2020).Several works used deep learning models based on the LSTM networks to create run-off simulations (Kratzert et al. 2018;Le et al. 2019;Boulmaiz et al. 2020;Li et al. 2020;Liu et al. 2020;Nguyen and Bae 2020;Hu et al. 2020), whereas others based on CNN (Li et al. 2018;Huang et al. 2020;Kim and Song 2020;Hussain et al. 2020) or a combination of both (CNN-LSTM) (Kimura et al. 2019;Baek et al. 2020; Van et al. 2020;Xu et al. 2020).The performance of encoder-decoder LSTM layers (LSTM-ED) is great with sequential data like a time series.This architecture consists of two blocks: one to read the input sequence and encode it into a fixed-length vector, and a second one to decode the fixed-length vector and transmit the intended sequence (Sutskever et al. 2014).
Among several applications in hydrological modelling, deep learning models are also used for several additional applications, such as reconstructing missing data and predicting rainfall data (Gers et al. 2001).In these models, the input data must obviously be of high quantity and good quality.Furthermore, machine learning models are used to apply statistical regionalization procedures, which constitute a set of methodologies used to divide a geographical area into statistically homogeneous regions.The main aim of statistical regionalization is to simplify data understanding and analysis, allowing analysts to attain a more detailed and meaningful insight into local dynamics.Among the most common methodologies are cluster analysis, which groups similar geographic units based on relevant variables, and principal component analysis, which identifies common patterns of variation among the units (Yin et al. 2016;Alem et al. 2019;De Luca and Napolitano 2023).
This work intends to use machine learning models to predict rainfall data taking advantage of a network of sensors.The models recreate precipitation time series by using data from nearby raingauges as inputs.The training data lacks temporal information, but each record is referred to a specific time.This allows the missing data to be entered into a rainfall database, allowing to complete time series for applying several types of study requiring the time series continuity (e.g., statistical methods, trend analysis, etc.).We also wanted to analyse the errors of the models investigating the role of anomalous data that can influence the performance of deep learning models.Indeed, all meteorological databases can have anomalous data caused by rare natural phenomena or anthropic factors (e.g., malfunctions of the sensor network).Understanding the answer of the machine learning models to the presence of these data is a key point for future applications of AI techniques in hydrological and meteorological studies.
We applied three machine learning models: the first one is a linear regression (LR), whereas the second and the third ones are based on CNN and LSTM.The first architecture of the deep learning models relies on a combination of CNN and LSTM layers (CNN-LSTM), whereas the second one relies on ED-LSTM.The dataset used is derived from 349 raingauges located in Tuscany (central Italy; Fig. 1), characterized by an extensive monitoring network and a wide variability of the mean annual precipitation (MAP), which is influenced by the morphology of the territory (Cantù 1977;Rapetti and Vittorini 1994;Fratianni and Acquaotta 2017).Tuscany is indeed very heterogeneous from a morphological and a geological point of view, characterized by mountain ranges, extensive hilly areas, and some relatively large plains (Carmignani et al. 2013;Baroni et al. 2015).In summary, the study area allows to apply the methodology and the investigations in an area characterized by a great climate variability and with a great number of raingauges.The manuscript is composed by the following paragraphs: material and methods, where we explain the methodology and the data used; results, where we show the products of this work; and finally, discussion and conclusions, where we analyse the results, and we propose the main consequences of this work.

Database and data input pre-processing
The dataset used is provided by the Tuscany Region Hydrologic Service (SIR) and contains data acquired from several meteorological stations (https:// www.sir.tosca na.it/ consi stenza-rete).We collected the daily rainfall data by developing an automated download procedure through codes written in Python and HTTP protocol.The database derived from this procedure has also been used in different studies resulting reliable (Bini et al. 2021;Luppichini et al. 2021Luppichini et al. , 2022aLuppichini et al. , b, 2023b)).
The monitoring activity in Tuscany started in 1910 and the entire rainfall dataset is today composed of 1103 time series.The number of raingauges increased from around 100 in the early 1900s to 350 in the 1940s, when the war slowed this growth.From the post-war period until the early 2000s the number of active raingauges was about 300 per year.In the last 20 years the network reached a peak of about 400 raingauges distributed on a region of almost 23,000 km 2 .Each raingauge obviously has a different period of activity and some data may be missing within the time series.The territorial authority assigned a specific unique code to each sensor and when they move or change a sensor, they assign a different unique code.For this reason, each time series is assigned to a specific geographical condition and a specific sensor.The rainfall data used in this study are daily data, referring from 09:00 The validated data is a subset derived from processing and checks that allow to remove any sampling errors and reduce the presence of inconsistency in the dataset.The non-validated data are raw measurements that have not yet been checked for integrity and correctness.In this work, we chose to use only the validated data to minimize errors.Some raingauges have time series with an insufficient amount of data available for the creation of a deep learning model.From tests carried out, we decided to use only raingauges which provide time series of at least six years.Each deep learning model predicts the missing data of the output raingauge using three (an arbitrary number) input raingauges.For each output raingauge, we chose the input raingauges based on the geographical distance and the difference in elevation between the sensors.The maximum distance and the maximum difference in altitude considered among the output raingauge and input raingauges was 10 km and 100 m, respectively.When more than three selected input raingauges were present, we did a manual screening choosing the best combination of input raingauges, representing a reasonable compromise between data completeness and the extensiveness of the dataset.
After this procedure, we selected 349 output raingauges distributed in the study area.Figure 1 shows how the density of the stations is higher in northwestern Tuscany than in southeastern one, due to the greater variability of rainfall, which needs a more effective monitoring network.
The mathematical expression of the models, representative of all the investigated raingauges, can be defined as: where R is the predicted output raingauge rainfall at time t; R1 t ,R2 t and R3 t are the rainfall values of the three input rain- gauges at time t .An example of the input dataset is shown in Table 1.

Model development
To accomplish the deep learning models of this study, we mainly used the open-source framework Tensorflow (Abadi et al. 2015) and the libraries Numpy, Pandas, Scikit-Learn, and Keras (Chollet 2015) in Python language.The LR model is developed using the scikit-learn framework.We specifically selected two model architectures.The first one is composed of two CNNs layers of 64 and 128 filters (with a kernel size of 2 and stride length of 1), respectively, followed by a Max Pooling layer with size 1, an LSTM layer of 200 units, a dense layer of 50 neurons, and an output layer of one neuron (Fig. 2A).Usually, CNNs layers precede a pooling layer, which helps to reduce the size of the information while keeping the information unblemished.One pooling technique often used in CNN design is the Max Pooling (Zhou and Chellappa 1988).The second selected architecture is an encoder-decoder LSTM, with two LSTM nodes.Both the encoder and the decoder consist of a pair of sequence layers (LSTM) of 32 and 16 units followed by a repeat vector node for the encoder and 16 and 32 units for the decoder followed in turn by a time-distributed dense node (Fig. 2B).Each first LSTM layer returns the whole output sequence to the second one, instead the last ones return only the last hidden state.To evaluate the discrepancy between predicted and actual values, we used a loss function measured on each observation, which allowed us to calculate the cost function.We needed to minimize the cost function by identifying the optimized values for each weight.Thanks to multiple iterations, the optimization algorithms transfer the identification of the weights that minimize the cost function.In our implementation, we used the Adam optimizer (Kingma and Ba 2014), which is an adaptive learning speed method, namely it computes individual learning rates for several parameters (Kingma and Ba 2014).The activation function used for CNN-LSTM and ED-LSTM models is rectified linear units (ReLU) function (Agarap 2018).To stop the training, we used API of Keras and specifically the "early stopping" method, setting a number of epochs with no improvement after which the training is stopped at 200.This method allows the training procedure to stop when the monitored metric has stopped improving.The monitored metric was the value of the cost function.Given all the possible hypotheses, we wanted to find the best one (called "optimal"), namely the one that allowed us to make more precise estimates, always based on data in our possession.For each model, the input dataset is divided into three subsets called training, validation, and test datasets.The training and validation datasets are used during the learning phase.The test dataset is used afterwards to evaluate the quality of the model.In this way, we can determine the ability of the model to predict new cases not used during the learning phase.The training dataset is 60% of the primary dataset, whereas the test and validate datasets include the remaining 20% and 20%, respectively.This type of splitting is commonly used in the supervised training of deep learning models, allowing sufficient data for training and model quality verification (Gholami et al. 2015).The models were fitted using a division for batches.
For the training of the LR models we used the same training and test datasets used to the CNN-LSTM and ED-LSTM models.In this way, we can compare the results of the three models.In our models, the cost function used was the mean absolute error (MAE) calculated on the training dataset during the resolution of each batch and on the validation dataset at the end of each epoch.This procedure allowed to minimize the overfitting effect on the training set.

Evaluation of models
Each model is associated with some errors, the evaluation of which provides information on the performance of the model itself.In this work, we used the Mean Absolute Error (MAE).The MAE is an arithmetic mean of the absolute errors, and is one of the methods used to assess the model performance (Willmott and Matsuura 2005): In addition to the MAE, the average relative error was calculated for each raingauge: The models were moreover evaluated by the parameter R2, an index measuring the link between the variability of data and correctness of the statistical model used.Another method was used to estimate the performance of the models, which also made it possible to try to understand the cause of the errors of the prediction models.This method consists of taking the models errors for each raingauge: and correlating them with the value derived from the average of the rainfall at the three input raingauges ( R 1 , R 2 ,R 3 ) minus the rainfall amount of the output raingauge ( R 0 ) for the same day of the model errors: According to the Cauchy-Schwarz inequality, PCC ranges between + 1 and -1, where + 1 corresponds to perfect positive linear correlation, 0 corresponds to no correlation, and -1 corresponds to perfect negative linear correlation (Lee Rodgers and Alan Nice Wander 1988).

Results
Figure 3A shows the boxplots of MAE of the deep learning models of the 349 raingauges time series simulated.The median of the errors is about 3 mm for CNN-LSTM and ED-LSTM models, while LR model has a median of about 1 mm.
The R 2 values are reported in Fig. 3B in which we can recognize that the two deep learning models have similar R2, while LR model shows the best values of error metric.Figure 3C shows the average relative error for six daily rainfall bands considering all raingauges (0-1 mm, 1-3 mm, 3-5 mm, 5-10 mm, 10-30 mm, 30-50 mm).The figure shows the relative errors for the three model architectures are also very similar in these cases. (5) The analysis of the spatial distribution of errors does not denote a clustering or a specific spatial distribution (a case of MAE of CNN-LSTM models is reported in Fig. 4).
The errors on the training and validation dataset are monitored during the training time, but we can compare the MAE calculated on the test and validation dataset (training dataset for the LR model; Fig. 5), during the post-training phase, to evaluate if the models are not subject to overfitting.The MAEs calculated on the validation dataset and on the test dataset are comparable; the difference is present in a small range around 0, indicating a low degree of overfitting (Fig. 5).
Model errors are higher when the difference among input and output data is higher.Each model shows a strong correlation (median PCC values of 0.9) between the difference among the input and output values and the absolute error (Fig. 6).

Discussion
The errors of CNN and ED architectures are very similar, but the LR model is the best.If we compare the MAE of the two architectures for each raingauge, we can observe that they almost overlap (Fig. 7).The correlation between the two errors by the Pearson method gives a value of 0.93.
Comparing the errors of the models proposed in this study with those derived from other AI-based works or different approaches (e.g., mathematical, statistical) is Fig. 3 Absolute and relative errors of CNN-LSTM and ED-LSTM models: A) Mean absolute error (MAE); B) R 2 ; C) Relative Mean Absolute Error (RMAE).The boxes represent the interval between the 25th and 75th percentiles (Q1 and Q3).IQR is the interquartile range Q3-Q1.The upper whisker will extend to the last datum lower than Q3 + 1.5 × IQR.Similarly, the lower whisker will reach the first datum higher than Q1 -1.5 × IQR very complex.The differences depend on several factors such as the features of the study area, the spatial distribution of the raingauges, and the rainfall distribution and amount (de Silva et al. 2007).However, considering this, we compared our model errors with those derived from other methods (Beauchamp et al. 1989;Gyau-Boakye and Schultz 1994;Abebe and Price 2003;Coulibaly and Evora 2007;Caldera et al. 2016;Balcha et al. 2023), recognizing a reasonable comparability.For example, Coulibaly and Evora (2007) compared six ANN structures to predict missing precipitation data using three input raingauges, and their models have MAE in the range 1.5-2 mm and an R 2 of 0.75.On the other hand, using statistical methods Balcha et al. (2023) obtained a MAE ranging from 2 to 8 mm.Analyzing these cases, we recognized that AI techniques could have higher accuracy than traditional methods.This result could be attributed to the non-stationary behaviour of rainfall models and to the capacity of AI models to work with not linear relations (Creutin et al. 1997).
We can divide the models into two groups considering the spatial relationship between input and output raingauges: group A is composed of the cases for which the output raingauges are located out of the triangle; group B is composed of the cases for which the output raingauges are spatially located in the triangle with the input raingauges as vertices (see Fig. 8).The analysis of the two cases allows us to understand if the MAE values are related to these two configurations of the input and output.Group A counts 26% of the raingauges, whereas group B includes the remaining 74%.Both groups have an average MAE of about 3.2 mm, and the percentage of raingauges with an error greater than 5 mm is 12% for both cases.This analysis shows that the model errors are not related to the two types of relative positions between the output and input raingauges.
The errors of the model are strongly correlated with the difference between the mean rainfall of the input raingauges and the value of the output raingauges (Eq.5).The greater the difference between input and output data, the greater is the model prediction error, as shown in Fig. 6.Deepening this result (de Silva et al. 2007) in each of the time series used, we can find some records exhibiting a high difference between the input and output values at the same time t.In other cases, these differences are also in the input dataset.For each time series, we identified and counted the records in which one of the following conditions occurs: i) the output raingauge measured more than 5 mm (rainy day) and all three input raingauges measured 0 mm (no rainy day); ii) the output raingauge measured 0 mm (no rainy day) and the input raingauges measured more than 5 mm (rainy day); iii) the output raingauge recorded 0 mm and the average rainfall of the three input raingauges is greater than 5 mm (rainy day).
The procedure highlighted that each time series has a number of these cases, variable with a maximum number of more than 8%.We defined these as anomalous cases because it is very complex to understand the causes of these measured differences.We remember that the data used in this work are validated by the SIR, the Fig. 7 The plot shows the ratio between the MAE obtained for the two types of model architectures meteorological service which provided them.We cannot assert that these records are wrong because we cannot exclude a meteorological factor influencing the measure.At the same time, we are not sure that the cause is, at least completely, due to meteorological conditions because the stations are spatially very near and with a little difference in altitude.
The models show a higher error concerning the number of anomalous cases (Fig. 9).This relation is quantified by a PCC value of 0.71 for both the used architectures.This result highlights the data-driven behaviour of AI techniques and can be used to emerge these particular cases from the database.
This research will allow the creation of a control procedure on the time series to improve their knowledge and understand if the causes are meteorological or instrumental and improve the management of the database.

Conclusions
This work demonstrates that deep learning models can predict rainfall values using the time series of nearby raingauges as input.The errors of the models are comparable to those obtained by other works which used similar or different techniques.These models can be applied in the analysis of the rainfall time series, for instance, to compute the missing data.This problem is one of the main issues that afflict the meteorological time series, such as other types of environmental monitoring parameters.
However, this study also demonstrated that the deep learning model performances are strongly influenced by the input data, confirming the data-driven behaviour of these techniques.Major errors correspond to major differences among the input data or with the output values.The causes of these anomalies can be different.We cannot exclude meteorological factors, but we suppose that the main cause could be linked to raingauges malfunctions, given the specific selection of the input.This problem can affect all rainfall networks and improving knowledge on the identification of these anomalous data can allow a better management of the measurement network and validation procedures.

Fig. 1
Fig. 1 Tuscany Region.The red points indicate the 349 raingauges managed by the Regional Hydrologic Service (SIR) and used in this work

Fig. 2
Fig. 2 Architecture of the deep learning models used in this study.A) CNN-LSTM; B) ED-LSTM

Fig. 4
Fig. 4 Spatial location of Mean Absolute Errors (MAE) of the CNN-LSTM

Fig. 8
Fig. 8 Relation between input raingauges area (yellow) and respective output raingauges outside (A) and inside (B)