1 Introduction

Accurately estimating the rainfall rate allows for modelling hydrological and other environmental processes and for enabling timely countermeasures to mitigate the risks due to extreme events, such as flash floods and landslides. Real-time constraints, noise, outliers, incomplete and inconsistent data, and the need to integrate different sources of data are some of the main issues to consider in defining a solution able to effectively address this problem.

Rain Gauges (RGs) are a valuable tool for estimating rainfall in complex environments, as these sensors provide direct measurements of the precipitation at individual points over different time windows and allow a large amount of high-rate information to be gathered for detecting extreme rainfall events. Interpolation methods are usually employed to combine the information provided by different RGs and estimate the rainfall height at spatial coordinates not covered by sensors. In particular, Kriging geostatistical interpolation techniques [1] are widely adopted in the literature to address this problem.

However, since rainfall is a complex phenomenon characterized by high spatial and temporal variability over a wide range of scales [2], reconstructing the spatial rain field from point measurements is a challenging task [3]. Moreover, even dense RG networks may not gather enough information on the spatial variability of the precipitation; as a result, the interpolation of the gathered values may not provide a reliable estimate of the rainfall field for real-time warning systems or for the post-analysis of critical events. Indeed, the information provided by the rain gauges is frequently inadequate to extract relevant patterns for convective storms characterized by strong intensity, high variation and short duration. As a consequence, heavy precipitations are poorly depicted, and this scenario is further exacerbated in regions with a complex topography and small catchments [4].

Some recent approaches propose to address this problem by integrating heterogeneous rainfall data sources via interpolation methods [5]. The reconstruction of the spatial precipitation field can be substantially improved by coupling rain gauge measurements with precipitation estimates provided by meteorological radars [6]. Moreover, geostationary satellites (MSG) represent a further important source for estimating the rain field, since they allow for continuously monitoring large areas without the problems deriving from a particular orography [7]. Processing and handling such large amounts of heterogeneous data represents a hard challenge in this context.

A widely used approach for integrating distributed rain gauge point measurements with radar data is Kriging with External Drift (KED), a multivariate geostatistical interpolation method able to process a non-stationary random field (rainfall data, in the case of interest here) by taking into account both the spatial dependence of the random variable and its linear relation to one or more additional variables (radar data) [8]. However, KED is computationally expensive, since it is based on Gaussian Process Regression [9], and it does not take into account historical data, which represent a precious source for improving the estimate.
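To make the KED computation concrete, the following is a minimal NumPy sketch that solves the KED system for a single prediction point. The Gaussian covariance model and all names are illustrative assumptions, not the implementation used in the cited works.

```python
# A minimal, illustrative sketch of Kriging with External Drift (KED).
# The Gaussian covariance model and all names are assumptions for illustration.
import numpy as np

def gaussian_cov(d, sill=1.0, rng=10.0):
    """Gaussian covariance model as a function of distance d (assumed)."""
    return sill * np.exp(-(d / rng) ** 2)

def ked_predict(coords, z, drift, p0, drift0):
    """Estimate the value at p0 from gauge values z observed at `coords`,
    using the external drift (e.g., radar values) at the gauges and at p0."""
    n = len(z)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    C = gaussian_cov(d)                         # gauge-gauge covariances
    c0 = gaussian_cov(np.linalg.norm(coords - p0, axis=1))
    # KED system: covariances plus two unbiasedness constraints
    # (constant mean and linear dependence on the drift variable).
    A = np.zeros((n + 2, n + 2))
    A[:n, :n] = C
    A[:n, n] = A[n, :n] = 1.0
    A[:n, n + 1] = A[n + 1, :n] = drift
    b = np.concatenate([c0, [1.0, drift0]])
    lam = np.linalg.solve(A, b)[:n]             # kriging weights
    return lam @ z
```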

Table 1 Analysis of ML/DL based techniques for rainfall estimation

1.1 Related works

Recently, a new line of research has focused on the usage of Machine Learning (ML) techniques to automatically fuse information from heterogeneous data sources and provide a more accurate rainfall estimation [20]. However, the adoption of ML techniques raises a number of issues, such as imbalanced data, many missing attributes and real-time constraints, which require the development of specific solutions. In addition, integrating heterogeneous rainfall data sources to yield a more accurate estimation is a further relevant research topic, especially when a hydrological model requires a rainfall estimate as input [5]. Ensemble and Deep Learning (DL) are two data-driven paradigms able to effectively tackle these issues.

Ensembles [21] are well-known ML methods in which the outputs of several models, trained on different data samples or with different algorithms, are combined according to a given strategy to classify new unseen instances. This paradigm permits us to cope with the problem of unbalanced classes and to reduce both the variance and the bias of the error. In the reference scenario, ensemble-based algorithms can be employed to address the above-cited issues related to the problem of rainfall estimation, such as unbalanced data and the non-linear correlation among the different sources of data (radar, Meteosat and rain gauges).

An interesting preliminary ensemble approach (also including a neural network) to improve rainfall rate estimation is proposed in [10]; however, it takes into account only radar data. Kuhnlein et al. [22] also exploit an ensemble-based approach that employs Random Forests to estimate rainfall rates from MSG data. A further work moving in this direction has been proposed in [12]. The authors devised a framework integrating different data sources via a Hierarchical Probabilistic Ensemble Classifier (HPEC). Experiments conducted on a real case study prove the importance of combining different data sources and the capability of the proposed ensemble method to alleviate the above-mentioned issues.

A different solution approach relies on the usage of (Deep) Neural Networks. Deep Learning [23] allows for learning accurate Rainfall Estimation Models (REMs) by directly fusing raw low-level data gathered from heterogeneous data sources. Indeed, a hierarchical scheme is exploited in learning Deep Neural Networks: basically, a DNN can be imagined as a stack of several levels, each one composed of a number of base computational units (neurons). Here, the idea is that every level of the DNN can extract high-level features, which are provided as input to the next level of the network. In this way, the DNN enables the automatic extraction of discriminative data abstractions; therefore, these models represent an effective solution for fusing different types of data provided in different formats. A further benefit of adopting the DL paradigm is the possibility to incrementally update the model as soon as new batches of data become available, so as to ensure the scalability of the overall learning system.

For the above-cited motivations, and also because of the highly non-linear correlations between sensor data, cloud properties and rainfall estimates, different machine learning (and also deep learning) techniques have recently been explored [13, 24] to improve the quality of the models for the problem of rainfall estimation.

In the literature, some preliminary approaches exploiting the DL paradigm have been proposed to address the rainfall estimation problem [16,17,18], but these solutions employ a single data source at a time or train a separate model for each source. For instance, with the aim of estimating the rainfall rate on a daily timescale, the authors of [16] define an approach based on a NN trained against satellite data (single source). In the same way, the work in [17] designs a suitable NN-based model to process data from the Meteosat data source.

A different solution is proposed in [13], where deep autoencoders and Long Short-Term Memory networks are combined to process image sequences concerning rainfall events. Basically, the approach allows for denoising, extracting high-level features and analyzing the data provided as input to the network.

In [18], the authors define a new methodology to integrate different precipitation data sources by combining an artificial neural network and a vector space transformation function. In [19], a Multi-Layer Perceptron is trained by using surface and altitude mapping data; Principal Component Analysis (PCA) is exploited to preprocess the raw data and extract informative features to feed the DNN. Finally, a preliminary approach to this task is proposed in [15], in which different data sources are automatically combined to provide a rainfall estimate by using a DNN with a particular topology.

Table 1 summarizes the most significant approaches among those described above. Compared to these approaches, to the best of our knowledge DeepEns-REM is the first solution that merges the Deep Learning paradigm with ensemble-based methods and integrates different types of data sources.

1.2 Our proposal

To overcome the main limitations of the above-described works and address the main challenges of the weather forecast domain (i.e., handling huge amounts of data yielded by the sensors, integrating different types of data from different sources, and learning incrementally as soon as new batches of data become available), we defined a scalable DL-based framework able to effectively detect severe rainfall events.

In more detail, we propose to merge the robustness of the ensemble technique with the capability of DNNs to handle and automatically combine raw low-level data from heterogeneous data sources: an Ensemble of Deep Neural Networks (DeepEns-REM) fed with different types of data sources (i.e., RGs, radars and satellites) is introduced here. In order to better face the imbalanced nature of the training data (severe rainfall events are rare cases), the DNN architecture of the base classifiers has been designed to include a combination of dropout layers, batch normalization functions and residual-like connections. Each model composing the ensemble is discovered by exploiting a Snapshot procedure [25], and their predictions are combined via a non-trainable function. Notably, while the Residual Blocks reduce the number of epochs needed for the learning algorithm to converge, the Snapshot procedure allows all the base models composing the ensemble to be learned in a single DNN training run, ensuring efficiency and scalability. Moreover, these classifiers are trained by using an ad-hoc cost-sensitive (imbalance-aware) loss function.

The experimental evaluation has been conducted on a real scenario, described in [12]. Specifically, the data concern a peninsular region of Southern Italy (Calabria) and are provided by the Italian Department of Civil Protection. This region is an effective test case because of its complex orography and strong climatic variability: the heat flux coming from the Mediterranean Sea and the high reliefs near the warm sea interact strongly and favour convective instability, and the region also experiences frequent floods and landslides.

The proposed framework outperforms widely used and recognized methods (i.e., KED) and could be of support in many real contexts, as in the case of an officer of the Department of Civil Protection who has to analyze the rainfall in specific zones exhibiting risks of landslides or floods and has to make real-time decisions in order to avoid damages or, in the worst case, injuries and deaths.

1.3 Organization of the paper

The rest of this article is organized as follows. Section 2 provides some background information about the problem of rainfall estimation and formally defines the addressed problem. In Sect. 3 we introduce the reference framework. Section 4 is devoted to illustrating the Rainfall Estimation Model devised to tackle the task. Experimental results, in terms of both the effectiveness and the efficiency of the proposed approach, conducted on the above-mentioned real test case, are shown in Sect. 5. Finally, Sect. 6 concludes the work by summarizing the obtained results and introducing some new research lines for future works.

2 Background and problem definition

In this work, we address the areal rainfall estimation problem, which can be defined as the problem of estimating a spatial field that is highly variable in both space and time by using data from a number of randomly placed points.

Direct measurements of rainfall are gathered by RG networks, which provide the amount of precipitation at a limited set of points over the area of interest. However, even dense networks of rain gauges may prove too sparse to compute an accurate interpolated spatial rainfall field for hydrological modelling. Estimates of a spatial precipitation field can be provided by other meteorological instruments, like weather radars, which provide high-resolution observations of precipitation spatial patterns. However, a radar measurement is an indirect estimate of rainfall, as it does not guarantee a precise quantitative measure of rainfall.

In addition, images from meteorological satellites represent a precious source for computing spatially distributed rainfall estimates [26]. In fact, from the observations taken by the satellites at different wavelengths, rainfall estimation methods can extract interesting characteristics of the clouds related to possible ground precipitation, which can then be used to indirectly measure the rainfall.

Notably, since reconciling the different temporal and spatial resolutions of the data sources is crucial for learning accurate estimation models, the preprocessing phase described in [12] is adopted to make the measurements logged by the different instruments comparable.

Problem definition

As highlighted above, our final aim consists in discovering accurate predictive models able to estimate the accumulated rainfall at individual points (not monitored by the rain gauges) in a specific time interval by automatically fusing information gathered from different heterogeneous data sources.

Basically, as shown in Fig. 1, each RG belonging to the network provides the measurement of the accumulated rainfall at the point (x, y) where it has been deployed. However, it can monitor only a limited part of the area where we want to estimate the rainfall rate; therefore, our task consists in estimating the values at the remaining uncovered points of the grid for a specific time window.

Fig. 1
figure 1

View of the integrated rainfall data sources for a generic point \(p=(x,y)\). Three sources (RG network, Radar and Geostationary satellites) contribute to feeding the Rainfall Estimation Model

Formally, the above-described problem can be defined as follows.

Let \(p_i=(x_i,y_i)\) be the spatial coordinates of the \(i_{th}\) rain gauge belonging to the RG network composed of m nodes, and let \(f_{rg}(p_i,t)=v\) be a function providing the height of rainfall observed by the \(i_{th}\) rain gauge at time t. We denote by \(RG(t)=\langle {(p_1, f_{rg}(p_1,t)), (p_2, f_{rg}(p_2,t)),\ldots , (p_m, f_{rg}(p_m,t))}\rangle \) the list of pairs containing the spatial coordinates of each rain gauge in the network and the rainfall it observed at time t. Notably, in our setting, we assume that the spatial coordinates of the rain gauges cannot change over time.

In addition, different types of side information, extracted from other data sources, can be coupled with the RG data to enrich the initial set of features. With this aim, let us consider a generic point p belonging to the area where we want to estimate the accumulated rainfall and not covered by any rain gauge.

Let \(f_{radar} (p,t)=\langle {rf_1,rf_2, \ldots ,rf_k}\rangle \) be the radar feature vector, where \(rf_i\) is the value of the \(i_{th}\) feature gathered by the radar at p at time t.

Let \(f_{msg} (p,t)=\langle {mf_1,mf_2, \ldots ,mf_l}\rangle \) be the Meteosat feature vector, where \(mf_i\) is the value of the \(i_{th}\) feature provided by the Meteosat satellites at p at time t.

We define the function \(f_d(p,j)\), which, given a point p, returns its distance from the \(j_{th}\) nearest rain gauge in RG(t), and the function \(f_n(p,j)\), which, given a point p, returns the index of the \(j_{th}\) nearest rain gauge in RG(t).

Afterwards, we define the function \(f_c(f_{rg}(p_i,t)) = c\), a function \({\mathbb {R}}_{\ge 0} \mapsto C = \{ c_1, c_2, \dots , c_o \} \) which maps each value of rainfall height to a specific (discrete) rainfall class c.

Let

$$\begin{aligned} f_{rg'}(p,t,nb) = \langle&(p_{f_n(p,1)}, f_d(p,1), f_c(f_{rg}(p_{f_n(p,1)},t))),\\&(p_{f_n(p,2)}, f_d(p,2), f_c(f_{rg}(p_{f_n(p,2)},t))),\\&\ldots , (p_{f_n(p,nb)}, f_d(p,nb), f_c(f_{rg}(p_{f_n(p,nb)},t)))\rangle \end{aligned}$$

where nb is the number of neighbours considered.

We define the extended data input (edi) for a point p at time t as \(edi(p,t,nb) = f_{rg'}(p,t,nb) \oplus f_{radar}(p,t) \oplus f_{msg}(p,t)\), where \(\oplus \) denotes concatenation.

Finally, we define a Rainfall Estimation Model REM as a function \(REM: Q^{nb} \times {\mathbb {R}}_{\ge 0}^k \times {\mathbb {R}}_{\ge 0}^{l} \rightarrow C\), where \(Q = {\mathbb {R}}_{\ge 0}^{2} \times {\mathbb {R}}_{\ge 0} \times C\), that maps each tuple \(\uptau \equiv edi(p,t,nb)\) to a rainfall class, i.e. \(\uptau \mapsto REM(\uptau )\) with \(REM(\uptau ) \in C\).

In a nutshell, given as input the concatenation of the data measured by the three data sources (rain gauges, radars and Meteosat satellites) at a given time t, this model (REM) returns the estimated class c of the rainfall event.
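As an illustration of this definition, the following Python sketch shows how the edi vector could be assembled from the three sources; the function names and the data layout are our own assumptions, not the paper's code.

```python
# Illustrative sketch of how the extended data input edi(p, t, nb) defined
# above could be assembled; names and data layout are assumptions.
import numpy as np

def build_edi(p, gauges, rain, radar_feats, msg_feats, f_c, nb=4):
    """Concatenate the nb nearest-gauge descriptors with radar and MSG
    features for point p, mirroring edi(p,t,nb) = f_rg' + f_radar + f_msg.

    gauges: (m, 2) gauge coordinates; rain: (m,) rainfall heights at time t;
    radar_feats / msg_feats: feature vectors for p at time t;
    f_c: function discretizing a rainfall height into a class index.
    """
    dists = np.linalg.norm(gauges - p, axis=1)
    nearest = np.argsort(dists)[:nb]          # indices f_n(p, 1..nb)
    f_rg = []
    for j in nearest:
        # coordinates, distance f_d(p, .) and class f_c(f_rg(p_j, t))
        f_rg.extend([*gauges[j], dists[j], f_c(rain[j])])
    return np.concatenate([f_rg, radar_feats, msg_feats])
```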

3 Reference framework

In this work, we extend the framework defined in [12] by including a novel REM based on an Ensemble of Deep Neural Networks. The software architecture of this framework is divided into different layers, which are used for data integration and preprocessing, for building the DNN-based model and finally for evaluating the performance and visualizing the results.

Fig. 2
figure 2

Rainfall estimation framework system

Figure 2 shows the above-mentioned layers, described in the following, starting from the lowest layer comprising the different information sources (i.e., rain gauges, radar and satellite data).

First, the Information Retrieval layer permits the extraction and the integration of the different types of information from the data sources. The connection with the different sources of data is guaranteed by the Data Source Connectors. Then, their output is sent to the Data Wrapper, which is responsible for supplying a unified view of these data to the upper level.

Afterward, data are passed to the Machine Learning layer, in which they are preprocessed to make them suitable for the analysis and an undersampling strategy is adopted to address the class imbalance problem. In more detail, this layer includes three main modules: Data Preprocessing, Data Sampling and Model Building. First, missing values, outliers and noise are handled by adopting different strategies to clean the data for the learning phase. A big issue in handling the different distributions of the classes is the so-called imbalance problem, in which the majority class is often one or two orders of magnitude larger than the minority class. Consequently, the minority class is present only in a small portion of the training set, making it really hard to train the classification models. In our framework, an ad-hoc undersampling strategy, sketched below, is used to mitigate this problem. Then, the model used to estimate the rainfall is generated by the Model Building module, trained on the data preprocessed by the other modules. Finally, the test data, preprocessed in the same way as the training data, and the REM model are passed to the last layer, in which the visualization of the final rainfall estimate, together with an evaluation of the performance of the model, is generated by the Rainfall Map Visualization and the REM Performance Evaluation modules, respectively.
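As a purely illustrative example of such a strategy (the actual ad-hoc procedure is detailed in [12]), a plain random undersampling step could look as follows:

```python
# A minimal sketch of random majority-class undersampling; the framework uses
# an ad-hoc strategy described in [12], so this is only an assumed example.
import numpy as np

def undersample(X, y, rng=None):
    """Randomly downsample every class to the size of the rarest one."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]
```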

4 DeepEns-REM

In this section, we describe the classification model adopted to distinguish different rainfall classes. Specifically, our REM takes the form of an Ensemble of Residual Neural Networks learned by exploiting a Snapshot procedure. Below, first, we describe the architecture of the base learner, and then we illustrate the Snapshot procedure adopted for building the ensemble model.

4.1 ResNet-REM base model

In Fig. 3a, the overall architecture of the DNNs composing the ensemble is shown. Basically, it is a feed-forward neural network including a number of skip connections.

Fig. 3
figure 3

DeepEns-REM base components

Fig. 4
figure 4

Snapshot-based ensemble model

The usage of the skip connections induces in the base DNN classifier a behaviour similar to Residual Networks [27], which have been proven to be effective solutions to the well-known degradation problem (i.e., neural networks performing worse at increasing depth), and capable of ensuring a good trade-off between convergence rapidity and expressivity/accuracy.

Notably, these benefits seem to depend on the fact that a residual network can be viewed as an ensemble of alternative transformation paths: in practice, the final output of the network is yielded by combining paths of different lengths; therefore, a ResNet can be regarded as an ensemble itself.

The base learner architecture is built on the sub-net shown in Fig. 3b, which includes three main components: (i) a fully-connected dense layer equipped with a ReLU activation function [28], (ii) a batch-normalization layer that makes the learning stage more stable [29], and (iii) a final dropout layer to address the overfitting issue [30]. Notably, in Fig. 3a, we refer to the stack of these three components as a Building Block. Interestingly, as highlighted by Hinton et al. in [31], dropout mechanisms enable an ensemble behaviour within the neural net: multiple sub-nets, resulting from randomly masking some neurons, are combined to yield the final output.

A Residual Block (two Building Blocks stacked with a skip connection) can be replicated several times within the model to make the architecture deeper; the number of instances of this sub-net is a parameter of the ResNet-REM.
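A hedged Keras sketch of this architecture, assuming the layer sizes reported in Sect. 5.1 (64 neurons, two Residual Blocks) and treating all other details (e.g., the dropout rate) as assumptions, could look as follows:

```python
# Sketch of the ResNet-REM base learner: Building Blocks (Dense + BatchNorm +
# Dropout), Residual Blocks with skip connections, concatenated abstraction
# levels, and a single ordinal output neuron. Details are assumptions.
from tensorflow import keras
from tensorflow.keras import layers

def building_block(x, units=64, dropout=0.2):
    x = layers.Dense(units, activation="relu")(x)
    x = layers.BatchNormalization()(x)
    return layers.Dropout(dropout)(x)

def residual_block(x, units=64):
    y = building_block(x, units)
    y = building_block(y, units)
    return layers.Add()([x, y])                # skip connection

def resnet_rem(n_features, n_res_blocks=2, units=64):
    inp = keras.Input(shape=(n_features,))
    x = building_block(inp, units)             # project to `units` dimensions
    taps = [x]
    for _ in range(n_res_blocks):
        x = residual_block(x, units)
        taps.append(x)
    x = layers.Concatenate()(taps)             # fuse abstraction levels
    # Single neuron yielding the ordinal score (ReLU keeps it non-negative;
    # the activation choice is an assumption).
    out = layers.Dense(1, activation="relu")(x)
    return keras.Model(inp, out)
```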

We highlight that we are not addressing a plain classification problem but an ordinal learning task. In a nutshell, our main goal consists in ranking the rainfall events on the basis of their risk and identifying the most dangerous ones. Moreover, when the model errs in estimating the severity of a rainfall event, it is preferable that the error correspond to an adjacent class (i.e., a slightly heavier or lighter rainfall event). On the basis of this rationale, the last layer of our architecture is equipped with a single neuron that yields a prediction value ranging in the interval \([0,\mathrm{number\_of\_classes}]\).

In the literature, the Mean Absolute Error (MAE) [32] has often been adopted as a loss function for ranking problems due to its robustness to outliers. However, in our scenario, the REM should focus on detecting exceptional behaviours by tackling the imbalance problem and, hence, reducing the classification error for the minority classes (exceptional cases). In this respect, to further improve the detection capabilities of the network, we adopted a weighted variant of the MAE (named Weighted-MAE in the following), defined as follows:

$$\begin{aligned} \text {Weighted-MAE} = \frac{1}{n} \sum _{i = 1}^{n} \left| y^{(i)} - \tilde{y}^{(i)}\right| \cdot weight(y^{(i)}) \end{aligned}$$
(1)

where n represents the batch size, i.e., the number of examples, \(y^{(i)}\) and \(\tilde{y}^{(i)}\) are, respectively, the real and the predicted value for the rainfall event, and \(weight(y^{(i)})\) is a function that maps each class to a corresponding weight. In our framework, we adopt a linear weight function: instances belonging to the first class are weighted with 1, instances belonging to the second class are weighted with 2, and so on. Finally, the outputs of the Residual Blocks and of the Building Block are concatenated and provided as input to the last layer of the model, so as to exploit different levels of abstraction in yielding the final decision. In the following, we refer to the base model as ResNet-REM.
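A possible TensorFlow implementation of Eq. (1), assuming classes encoded as integers 0 to 4 (so that the linear weight is simply the class index plus 1), is sketched below; this is an assumed transcription, not the authors' code.

```python
# Weighted-MAE of Eq. (1) with the linear class-weighting described above.
import tensorflow as tf

def weighted_mae(y_true, y_pred):
    # Linear weights: class 0 -> 1, class 1 -> 2, ... Assumes y_true holds
    # the integer rainfall class, so weight(y) = y + 1.
    w = tf.floor(y_true) + 1.0
    return tf.reduce_mean(tf.abs(y_true - y_pred) * w)

# Example usage with the parameters of Sect. 5.1 (RMSprop, lr = 0.001):
# model.compile(optimizer=tf.keras.optimizers.RMSprop(0.001), loss=weighted_mae)
```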

4.2 Snapshot procedure

In our solution, we train several base models, whose outputs are then combined by using a non-trainable function (the max rainfall class, i.e., the most severe class among those predicted by the base models). Each ResNet-REM composing the ensemble is yielded by means of a straightforward algorithm, which exploits the iterative gradient-descent optimization of the DNN training stage. The main idea consists in updating the weights of the base DNN architecture, at each epoch, in the opposite direction to the gradient, with the aim of progressively approaching a local minimum. In this scheme, the learning rate (lr) controls the convergence rate and is iteratively updated to ensure fast convergence. Notably, the local optimum achieved by means of the SGD procedure and its variants depends on the initialization of the network and on the lr value. As highlighted in [25], a cyclic reset of the lr restarts the optimization from another point in the search space and thus allows for discovering different variants of the model. In our framework, we adopted as cyclic function the shifted cosine function proposed by Loshchilov and Hutter [33].

Specifically, DeepEns-REM is learned as follows. A ResNet-REM (named \(RN_1\) in the following) is randomly initialized and trained. Then, the algorithm iteratively learns the model \(RN_i\) by re-training \(RN_{i-1}\) with the initial learning rate \(\eta \) for a fixed number of epochs; at each epoch, \(\eta \) is progressively lowered. The trained \(RN_i\) is added to the ensemble list and the lr is once again reinitialized. The resulting Ensemble Strategy is presented in Fig. 4, where several ResNet-REMs \(\{RN_1, RN_2, \ldots , RN_n\}\) are suitably combined to compute the final prediction. In the application phase, the edi of each point to be estimated is provided as input to each base model composing the ensemble. The final rainfall severity score (i.e., the rain event class) is obtained by taking the maximum class value estimated by the base models, i.e., \(max(C_{RN_1}, \ldots , C_{RN_n})\), where n is the number of ResNet-REMs trained.
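The following Python sketch summarizes the Snapshot procedure under the parameters of Sect. 5.1 (5 snapshots over 64 epochs, roughly 13 epochs per cycle); the training-loop interface is an assumption, not the paper's implementation.

```python
# Sketch of the Snapshot procedure [25] with the shifted-cosine schedule of
# Loshchilov and Hutter [33]; interface and names are assumptions.
import copy
import math

def snapshot_ensemble(model, train_fn, lr0=0.001, n_snapshots=5,
                      epochs_per_cycle=13):
    """Train one network, saving a snapshot at the end of each lr cycle.
    `train_fn(model, lr)` is assumed to run one epoch of gradient descent."""
    ensemble = []
    for s in range(n_snapshots):
        for e in range(epochs_per_cycle):
            # Shifted cosine annealing: lr decays from lr0 towards 0 within
            # a cycle, then is reset, restarting the optimization.
            lr = 0.5 * lr0 * (1 + math.cos(math.pi * e / epochs_per_cycle))
            train_fn(model, lr)
        # Snapshot RN_{s+1}; a real DL framework would save/restore weights.
        ensemble.append(copy.deepcopy(model))
    return ensemble

def predict_max_class(ensemble, predict_fn, edi):
    """Combine base predictions with the non-trainable max-class rule."""
    return max(predict_fn(m, edi) for m in ensemble)
```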

5 Evaluation

To evaluate the quality of our approach in estimating (extreme) rainfall events, a number of experiments on the real data concerning Calabria have been conducted. The first suite of experiments compares our approach with some ensemble-based techniques (random forest, boosting) and, in addition, with the decision tree algorithm and with SVR (support vector regression), which has been successfully used in the field of rainfall forecasting [34]. Furthermore, we compared DeepEns-REM with the KED (kriging with external drift) baseline method and with the Hierarchical Probabilistic Ensemble Classifier (HPEC) [12]. Note that KED represents the state of the art among interpolation methods for rainfall estimation. Finally, we compared our deep ensemble model with two state-of-the-art neural architectures proposed in the literature.

In particular, all these test suites have been evaluated also on the minority class, which represents extreme rainfall events.

5.1 Dataset, parameters and evaluation measures

The dataset used for the experiments measures the rainfall events over the second half of 2016 in Calabria and contains 117,600 tuples with 35 features. The pre-processing phase of this dataset is described in the previous section and in detail in [4]. In summary, for every cell and for a 30-minute time step, 35 features were extracted: 11 features from the MSG satellites, 16 features from the rain gauges and 5 features from the radars. More in detail, for each point, we consider the 4 nearest cells in which a rain gauge is present (latitude, longitude, distance, and measured rainfall value for each of them). In addition, the month of the measurement is also considered.

The class to be predicted, i.e., the millimetres of rain fallen in 30 minutes, was discretized into 5 classes (i.e., less than 0.5, 0.5–2.5, 2.5–7.5, 7.5–15 and greater than 15 mm) [12]. The last class is particularly important because it includes extreme rainfall events, which must be handled adequately. However, as a low number of tuples belong to this class, its classification is a really challenging problem.
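The discretization can be transcribed directly from these thresholds (the handling of values falling exactly on a boundary is our assumption):

```python
# Discretization of the 30-minute accumulated rainfall into the 5 classes
# listed above, numbered 0-4; boundary handling is assumed.
def rainfall_class(mm):
    thresholds = (0.5, 2.5, 7.5, 15.0)   # mm per 30 min
    for c, thr in enumerate(thresholds):
        if mm < thr:
            return c
    return len(thresholds)               # class 4: extreme events (> 15 mm)
```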

Specifically, the base learner includes two instances of the Residual Block. Each dense layer is equipped with a ReLU activation function and includes 64 neurons. RMSprop is used as optimizer with an initial \(lr=0.001\). DeepEns-REM is trained over 64 epochs with batch_size = 512. Finally, the Snapshot procedure is performed until 5 variants of the model \(RN_1\) have been obtained.

As for the implementation of the competitors, we adopted the well-known scikit-learn machine learning library for the boosting, random forest and SVR models. No tuning of the parameters was conducted: the random forest algorithm fixes a maximum depth equal to 7 for each base learner and an overall number of trees equal to 50, while standard parameters have been used for the other algorithms. Kriging with External Drift (the baseline method) is based on the "autokrige" function of the automap library of the statistical software R, which automatically fits and chooses the best model among the different variogram models for the spatial interpolation.
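For reference, this configuration could be expressed with the standard scikit-learn API as follows; the exact boosting variant is not specified in the text, so GradientBoostingClassifier is our assumption:

```python
# Competitor configuration as described above (scikit-learn estimators;
# the boosting implementation is an assumption).
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeClassifier

competitors = {
    "random_forest": RandomForestClassifier(max_depth=7, n_estimators=50),
    "boosting": GradientBoostingClassifier(),   # default parameters
    "decision_tree": DecisionTreeClassifier(),  # default parameters
    "svr": SVR(),                               # default parameters
}
```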

Each model is trained on a sample of \(66\%\) of the dataset and evaluated on the remaining examples (test set). All the experiments were averaged over 30 runs. Note that, for a fair comparison with the kriging baseline method, we adopted the same partition defined above for the training and test sets. In addition, due to the interpolation nature of kriging, the points belonging to the test dataset were first removed, and the remaining points (training set) were used to fit the autocorrelation models.

Due to the rarity of the minority class (exceptional rainfall events), and in order to avoid issues concerning the correctness of the evaluation of this class, we adopted measures particularly apt to cope with unbalanced problems. Therefore, in addition to the metrics of precision and recall, usually used in the balanced case, we employ other metrics specific to rainfall estimation [36], i.e., the False-Alarm Ratio (FAR), the Critical Success Index (CSI) and the Probability of Detection (POD). However, these metrics are defined for a two-class scenario; in a multi-class scenario, they can be extended in many ways [37], mainly using two different techniques for averaging the evaluation measures: macro-averaging and micro-averaging. With the macro-based technique, the performance measure is first computed for each class and then averaged; with the micro-based one, a cumulative sum of the counts (TPs, FPs, FNs, TNs) is first calculated and then the overall measure is computed. We adopted macro-averaging, which weights all the classes equally (micro is more biased towards larger classes), as we are mainly interested in predicting the minority class.
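A minimal sketch of the macro-averaged computation of these scores from one-vs-rest contingency counts (our own transcription of the standard definitions) is given below:

```python
# Macro-averaged POD, FAR and CSI from per-class contingency counts,
# following the one-vs-rest extension described above (assumed sketch).
import numpy as np

def macro_pod_far_csi(y_true, y_pred, classes):
    pods, fars, csis = [], [], []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        pods.append(tp / (tp + fn) if tp + fn else 0.0)   # hit rate
        fars.append(fp / (tp + fp) if tp + fp else 0.0)   # false alarms
        csis.append(tp / (tp + fp + fn) if tp + fp + fn else 0.0)
    return np.mean(pods), np.mean(fars), np.mean(csis)
```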

5.2 Convergence analysis

In this section, we analyze the model behaviour at learning time. Basically, we are interested in investigating whether the DeepEns-REM architecture enables a faster convergence of the algorithm, reducing the number of epochs required to learn an accurate predictive model. To this end, we compared two architectures (named ResNet-REM and Base DNN in the figures), which include the same number of building blocks (i.e., blocks composed of a feed-forward dense layer, a batch normalization layer and a dropout layer). The only difference between the architectures lies in the presence of the skip connections and residual blocks in the former.

Fig. 5
figure 5

Convergence analysis for ResNet-REM and Base DNN (Weighted-loss)

Fig. 6
figure 6

Convergence analysis for ResNet-REM and Base DNN (Macro F1-score)

In Fig. 5a, the training loss value per epoch is shown. First, we point out that, at the first iteration, the ResNet-REM loss value is lower than the Base DNN one by 0.1; therefore, the benefits of adopting the skip connections can already be observed in the early stages of the training phase. Moreover, \(\sim 38\) epochs are required for the two models to exhibit the same loss values. This behaviour is even more evident in Fig. 5b, where we monitored the loss values on the validation set. In particular, we can observe an unstable behaviour of the loss for the Base DNN, whereas ResNet-REM is able to achieve a stable loss value in a limited number of iterations. In Fig. 6a, b, we also report the Macro F1-score values for both architectures. Once again, on the training set, the Base DNN is not able to achieve the same performance within 40 epochs, whereas, on the validation set, we again observe an unstable behaviour for this architecture.

5.3 Performance analysis

Table 2 CSI, FAR, POD, and MSE for SVR, decision tree, boosting, random forest and DeepEns-REM. The values in bold (italic face) are significantly better (worse) than at least two other methods
Table 3 Precision, recall and F-measure for SVR, decision tree, boosting, random forest and DeepEns-REM on the minority class (extreme rainfall events). The values in bold (italic face) are significantly better (worse) than at least two other methods

Here, we evaluate the rainfall estimation capability of DeepEns-REM in comparison with other Machine Learning-based techniques, in particular ensemble methods (including the Hierarchical Probabilistic Ensemble Classifier proposed in [12]) and with the baseline method (kriging with external drift). Mainly, we are interested in analyzing the behaviour of our algorithm in coping with exceptional rainfall events (minority class).

In order to evaluate whether the differences between the performances of the models are significant, a Wilcoxon signed-rank test with a confidence level of 0.95 (\(\alpha = 0.05\)) was used. The Wilcoxon test is a non-parametric pairwise test that can be used to detect significant differences between two algorithms. It is analogous to the paired t-test; however, it is largely used in the literature because it does not assume normal distributions, and outliers (exceptionally good/bad performances) have less effect on it.

Table 4 CSI, FAR, POD and MSE for Kriging, HPEC and DeepEns-REM. The values in bold (italic face) are significantly better (worse) than the Kriging method
Table 5 Precision, recall and F-measure for Kriging, HPEC and DeepEns-REM on the minority class (extreme rainfall events). The values in bold (italic face) are significantly better (worse) than the Kriging method

In the first suite of experiments, we compare DeepEns-REM with other ML-based methods: in Table 2, we report CSI (higher is better), FAR (lower is better) and POD (higher is better), macro-averaged over all the rainfall classes, while Table 3 shows precision, recall and F-measure for the class including extreme events. The values that are significantly better are reported in bold, while the values that are significantly worse are reported in italic face.

As reported in Table 2, DeepEns-REM exhibits significantly better rainfall detection capabilities, since it obtains the best value in terms of POD. Moreover, as shown in Table 3, it represents a good trade-off between precision and recall in detecting extreme events, with the best value in terms of F-measure.

Analogously, Table 4 reports CSI (higher is better), FAR (lower is better) and POD (higher is better) for all the classes, while Table 5 reports precision, recall and F-measure for the minority class, as in the previous experiments; in this case, however, we compare DeepEns-REM, HPEC and RF with the baseline, Kriging.

We can observe that the ML-based methods outperform Kriging in both tasks, i.e., accurately depicting the rainfall field and detecting extreme events. In particular, also in this case, DeepEns-REM is significantly better in terms of F-measure for the minority class.

Fig. 7
figure 7

A qualitative example of how DeepEns-REM, HPEC [12] and KED reconstruct the rainfall map, together with the map of the real values obtained in the rain gauges

Finally, in Fig. 7, we show the rainfall fields reconstructed by DeepEns-REM, HPEC and KED, respectively, for a particularly severe event that occurred on November 6, 2017, in which three RGs observed their maximum values of the year (i.e., 42.4 mm per hour). Notably, we discretized the values computed by the KED technique on the basis of the same classes considered for learning the ML-based models. Specifically, DeepEns-REM exhibits an interesting behaviour by highlighting more events belonging to class V and yielding a more detailed map for these cases; however, further investigations should be conducted to draw a final conclusion.

5.4 Comparison with deep learning architectures

In this subsection, we compare our deep ensemble-based solution with other neural models adopted in the literature. As highlighted in Sect. 1.1, current literature mainly focuses on the usage of feed-forward neural networks (combined with different types of pre-processing procedures) to address our same task. In this respect, to perform a fair comparison, we considered two advanced feed-forward neural networks having a number of parameters (i.e., neural network weights) comparable with our base model. The former (hereinafter named DNN_BB) is a deep neural network obtained by stacking several building blocks composed of three layers: a Dense layer, a Batch Normalization layer and a Dropout layer; in our experimentation, we tested a network including six building blocks. The latter neural model is a residual net (hereinafter named DNN_RES) composed of three residual blocks. To make the comparison fair, both competitors have the same complexity as our model (i.e., the same number of neurons, optimizer, activation functions, training parameters, etc.). In Tables 6 and 7, we report the results of our analysis: basically, our model exhibits better performances than its competitors, obtaining the best results in terms of CSI, POD and MSE while maintaining the same level of FAR. In particular, if we focus the analysis on the detection of extreme rainfall events, DeepEns-REM represents the best trade-off between precision and recall; indeed, it obtains the best results in terms of recall while limiting the loss in terms of precision.

Table 6 CSI, FAR, POD, and MSE for DNN_BB, DNN_RES and DeepEns-REM. The values in bold (italic face) are significantly better (worse) than at least two other methods
Table 7 Precision, recall and F-measure for DNN_BB, DNN_RES and DeepEns-REM on the minority class (extreme rainfall events). The values in bold (italic face) are significantly better (worse) than at least two other methods

5.5 Analysis of statistical significance

In order to assess the significance of the differences observed between our approach and the other methods considered in the experiments presented in the previous subsections, the popular Friedman-Nemenyi statistical procedure [38, 39], widely used in the evaluation of classifiers, was adopted. This procedure essentially consists of two phases: first, the Friedman test is employed to possibly reject the null hypothesis \(H_0\) that the different populations of results (each produced by a distinct classifier-induction method) have the same mean, so that they can be considered statistically different (\(\alpha = 0.05\)). Then, if \(H_0\) is rejected, a post-hoc Nemenyi test with a significance level of 0.05 is used to detect all the pairs of methods that are significantly different from one another: if a pair of methods is assigned a p-value under 0.05, these methods are deemed significantly different from a statistical viewpoint.
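For reproducibility, the procedure can be sketched with SciPy and the third-party scikit-posthocs package (our tooling choice; the authors do not state theirs):

```python
# Sketch of the Friedman + Nemenyi procedure; the choice of SciPy and
# scikit-posthocs is our assumption, and `results` is placeholder data.
import numpy as np
import scikit_posthocs as sp
from scipy.stats import friedmanchisquare

# rows: runs/datasets, columns: methods (e.g., one POD value per run).
results = np.random.rand(30, 7)           # placeholder scores
stat, p = friedmanchisquare(*results.T)
if p < 0.05:                              # H0 rejected: methods differ
    # Pairwise Nemenyi p-values; pairs with p >= 0.05 would be connected
    # by a horizontal line in the CD diagrams of Figs. 8-10.
    pairwise_p = sp.posthoc_nemenyi_friedman(results)
    print(pairwise_p)
```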

This statistical study includes both the machine learning-based methods (i.e., SVR, Decision Tree, Boosting and Random Forest) and the domain-specific approaches (i.e., DNN_REM, Kriging and HPEC). Among the different metrics considered in our experiments, we show only the most relevant ones for our problem: POD and MSE over all the classes, and the F-measure for the extreme rainfall events.

Figures 8, 9 and 10 show the critical difference (CD) diagrams, respectively, for the POD, MSE and F-measure metrics (the latter for the case of extreme rainfall events), obtained by using the Friedman-Nemenyi procedure described above. In these diagrams, two methods are connected through a horizontal line iff they result not significantly different from one another.

Fig. 8
figure 8

Critical difference (CD) diagram for POD metric (Friedman test + Nemenyi test, \(\alpha = 0.05\)). All (and only) the pairs of methods resulting not significantly different according to the test (i.e., receiving a p value \(\ge 0.05\)) are connected through a horizontal line

Fig. 9
figure 9

Critical difference (CD) diagram for MSE metric (Friedman test + Nemenyi test, \(\alpha = 0.05\)). All (and only) the pairs of methods resulting not significantly different according to the test (i.e., receiving a p value \(\ge 0.05\)) are connected through a horizontal line

Fig. 10
figure 10

Critical difference (CD) diagram for F-measure metrics concerning the extreme rainfall events (Friedman test + Nemenyi test, \(\alpha = 0.05\)). All (and only) the pairs of methods resulting not significantly different according to the test (i.e., receiving a p value \(\ge 0.05\)) are connected through a horizontal line

From Fig. 8, all the methods are significantly different (with the exception of Random Forest and Boosting), and only the HPEC technique obtains better performance than our method, which outperforms all the others.

In terms of MSE and F-measure, our approach (DNN_REM) outperforms the other methods. However, in the case of MSE, it is not significantly better than Random Forest, while for the F-measure, it is not significantly better than the Kriging technique.

6 Conclusions and future works

A novel Machine Learning-based model for spatial field estimation has been proposed in this work. The main idea consists in merging Ensemble and Deep Learning techniques for estimating the rainfall intensity in areas not covered by rain gauge sensors. Specifically, a Rainfall Estimation Model (REM) is fed with heterogeneous data from different sources (i.e., the rain gauge network, radars and geostationary satellites) and automatically fuses these data so as to yield more accurate estimates. In our setting, the REM takes the form of an Ensemble of Deep Neural Networks built by using a Snapshot procedure and integrating Residual sub-nets, which guarantee fast convergence and scalability and make the approach apt to efficiently cope with large real datasets.

An experimental evaluation, conducted on a real application scenario concerning a peninsular region of Southern Italy (Calabria), demonstrates the effectiveness of our proposal in recognizing severe events in comparison with state-of-the-art approaches used in the literature to solve this task. In more detail, our approach exhibits significantly better rainfall detection capabilities in comparison both with other ML-based techniques and with a state-of-the-art technique in the domain of rainfall estimation. In addition, for the task of detecting extreme rainfall events, the obtained estimation offers a good trade-off between precision and recall.

In future work, we plan to investigate spatio-temporal correlations by adopting more sophisticated Deep Learning architectures that take these correlations into account, and to explore the accuracy of the method in detecting specific extreme rainfall events.