1 Introduction

Atmospheric extreme events (EEs, either weather, or climate-related) gravely impact societies (Horton et al. 2016), causing hundreds of thousands of deaths every year (De et al. 2004; Pörtner et al. 2022), and producing important collateral effects, such as migrations (Marchiori et al. 2012; Carrico and Donato 2019), infrastructure damages (May and Koski 2013), transportation problems (Trinks et al. 2012; Stamos et al. 2015), and damages to agriculture (Ciais et al. 2005; van der Velde et al. 2012; Lal et al. 2012) or ecosystems (Seneviratne et al. 2012; Knapp et al. 2008; Van Oijen et al. 2013; Woodward et al. 2016).

As the number and intensity of EEs have been increasing in the last few decades (likely as a consequence of climate change processes (Mitchell et al. 2006; Herring et al. 2015; Grant 2017)), so has the number of scientific studies on them. In this context, some classical problems associated with EEs are their analysis (Herring et al. 2015), detection (Zscheischler et al. 2013; Easterling et al. 2016), and causation/attribution to human activities (Stott et al. 2016; Hannart and Naveau 2018; Runge et al. 2019; Madakumbura et al. 2021). Also, different authors have focused their research on studying compound EEs (combinations of multiple EEs that contribute to societal or environmental risk) (Zscheischler and Seneviratne 2017; Zscheischler et al. 2020, 2018; Raymond et al. 2020), the relationship of EEs with different processes such as carbon cycle (Reichstein et al. 2013; Frank et al. 2015; van der Molen et al. 2011) or soil moisture (Hirschi et al. 2011; Whan et al. 2015), and the effects of EEs on economics (Chavez et al. 2015; Ackerman 2017) and their impact on human systems (Zscheischler et al. 2014), to name just a few.

Different mathematical and computational methods have been used to analyze and forecast EEs, including numerical weather methods (NWM) (Lavers and Villarini 2013; Yucel et al. 2015; Vitart and Robertson 2018), statistical and probability-based methods (Ferro 2007; Naveau et al. 2020; Sapsis 2021), non-linear physics and chaos theory (Ghil et al. 2011; Farazmand and Sapsis 2019; Chowdhury et al. 2022), and, in the last decade, a large number of machine learning (ML) and related techniques, a field with an exponentially growing presence in climate and atmospheric sciences (Monteleoni et al. 2013; Cohen et al. 2019), climate change studies (Rolnick et al. 2019), and Earth system science in general (Reichstein et al. 2019; Karpatne et al. 2018; Camps-Valls et al. 2019; Salcedo-Sanz et al. 2020; Bonavita et al. 2021; Irrgang et al. 2021). In recent years, deep learning (DL) algorithms, a particularly promising branch of ML, have also been applied to climate science problems (Kurth et al. 2018; Ardabili et al. 2019), where they have shown great potential to deal with different EE-related problems (Liu et al. 2016; Ren et al. 2020; Qi and Majda 2020; Fang et al. 2021).

In this paper, we discuss the most important ML methods applied to atmospheric EE-related problems, including DL approaches. Atmospheric EEs can be classified in terms of their physical characteristics and their impact on human society and ecosystems. In addition, different ML techniques have been associated with specific problems in EEs. For example, feature selection/extraction methods have usually been associated with the detection of EEs, since they allow the ML algorithms to select the most important features that trigger an EE. If specific drivers are included when training the ML models, the attribution of atmospheric EEs can be addressed. The attribution of EEs with ML involves applying the algorithms to data (measurements, reanalysis or simulations) from different periods and/or forcings. Finally, ML approaches have also been used to deal with prediction problems related to EEs. This is perhaps the most common application of ML in EEs, and works can be found for different prediction time horizons, from very short-term to seasonal prediction. With these ideas in mind, we have chosen a number of atmospheric EEs according to their impacts on human societies and ecosystems, and we review the ML methods applied to describe them: extreme precipitation and floods, extreme temperatures and heatwaves, droughts, severe weather and low-visibility events. We provide a comprehensive review of the works applying ML and DL algorithms to these EE problems, and we finally discuss a case study on ML and DL techniques focused on heatwave prediction, together with some final perspectives on this research area in the near future.

The rest of the paper is structured as follows: the next section will give a theoretical overview of some of the ML algorithms most commonly used for studying EEs. Section 3 presents a comprehensive analysis of existing literature on ML and DL techniques for atmospheric EEs problems. Section 4 presents a case study on heatwaves prediction with ML and DL techniques, while Sects. 5.2 and 5 provide conclusions, final perspectives and a general outlook on future research.

2 Machine learning methods

This section summarizes the most important ML and DL methods, and related techniques, commonly used in the analysis and prediction of EEs.

2.1 Feature selection methods and dimensionality reduction in ML and DL

For ML-based methods, using irrelevant or redundant features as inputs during training can be detrimental, not only because these additional features increase the training time, but also because they may hinder the generalisability of the models (Blum and Langley 1997). In its most general form, the feature selection problem (FSP) in ML can be defined as follows: given a set of labelled data samples \(\left\{ (\textbf{x}_1,y_1),\ldots ,(\textbf{x}_l,y_l)\right\} \), where \({\textbf{x}}_i \in \mathbb {R}^n\) and \(y_i \in \mathbb {R}\) (or \(y_i \in \{\pm 1\}\) for classification tasks), obtain a subset of m features (\(m<n\)) that produces the lowest prediction (or classification) error in the estimation of the variable \(y_i\).

Fig. 1 a Outline of a wrapper method; b outline of a filter method

There are many different approaches to dealing with the FSP (Zebari et al. 2020). In general, FS algorithms can be classified into three families:

  • The wrapper approach (John et al. 1994). Wrapper methods use the ML classifier/regressor in order to obtain the best set of features which minimizes an error measure. Figure 1a shows an outline of the wrapper approach. The interested reader can consult classical works on wrapper FSP approaches (Kohavi and John 1997; Yang and Honavar 1998).

  • The filter approach to the FSP is based on a completely different idea. In this case, the selection of the best features is based on an external measure calculated from the data, and the classification/regression algorithm is not taken into account. Figure 1b shows an example of a filter approach to the FSP. Note that filter methods are usually faster than wrapper methods, but in general, wrappers obtain better results, since they take into account the real performance of the classification/regression algorithm during the search. The interested reader can find further analysis of filter methods in Torkkola and Campbell (2000) and Torkkola (2002).

  • Finally, the mixed or hybrid approach. These methods combine wrapper and filter approaches into a single hybrid methodology. They have obtained good results in different specific applications (Ferreira and Figueiredo 2014; Huda et al. 2014; Solorio-Fernández et al. 2016).

Note that both wrapper and filter methods admit a binary representation for the FSP, where a 1 in the \(i\)-th position of the binary vector indicates that feature i is included in the subset of features, and a 0 means that it is not. Using this notation there are \(2^n\) different subsets of features to be evaluated (where n is the total number of features), and the problem consists of selecting the best one in terms of a given error measure, either internal (wrapper methods) or external (filter methods) to the classifier/regressor considered. Alternative encodings with integer numbers are also possible. Given the large search space generated by this encoding of the FSP, meta-heuristic approaches are commonly applied to search for the best set of features, mainly in the wrapper approach (Salcedo-Sanz et al. 2018).
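As an illustration of the binary encoding and of the difference between the two families, the following minimal sketch contrasts a filter and a wrapper on synthetic data. Several choices here are assumptions made only for the example: the filter score is the absolute Pearson correlation, the wrapper uses an ordinary least-squares regressor with a hold-out validation error, the search is restricted to masks with exactly m ones (so it can be enumerated exhaustively instead of using a meta-heuristic), and all function names are hypothetical.

```python
import itertools
import numpy as np

def filter_select(X, y, m):
    """Filter: rank features by an external measure (|Pearson correlation| with y)."""
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    top = np.argsort(scores)[::-1][:m]
    mask = np.zeros(X.shape[1], dtype=int)  # binary encoding: 1 = feature kept
    mask[top] = 1
    return mask

def wrapper_select(X_tr, y_tr, X_va, y_va, m):
    """Wrapper: score each candidate mask with the regressor's own validation error."""
    n = X_tr.shape[1]
    best_mask, best_err = None, np.inf
    for subset in itertools.combinations(range(n), m):
        cols = list(subset)
        # Least-squares regressor (with intercept) trained on the candidate subset.
        A_tr = np.column_stack([X_tr[:, cols], np.ones(len(X_tr))])
        coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
        A_va = np.column_stack([X_va[:, cols], np.ones(len(X_va))])
        err = np.mean((A_va @ coef - y_va) ** 2)
        if err < best_err:
            best_err = err
            best_mask = np.zeros(n, dtype=int)
            best_mask[cols] = 1
    return best_mask, best_err

# Synthetic data: only features 0 and 3 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)

filter_mask = filter_select(X[:100], y[:100], m=2)
wrapper_mask, wrapper_err = wrapper_select(X[:100], y[:100], X[100:], y[100:], m=2)
```

The filter never trains the regressor, whereas the wrapper trains it once per candidate mask, which is why the latter becomes expensive as n grows.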

2.1.1 Other dimensionality reduction methods in ML and DL

In addition to the classical feature selection methods shown above, there are different traditional dimensionality reduction methods (Ghodsi 2006; Van Der Maaten et al. 2009; Huang et al. 2019; Ghojogh et al. 2023) intended to improve the performance of ML and DL techniques. We review here some of the methods which have been used the most to improve ML and DL techniques in EE detection, prediction and attribution problems. For instance, the well-known principal component analysis (PCA) (Abdi and Williams 2010) aims to find a low-dimensional linear subspace that maintains most of the variability in the data. Linear Discriminant Analysis (LDA) (Balakrishnama and Ganapathiraju 1998), in turn, is based on the idea of finding a linear combination of features that characterizes or separates two or more classes of objects or events. Another example of a traditional dimensionality reduction technique is locally linear embedding (LLE) (Roweis and Saul 2000), a nonlinear approach that reduces dimensionality by computing a low-dimensional, neighbourhood-preserving embedding of high-dimensional data.
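The idea behind PCA can be sketched in a few lines. The example below is only an illustration under stated assumptions: it computes the principal directions through the singular value decomposition of the centred data matrix (one of several equivalent routes), and the function name and the synthetic data are hypothetical.

```python
import numpy as np

def pca(X, k):
    """Project X onto its first k principal components (SVD of the centred data)."""
    Xc = X - X.mean(axis=0)                        # centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                            # rows = principal directions
    explained = (S[:k] ** 2).sum() / (S ** 2).sum()  # fraction of variance kept
    return Xc @ components.T, components, explained

# Three features that are almost exact multiples of one hidden variable t.
rng = np.random.default_rng(0)
t = rng.normal(size=500)
X = np.column_stack([t, 2.0 * t, -t]) + 0.01 * rng.normal(size=(500, 3))

Z, components, explained = pca(X, k=1)
```

Here a single component retains almost all of the variability, since the three features lie close to a one-dimensional linear subspace.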

Fig. 2 AE structure

The autoencoder (AE) neural network can also be used for reducing the dimensionality of the data (Pinaya et al. 2020). An AE aims to reproduce its input at its output (Goodfellow et al. 2016). It is composed of two parts: the encoder and the decoder. The intermediate representation produced by the encoder is called the latent space, and it can be understood as a compact, meaningful representation of the data. The decoder then reconstructs the input from this representation as closely as possible (Fig. 2). A probabilistic framework was introduced with the variational AE (VAE) (Kingma and Welling 2013). One of the main differences between AEs and VAEs is the latent space representation (Fig. 3). The AE maps each input to a single point of the latent space, so that a unique encoding is found for each input. In the VAE, each input is instead mapped to a probability distribution over the latent space, and its encoding is a sample drawn from that distribution. Another difference is related to the loss function. While the AE minimizes a reconstruction loss between the input and the output, the VAE optimizes two terms. The first one is the reconstruction loss, whilst the second one is the Kullback–Leibler divergence, which pushes the latent space to follow the desired probability distribution. These differences between AE and VAE are important in some applications.
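The difference between the two loss functions can be made concrete with a small numerical sketch. The assumptions here go beyond the text and are made only for illustration: the reconstruction loss is a mean squared error, the encoder outputs the mean and log-variance of a diagonal Gaussian, the desired latent distribution is a standard normal (for which the Kullback–Leibler term has a closed form), and the weighting factor `beta` and the function names are hypothetical.

```python
import numpy as np

def ae_loss(x, x_rec):
    """AE: reconstruction loss only (mean squared error assumed here)."""
    return float(np.mean((x - x_rec) ** 2))

def vae_loss(x, x_rec, mu, log_var, beta=1.0):
    """VAE: reconstruction loss plus KL divergence to a standard normal prior."""
    rec = float(np.mean((x - x_rec) ** 2))
    # Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).
    kl = float(-0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var)))
    return rec + beta * kl, rec, kl

x = np.array([1.0, 2.0])
x_rec = np.array([1.0, 1.0])
mu = np.array([1.0, 0.0])        # encoder mean for this input
log_var = np.array([0.0, 0.0])   # encoder log-variance (unit variance)

total, rec, kl = vae_loss(x, x_rec, mu, log_var)
```

The Kullback–Leibler term vanishes only when the encoder outputs zero mean and unit variance, which is how it pulls the latent codes towards the chosen prior.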

Fig. 3 VAE structure

2.2 Multi-layer perceptrons

A multi-layer perceptron (MLP) is a particular class of artificial neural network (ANN), which has been successfully applied to solve a large variety of non-linear problems, mainly classification and regression tasks (Haykin and Network 2004; Bishop 1995). The MLP consists of an input layer, a number of hidden layers, and an output layer, each made up of a number of processing units called neurons. The neurons in the network are connected to other neurons by means of weighted links (see Fig. 4). In a feedforward MLP, the neurons within a given layer are connected to those of the previous layer. The values of these weights determine the ability of the MLP to learn the problem, and they are learned from a sufficiently large number of examples. The process of assigning values to these weights from labelled examples is known as the training process of the perceptron. Adequate weight values minimize the error between the output given by the MLP and the corresponding expected output in the training set. The number of neurons in each hidden layer is also a hyperparameter to be optimized (Haykin and Network 2004; Bishop 1995).

Fig. 4 Structure of a multi-layer perceptron neural network, with one hidden layer

The input data for the MLP consists of a number of samples arranged as input vectors \(\{\textbf{x}^i\in \mathbb {R}^n\}_{i=1}^N\), with each input vector \(\textbf{x}^i=(x^i_1,\cdots ,x^i_n)\). Once an MLP has been properly trained, it can be tested on data it did not see during training to evaluate its performance, in terms of how well the learned weights can transform the given input into a desired output \(\vartheta \in \mathbb {R}\). The relationship between the output \(\vartheta \) and a generic input signal \(\textbf{x}=(x_1,\cdots ,x_n)\) of a neuron is given by:

$$\begin{aligned} \vartheta (\textbf{x})=\varphi \left( \sum _{j=1}^n w_j x_j- b\right) , \end{aligned}$$
(1)

where \(\vartheta \) is the output signal, \(x_j\) for \(j=1,\ldots ,n\) are the input signals, \(w_j\) is the weight associated with the j-th input, b is the bias term (Haykin and Network 2004; Bishop 1995), and \(\varphi \) is some function chosen based on the type of layer to which it needs to be applied, for example the logistic function (among other possibilities):

$$\begin{aligned} \varphi (x)=\frac{1}{1+e^{-x}}. \end{aligned}$$
(2)
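Equations (1) and (2) translate directly into code. The following minimal sketch evaluates a single neuron and then chains neurons into a network with one hidden layer; the function names and the numerical values are hypothetical, and the logistic activation is used in every layer only for simplicity.

```python
import math

def logistic(x):
    """Logistic activation, Eq. (2)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(x, w, b, phi=logistic):
    """Eq. (1): activation of the weighted sum of the inputs minus the bias."""
    s = sum(wj * xj for wj, xj in zip(w, x)) - b
    return phi(s)

def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """Forward pass of an MLP with one hidden layer and a single output neuron."""
    hidden = [neuron_output(x, w, b) for w, b in zip(hidden_w, hidden_b)]
    return neuron_output(hidden, out_w, out_b)

# Single neuron: 0.5*1 + 0.25*2 - 1 = 0, and the logistic function gives 0.5 at 0.
single = neuron_output([1.0, 2.0], [0.5, 0.25], 1.0)

# Two hidden neurons feeding one output neuron (arbitrary illustrative weights).
out = mlp_forward([1.0, 2.0],
                  hidden_w=[[0.5, 0.25], [-1.0, 0.3]],
                  hidden_b=[1.0, 0.0],
                  out_w=[1.5, -2.0],
                  out_b=0.1)
```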

The well-known stochastic gradient descent (SGD) algorithm is often applied to train MLPs (Rumelhart et al. 1986). There are also alternative training algorithms for MLP which have shown excellent performance in different problems, such as the Levenberg-Marquardt algorithm (Hagan and Menhaj 1994), or the ADAM and RMSProp optimizers for training deep versions of the networks (Zhang 2018; Zou et al. 2019).

2.2.1 Extreme learning machines

An extreme learning machine (ELM) (Huang et al. 2006) is a type of training method for multi-layer perceptrons, characterized by being computationally faster than traditional gradient backpropagation (Hecht-Nielsen 1992). In the ELM algorithm, the weights between the inputs and the hidden nodes are set at random, usually by using a uniform probability distribution. Then, the output matrix of the hidden layer is established and the Moore-Penrose pseudo-inverse of this matrix is computed. The optimal values of the weights belonging to the output layer are directly obtained by multiplying the computed pseudo-inverse matrix by the target (see Huang et al. (2011) for details). The ELM obtains competitive results with respect to other classical training methods, while its training is computationally more efficient than that of other classification or regression approaches such as SVM algorithms or MLPs trained by backpropagation (Huang et al. 2011).

Mathematically, the ELM algorithm considers a training set \(\lbrace ({\textbf {x}}_i,y_i)\rbrace _{i=1}^n\) and fits the output weights \(\beta _k\) associated with each of the \(\tilde{N}\) hidden nodes so that the output has minimum mean squared error. The training process consists of the following steps:

  1. The input weights \({\textbf {w}}_k\) and the biases \(b_k\), where \(k = 1, \ldots ,\tilde{N}\), are randomly chosen following a uniform distribution with support \([-1,1]\).

  2. The hidden-layer output matrix \({\textbf {H}}\) is computed as follows:

    $$\begin{aligned} {\textbf {H}} = \left[ \begin{array}{ccc} g( {\textbf {w}}_1 {\textbf {x}}_1 + b_1) &{} \cdots &{} g({\textbf {w}}_{\tilde{N}} {\textbf {x}}_1 + b_{\tilde{N}}) \\ \vdots &{} \ddots &{} \vdots \\ g({\textbf {w}}_1 {\textbf {x}}_n + b_1) &{} \cdots &{} g({\textbf {w}}_{\tilde{N}} {\textbf {x}}_n + b_{\tilde{N}}) \end{array} \right] _{n \times \tilde{N}} \end{aligned}$$
    (3)

    where \(g(\cdot )\) is the activation function.

  3. The training problem is reduced to an optimization problem over the parameter vector \(\varvec{\beta }\), which can be defined as:

    $$\begin{aligned} \min \limits _{\varvec{\beta }} \Vert {\textbf {H}} \varvec{\beta }-{\textbf {Y}}^T\Vert . \end{aligned}$$
    (4)

  4. The output layer weights \(\varvec{\beta }\) are obtained by means of the following expression:

    $$\begin{aligned} \varvec{\beta }= {\textbf {H}}^\dagger {\textbf {Y}}^T, \end{aligned}$$
    (5)

    where \({\textbf {Y}}^T\) stands for the transpose of the training output vector \({\textbf {Y}}=[y_1,\ldots ,y_n]\) and \({\textbf {H}}^\dagger \) refers to the Moore-Penrose pseudo-inverse of the hidden-layer matrix \({\textbf {H}}\) (Huang et al. 2006).

  5. Finally, the predicted or classified output for new inputs is obtained as \(\hat{Y}(\textbf{x}) = {\textbf {H}} \varvec{\beta }\), where \({\textbf {H}}\) is evaluated on those inputs.

The number of hidden nodes \(\tilde{N}\) is a hyperparameter that can be tuned to improve the ELM performance.
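The steps above can be sketched with NumPy as follows. This is a minimal illustration rather than a reference implementation: the hyperbolic tangent is assumed as the activation \(g(\cdot )\), the toy regression target (a sine curve) and the function names are hypothetical, and the pseudo-inverse is computed with `numpy.linalg.pinv`.

```python
import numpy as np

def elm_fit(X, y, n_hidden, rng):
    """Train an ELM: random hidden layer, output weights from the pseudo-inverse."""
    # Step 1: input weights and biases drawn uniformly in [-1, 1].
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    # Step 2: hidden-layer output matrix H, Eq. (3), with g = tanh (an assumption).
    H = np.tanh(X @ W + b)
    # Steps 3-4: least-squares output weights, Eq. (5).
    beta = np.linalg.pinv(H) @ y
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Step 5: evaluate H on the given inputs and multiply by beta."""
    return np.tanh(X @ W + b) @ beta

rng = np.random.default_rng(0)
X = np.linspace(-3.0, 3.0, 200).reshape(-1, 1)
y = np.sin(X).ravel()

W, b, beta = elm_fit(X, y, n_hidden=30, rng=rng)
mse = float(np.mean((elm_predict(X, W, b, beta) - y) ** 2))
```

No iterative optimization is involved: the only trained parameters are the output weights, obtained in a single linear-algebra step.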

2.3 Support vector machines

A support vector machine (SVM) (Schölkopf et al. 2002, 2000) is a statistical learning algorithm for classification problems defined as follows: given a labelled training data set \(\{\textbf{x}_i,y_i\}_{i=1}^n\), where \({\textbf{x}}_{i}\in {\mathbb {R}}^{N}\) and \(y_i\in \{-1,\,+1\}\), and given a nonlinear mapping \({\varvec{\phi }}(\cdot )\), the SVM method solves the following problem:

$$\begin{aligned} \min _{\textbf{w},\xi _{i},b} \left\{ \frac{1}{2}\Vert \textbf{w}\Vert ^{2}+C\sum _{i=1}^n \xi _{i}\right\} \end{aligned}$$
(6)

constrained to:

$$\begin{aligned} \begin{aligned}&y_i \left( \textbf{w}^\top \phi (\textbf{x}_i) + b\right) \ge 1 - \xi _i,&1\le i \le n\\&\xi _i \ge 0,&1\le i \le n \end{aligned}, \end{aligned}$$
(7)

where \(\textbf{w}\) and b define a linear classifier in feature space, and \(\xi _i\) are non-negative slack variables that account for permitted errors (Fig. 5). An appropriate choice of the nonlinear mapping \(\varvec{\phi }\) makes the transformed samples more likely to be linearly separable in the (higher dimensional) feature space. The regularization hyperparameter C controls the generalization capability of the classifier, and it must be selected by the user. The core problem (6) is solved using its dual problem counterpart (Schölkopf et al. 2002), and the decision function for any test vector \(\textbf{x}_*\) is finally given by

$$\begin{aligned} f(\textbf{x}_*) = {\text {sgn}}\left( \sum _{i=1}^n y_i\alpha _i K(\textbf{x}_i,\textbf{x}_*) + b\right) \end{aligned}$$
(8)

where \(\alpha _i\) are the Lagrange multipliers corresponding to the constraints in (7); the support vectors (SVs) are those training samples \(\textbf{x}_i\) with non-zero Lagrange multipliers \(\alpha _i \ne 0\); \(K({\textbf{x}}_i,{\textbf{x}}_*)\) is an element of a kernel matrix \(\textbf{K}\) (Schölkopf et al. 2002); and the bias term b is calculated from the unbounded Lagrange multipliers as \(b = \frac{1}{k} \sum _{i=1}^k (y_i - \langle \varvec{\phi }({\textbf{x}}_i),\textbf{w}\rangle )\), where k is the number of unbounded Lagrange multipliers (\(0< \alpha _i < C\)) and \(\textbf{w} = \sum _{i=1}^n y_i \alpha _i \varvec{\phi }({\textbf{x}}_i)\) (Schölkopf et al. 2002).
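A worked toy example may help to fix the notation of Eq. (8) and of the bias computation. The sketch below assumes a linear kernel (so that \(\varvec{\phi }\) is the identity) and a two-sample problem whose dual solution, \(\alpha _1=\alpha _2=0.5\), was derived by hand for this illustration and not produced by a solver; all function names are hypothetical.

```python
def linear_kernel(u, v):
    """K(u, v) = <u, v>, i.e. phi is the identity mapping (an assumption)."""
    return sum(ui * vi for ui, vi in zip(u, v))

def bias_from_unbounded(X, y, alpha, C, kernel):
    """Mean of y_i - <phi(x_i), w> over the unbounded multipliers 0 < alpha_i < C."""
    idx = [i for i, a in enumerate(alpha) if 0.0 < a < C]
    total = 0.0
    for i in idx:
        # <phi(x_i), w> with w = sum_j y_j alpha_j phi(x_j), written with the kernel.
        wx = sum(y[j] * alpha[j] * kernel(X[j], X[i]) for j in range(len(X)))
        total += y[i] - wx
    return total / len(idx)

def decide(x_new, X, y, alpha, b, kernel):
    """Decision function of Eq. (8)."""
    s = sum(y[i] * alpha[i] * kernel(X[i], x_new) for i in range(len(X))) + b
    return 1 if s >= 0 else -1

# Toy problem: one sample per class, symmetric about the origin.
X = [(1.0, 0.0), (-1.0, 0.0)]
y = [1, -1]
alpha = [0.5, 0.5]   # hand-derived dual solution for this toy problem
C = 1.0

b = bias_from_unbounded(X, y, alpha, C, linear_kernel)
label_a = decide((2.0, 3.0), X, y, alpha, b, linear_kernel)
label_b = decide((-0.5, 4.0), X, y, alpha, b, linear_kernel)
```

In this example both samples are support vectors, \(\textbf{w}=(1,0)\), and the separating hyperplane is the vertical axis.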

Fig. 5 Illustration of the SVM process: linear decision hyperplanes in a nonlinearly transformed, feature space, where slack variables \(\xi _i\) are included to deal with errors

Fig. 6 Example of a support-vector-regression process for a two-dimensional-regression problem, with an \(\epsilon \)-insensitive loss function

2.3.1 Support vector regression

Support vector regression (SVR) (Smola and Schölkopf 2004) is a well-established algorithm for regression and function approximation problems. SVR balances the approximation error on the training data against the ability of the model to generalize when new data are evaluated. Although there are several versions of the SVR algorithm, we show the classical model (\(\epsilon \)-SVR) described in detail in Smola and Schölkopf (2004), which has been used for a large number of problems and applications in science and engineering (Salcedo-Sanz et al. 2014).

The \(\epsilon \)-SVR method for regression starts from a given set of training vectors \(\{(\textbf{x}_i,\vartheta _i)\}_{i=1}^N\), where \({\textbf{x}}_{i}\in {\mathbb {R}}^{n}\) and \(\vartheta _i\in { \mathbb {R}}\), and models the input–output relation as follows:

$$\begin{aligned} \hat{\vartheta }(\textbf{x})=g(\textbf{x})+b = \textbf{w}^T\phi (\textbf{x}) + b, \end{aligned}$$
(9)

where \(\textbf{x}_i\) represents the input vector of predictive variables, \(\vartheta _i\) stands for the value of the objective variable \(\vartheta \) corresponding to the input vector \(\textbf{x}_i\) and \(\hat{\vartheta }(\textbf{x})\) represents the model which estimates \(\vartheta (\textbf{x})\). The parameters \((\textbf{w},b)\) are determined in order to match the set of training pairs, where the bias parameter b appears separated here. The function \(\phi (\textbf{x})\) projects the input space onto the feature space. During training, the algorithm seeks the parameters of the model which minimize the following risk function:

$$\begin{aligned} R[\hat{\vartheta }] = \frac{1}{2} \Vert {\textbf {w}} \Vert ^2 + C \sum _{i=1}^N L\left( \vartheta _i,\hat{\vartheta }(\textbf{x}_i)\right) , \end{aligned}$$
(10)

where the norm of \(\textbf{w}\) controls the smoothness of the model and \(L\left( \vartheta _i,\hat{\vartheta }(\textbf{x}_i)\right) \) stands for the selected loss function. We use the \(L^1\)-norm modified for the SVR and characterized by the \(\epsilon \)-insensitive loss function (Smola and Schölkopf 2004):

$$\begin{aligned} L\left( \vartheta _i,g(\textbf{x}_i)\right) =\left\{ \begin{aligned}&0{} & {} \text{ if }~~|\vartheta _i-g(\textbf{x}_i) |\le \epsilon \\&|\vartheta _i-g(\textbf{x}_i)|-\epsilon{} & {} \text{ otherwise }. \end{aligned} \right. \end{aligned}$$
(11)

Figure 6 shows an example of the process of an SVR for a two-dimensional regression problem, with an \(\epsilon \)-insensitive loss function.
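The \(\epsilon \)-insensitive loss of Eq. (11) and the risk of Eq. (10) can be evaluated with a few lines of code. The sketch below assumes a linear model (\(\phi \) equal to the identity) purely for illustration; the function names and numerical values are hypothetical.

```python
def eps_insensitive(target, prediction, eps):
    """Eq. (11): errors smaller than eps cost nothing; larger ones grow linearly."""
    return max(0.0, abs(target - prediction) - eps)

def svr_risk(w, b, X, targets, C, eps):
    """Eq. (10) for a linear model w.x + b (phi = identity, an assumption)."""
    reg = 0.5 * sum(wj * wj for wj in w)  # smoothness term, (1/2)||w||^2
    loss = 0.0
    for x, t in zip(X, targets):
        pred = sum(wj * xj for wj, xj in zip(w, x)) + b
        loss += eps_insensitive(t, pred, eps)
    return reg + C * loss

inside_tube = eps_insensitive(1.0, 1.05, eps=0.1)   # error 0.05 <= eps, so 0
outside_tube = eps_insensitive(1.0, 1.5, eps=0.1)   # error 0.5, so 0.5 - 0.1

# Predictions are 1 and 2 for targets 1 and 3: losses 0 and 0.5, risk 0.5 + 2*0.5.
risk = svr_risk(w=[1.0], b=0.0, X=[[1.0], [2.0]], targets=[1.0, 3.0], C=2.0, eps=0.5)
```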

Fig. 7 Diagram of the bagging technique used for classification or regression problems in ML

To train this model, it is necessary to solve the following optimization problem (Smola and Schölkopf 2004):

$$\begin{aligned} \begin{aligned} \min _{\textbf{w},b,\varvec{\xi },\varvec{\xi }^*} \quad&\frac{1}{2}\Vert \textbf{w} \Vert ^2+C\sum _{i=1}^{N}{(\xi _{i} + \xi _i^*)},\\ \text {s.t.} \quad&\vartheta _i - \textbf{w}^T\phi (\textbf{x}_i) - b \le \epsilon + \xi _i,&1\le i \le N,\\&-\vartheta _i + \textbf{w}^T\phi (\textbf{x}_i) + b \le \epsilon + \xi _i^*, \quad&1\le i \le N,\\&\xi _i,\xi _i^* \ge 0,&1\le i \le N. \end{aligned} \end{aligned}$$
(12)

The dual form of this optimization problem is obtained through the minimization of a Lagrange function, which is constructed from the objective function and the problem constraints:

$$\begin{aligned} \begin{aligned} \max _{{\varvec{\alpha }},{\varvec{\alpha }}^*} \quad&-\frac{1}{2} \sum _{i,j=1}^N{(\alpha _i-\alpha _i^*)(\alpha _j-\alpha _j^*)K(\textbf{x}_i,\textbf{x}_j)}\\&- \epsilon \sum _{i=1}^N{(\alpha _i+\alpha _i^*)} + \sum _{i=1}^N{\vartheta _i(\alpha _i-\alpha _i^*)}\\ \mathrm {s.t.} \quad&\sum _{i=1}^N (\alpha _i-\alpha _i^*) = 0,\\&\alpha _i,\alpha _i^* \ge 0,&1\le i \le N,\\&-\alpha _i,-\alpha _i^* \ge -C,&1\le i \le N. \end{aligned} \end{aligned}$$
(13)

In the dual formulation of the problem, the function \(K(\textbf{x}_i,\textbf{x}_j)\) represents the inner product \(\langle \phi (\textbf{x}_i),\phi (\textbf{x}_j) \rangle \) in the feature space. Any function \(K(\textbf{x}_i,\textbf{x}_j)\) can be used as a kernel function as long as it satisfies the properties of an inner product, i.e., it is symmetric and positive semi-definite. It is very common to use the Gaussian radial basis function:

$$\begin{aligned} K(\textbf{x}_i,\textbf{x}_j)=\exp (-\gamma \left\| \textbf{x}_i-\textbf{x}_j\right\| ^2). \end{aligned}$$
(14)

The final form of the function \(g(\textbf{x})\) depends on the Lagrange multipliers \(\alpha _i,\alpha _i^*\) as:

$$\begin{aligned} g(\textbf{x}) = \sum _{i=1}^N (\alpha _i-\alpha _i^*) K(\textbf{x}_i,\textbf{x}). \end{aligned}$$
(15)

Incorporating the bias, the estimation of the objective variable is finally given by the following expression:

$$\begin{aligned} \hat{\vartheta }(\textbf{x})=g(\textbf{x})+b=\sum _{i=1}^N (\alpha _i-\alpha _i^*) K(\textbf{x}_i,\textbf{x})+b \text{. } \end{aligned}$$
(16)
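Once the multipliers are known, Eqs. (14)–(16) reduce prediction to a kernel evaluation followed by a weighted sum. The sketch below assumes hypothetical multipliers chosen by hand so that they satisfy the constraints of the dual problem (13); they are not the output of a solver, and the function names are illustrative only.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian RBF kernel of Eq. (14) between the rows of A and the rows of B."""
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negative round-off

def svr_predict(X_new, X_train, alpha, alpha_star, b, gamma):
    """Eq. (16): kernel expansion over the training samples plus the bias."""
    return rbf_kernel(X_new, X_train, gamma) @ (alpha - alpha_star) + b

def dual_feasible(alpha, alpha_star, C, tol=1e-12):
    """Check the constraints of the dual problem (13)."""
    box = np.all((alpha >= 0) & (alpha <= C) & (alpha_star >= 0) & (alpha_star <= C))
    return bool(box and abs(np.sum(alpha - alpha_star)) <= tol)

X_train = np.array([[0.0], [1.0]])
alpha = np.array([0.5, 0.0])        # hypothetical multipliers, not from a solver
alpha_star = np.array([0.0, 0.5])
b, gamma, C = 0.2, 1.0, 1.0

K = rbf_kernel(X_train, X_train, gamma)
feasible = dual_feasible(alpha, alpha_star, C)
pred = svr_predict(np.array([[0.0]]), X_train, alpha, alpha_star, b, gamma)
```

Only the samples with \(\alpha _i-\alpha _i^* \ne 0\) contribute to the sum, so prediction cost scales with the number of support vectors rather than with the full training set.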

2.4 Ensemble methods

Ensemble methods overcome the (potential) limitations in the predictive performance of a single learning model by relying on the randomized combination of several of them (Zhou 2012). This paradigm assumes that combinations of several simple ML models can greatly outperform a single such model (González et al. 2020), and rival the robustness or generalization capacity of complex ML models, such as artificial neural networks, which involve a huge number of parameters.

2.4.1 Bagging

The basic idea behind bagging (bootstrap aggregating) is to train a set of simple models and combine their individual predictions as shown in Fig. 7. Bagging reduces the variance of the ensemble's predictions and helps avoid overfitting, which is usually more severe in complex ML methods. Bagging establishes that all the base ML models which compose the ensemble have the same architecture, which results in the same topology, number of input–output variables and number of parameters to train. As an example, a set of decision trees trained with the bagging technique assumes that all trees have the same branches, with the same number of parameters and the same input–output variables (see Fig. 7). The individual models of the ensemble differ in the values that are learned for the model parameters, which are trained with different training sets.

Fig. 8 Diagram of the AdaBoost algorithm exemplified for multi-class classification problems. Different size circles stand for samples with more associated weight (w) due to mis-classification in the previous step (marked with X)

The mathematical description of the bagging technique is as follows: Let \(\mathcal {D}=\lbrace (\textbf{x}_i,y_i)\rbrace _{i=1}^n\) be a given training set of n input–output pairs. The procedure of bagging, shown in Fig. 7, generates N new training sets \(\mathcal {D}_{i}\), each of size \(n'\), by sampling from the set \(\mathcal {D}\) with replacement, so that samples can be repeated within each \(\mathcal {D}_{i}\). Each set created in this way is known as a bootstrap sample. Then, the parameters of N models with identical architecture \(\lbrace \mathcal {M}_i\rbrace _{i=1}^N\) are learned by training each model \(\mathcal {M}_i\) on the respective subset \(\mathcal {D}_{i}\). Finally, the ensemble model combines the individual outputs of each model by averaging their outputs (in the case of regression problems) or by majority voting (if dealing with classification problems) (Mohandes et al. 2018).

Bagging models can be deemed the simplest way to create ensembles. Note that each base model \(\mathcal {M}_{i}\) is trained independently, with no influence from the others. This property allows the base models to be trained in parallel, which drastically reduces the training time of the ensemble.
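The procedure can be sketched for a regression problem as follows. The choices made here are assumptions for illustration only: each base model \(\mathcal {M}_i\) is a least-squares line fitted to one-dimensional data, the bootstrap sets have the same size as \(\mathcal {D}\), and all function names and the synthetic data are hypothetical.

```python
import random

def fit_line(points):
    """Base model: least-squares line y = slope * x + intercept."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    if sxx == 0.0:               # degenerate bootstrap sample: constant model
        return 0.0, my
    slope = sum((x - mx) * (y - my) for x, y in points) / sxx
    return slope, my - slope * mx

def bagging_fit(data, n_models, sample_size, rng):
    """Train one base model per bootstrap sample (sampling with replacement)."""
    models = []
    for _ in range(n_models):
        boot = [rng.choice(data) for _ in range(sample_size)]
        models.append(fit_line(boot))
    return models

def bagging_predict(models, x):
    """Regression ensemble: average of the individual predictions."""
    return sum(slope * x + intercept for slope, intercept in models) / len(models)

rng = random.Random(0)
# Noisy samples of y = 2x + 1.
data = [(float(x), 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)) for x in range(20)]

models = bagging_fit(data, n_models=25, sample_size=len(data), rng=rng)
prediction = bagging_predict(models, 10.0)
```

Since the loop iterations do not depend on each other, the N fits could be distributed across processes without changing the result.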

Random forests (RF) (Breiman 2001) are among the most commonly applied bagging techniques for classification and regression problems. They specifically use decision or regression trees as learners and differ from pure bagging techniques in that the topology of the trees is not universally fixed. Trees of the ensemble (the forest) may have different depths and topologies, or use different input variables, which greatly increases the variability of the learners, but departs from the bagging paradigm from a theoretical viewpoint. The main advantage of RFs over traditional bagging is that by adopting slightly different models in their ensemble, the limitations of each are averaged out, resulting in improved generalization capacity (Breiman 2001).

The RF training procedure consists of the following steps. Let \(\lbrace ({\textbf {x}}_i,{y}_i)\rbrace _{i=1}^n\) be the training dataset. The main hyperparameters to be adjusted are: N, which is the number of estimators (namely, the number of tree learners composing the forest); and maxDepth, which is the maximum number of features to be explored as a node splitting criterion, which is often set to the square root of the number of features. Once these parameters are set, the method works as follows:

  1. Initialize each one of the N decision or regression trees for the classification or regression problem, respectively.

  2. For each tree \({\textbf {T}}_t\), select \(n_t\) samples with replacement, by using the bootstrapping technique.

  3. Only a subset of at most maxDepth features is considered for the construction of each tree.

  4. Each tree \({\textbf {T}}_t\) produces its own prediction.

  5. The ensemble output of the random forest method is computed by majority voting in the case of classification:

    $$\begin{aligned} \hat{Y}(\textbf{x}) = \mathop {\mathrm {arg\,max}}\limits _l\sum _{t=1}^N[{\textbf {T}}_t(\textbf{x})=l], \end{aligned}$$
    (17)

    or by averaging for regression problems:

    $$\begin{aligned} \hat{Y}(\textbf{x}) = \frac{1}{N}\sum _{t=1}^N {\textbf {T}}_t(\textbf{x}). \end{aligned}$$
    (18)
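The two aggregation rules of Eqs. (17) and (18), together with the random choice of candidate features, can be sketched as follows. Tree construction itself is omitted; the tree outputs below are made-up values, the function names are hypothetical, and the square-root rule for the number of candidate features (the hyperparameter called maxDepth in the text) is the common default mentioned above.

```python
import math
import random
from collections import Counter

def majority_vote(labels):
    """Eq. (17): the class predicted by the largest number of trees."""
    return Counter(labels).most_common(1)[0][0]

def average(values):
    """Eq. (18): unweighted mean of the tree outputs."""
    return sum(values) / len(values)

def random_feature_subset(n_features, rng):
    """Candidate features for a split: about sqrt(n_features), without repetition."""
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))

# Hypothetical outputs of N = 3 trees for one input sample.
vote = majority_vote(["heatwave", "normal", "heatwave"])
mean_output = average([1.0, 2.0, 3.0])

rng = random.Random(0)
subset = random_feature_subset(16, rng)
```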

2.4.2 Boosting

Boosting approaches are an alternative family of ensemble algorithms which perform well in both classification and regression problems (Ferreira and Figueiredo 2012). Similarly to bagging, boosting follows the learning paradigm of using simple (or “weak”) ML models (classifiers/regressors), named learners, to form a powerful final model that combines their outputs. Also similarly to bagging, boosting establishes the same topology for all the learners involved in the ensemble (same architecture, number of input–output variables, and number of parameters to train). The most evident difference from bagging lies in the procedure for training weak learners. In bagging, the weak learners are trained in parallel using different subsets of data \(\mathcal {D}_i\) randomly sampled from the whole training dataset \(\mathcal {D}\). In boosting, the learners are trained sequentially (see Fig. 8). In this way, subsequent learners are dependent on previously trained ones, contrary to the learners in bagging methods. Furthermore, in boosting all the learners use the whole set of training data for computing their parameters, i.e., there is no bootstrap sampling step.

Another important difference is that in bagging all input–output pairs are equally weighted to train each learner, and each learner contributes equally to the final output of the ensemble model. In boosting, the training input–output pairs are weighted according to how accurately they were predicted by the previous learner (except for the first learner in the queue, which uses equally weighted samples). Consequently, learners placed later in the queue are more specialized. Furthermore, the contribution of each learner to the output of the ensemble is usually weighted according to its accuracy, which does not happen in bagging. This is the general scheme for all boosting methods, but there exist different boosting strategies depending on the kind of weighting policy applied to each training sample, and/or to the output of each learner.

A widely used boosting technique is Adaptive Boosting (AdaBoost). AdaBoost trains each weak learner in such a way that each learner focuses on the data that was misclassified by its predecessor, so that learners further down the queue iteratively learn to adapt their parameters and achieve better results (Ferreira and Figueiredo 2012; González et al. 2020). Multiple variants of the AdaBoost algorithm exist, starting from the original one (Freund and Schapire 1997), designed to tackle binary classification problems, and including versions for regression and multi-class classification. Figure 8 shows an outline of the AdaBoost algorithm for multi-class classification. The pseudocode for AdaBoost can be described as follows:

  1.

    Let \(\mathcal {D}=\lbrace ({\textbf {x}}_i,y_i)\rbrace _{i=1}^n\) be the training dataset. The first step is to initialise each base learner \(\lbrace {\textbf {T}}_t\;\vert \;1\le t \le N\rbrace \), and assign the set of sample weights \(\lbrace {w}_i\;\vert \;1\le i \le n\rbrace \) corresponding to the input–output pairs \(\lbrace ({\textbf {x}}_i,y_i)\rbrace _{i=1}^n\) according to the uniform distribution: \({w}_i = \frac{1}{n}\).

  2.

    For each base learner \({\textbf {T}}_t\), the training dataset is used with the distribution of weights \({w}_i\) for training.

  3.

    After this training process, for each base learner \({\textbf {T}}_t\), the estimation error \(\epsilon _t\) is computed as:

    $$\begin{aligned} \epsilon _t=\sum _{{\textbf {T}}_t({\textbf {x}}_i)\ne {y}_i} \frac{w_i}{\sum _{{\textbf {x}}_i} w_i},\quad 1\le i\le n \end{aligned}$$
    (19)
  4.

    From this error is derived the weight of the current base learner for the ensemble output \(\alpha _t\):

    $$\begin{aligned} \alpha _t = \log \frac{1-\epsilon _t}{\epsilon _t} \end{aligned}$$
    (20)
  5.

    Finally, the distribution of the weights \({w}_i\) corresponding to each \({\textbf {x}}_i\), which will be used in the next learner, is proportionally adjusted to the probability that a sample is correctly estimated, and inversely proportional to the error of the learner \(\epsilon _t\).

  6.

    The final output, provided by the algorithm globally, will be:

    $$\begin{aligned} \hat{Y}(\textbf{x}) = \mathop {\mathrm {arg\,max}}\limits _l\sum _{t=1}^N[\alpha _t ({\textbf {T}}_t(\textbf{x})=l)]. \end{aligned}$$
    (21)

    This final function refers to the boosting method for classification problems, which simply integrates the weighted output of individual learners by voting. In regression problems, the output consists of computing a weighted average of the outputs:

    $$\begin{aligned} \hat{Y}(\textbf{x}) = \frac{1}{N}\sum _{t=1}^N\alpha _t {\textbf {T}}_t(\textbf{x}). \end{aligned}$$
    (22)

The main difference between this algorithm and the multi-class variant AdaBoost.M1 (Freund and Schapire 1997) is that, in the latter, only the weights of the correctly classified samples are lowered (\(w_i \leftarrow w_i \frac{\epsilon _t}{1-\epsilon _t}\)).
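As an illustration, the AdaBoost steps listed above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in any of the cited works: the weak learners are decision stumps (one feature, one threshold), step 5 is assumed to follow the AdaBoost.M1-style update in which only correctly classified samples are down-weighted by \(\epsilon _t/(1-\epsilon _t)\) before renormalising, and the early stop for \(\epsilon _t \ge 0.5\) is an added safeguard. All function names and the toy dataset are hypothetical.

```python
import math

def stump_predict(stump, x):
    """A decision stump: one feature, one threshold, one label per side."""
    feature, threshold, left_label, right_label = stump
    return left_label if x[feature] <= threshold else right_label

def train_stump(X, y, w):
    """Step 2: exhaustively pick the stump with the lowest weighted error."""
    labels = sorted(set(y))
    best_err, best_stump = None, None
    for feature in range(len(X[0])):
        for threshold in sorted({x[feature] for x in X}):
            for left_label in labels:
                for right_label in labels:
                    stump = (feature, threshold, left_label, right_label)
                    err = sum(wi for x, yi, wi in zip(X, y, w)
                              if stump_predict(stump, x) != yi)
                    if best_err is None or err < best_err:
                        best_err, best_stump = err, stump
    return best_stump

def adaboost_fit(X, y, n_learners):
    n = len(X)
    w = [1.0 / n] * n                        # step 1: uniform sample weights
    ensemble = []
    for _ in range(n_learners):
        stump = train_stump(X, y, w)         # step 2: train on weighted data
        wrong = [stump_predict(stump, x) != yi for x, yi in zip(X, y)]
        # Step 3 (Eq. 19): weighted fraction of misclassified samples.
        eps = sum(wi for wi, m in zip(w, wrong) if m) / sum(w)
        if eps >= 0.5:                       # assumption: stop if the learner is too weak
            break
        eps = max(eps, 1e-12)                # guard against division by zero
        alpha = math.log((1.0 - eps) / eps)  # step 4 (Eq. 20)
        ensemble.append((alpha, stump))
        # Step 5 (assumed AdaBoost.M1-style): shrink weights of correct samples.
        beta = eps / (1.0 - eps)
        w = [wi if m else wi * beta for wi, m in zip(w, wrong)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, x):
    """Step 6 (Eq. 21): weighted vote over the learners' predicted labels."""
    votes = {}
    for alpha, stump in ensemble:
        label = stump_predict(stump, x)
        votes[label] = votes.get(label, 0.0) + alpha
    return max(votes, key=votes.get)

# Toy 1-D problem that no single stump can solve (labels: + + - - + +).
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost_fit(X, y, n_learners=3)
predictions = [adaboost_predict(model, x) for x in X]
```

On this toy dataset, each stump alone misclassifies at least two samples, whereas the weighted vote of three stumps reproduces all six training labels. This illustrates how the sequential reweighting steers later learners towards the samples that earlier ones got wrong.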

2.5 Deep learning algorithms

When used for predictive modelling, machine learning revolves around modelling the statistical correlation between variables with respect to the target variable to be predicted. In problems dealing with spatial and/or temporal data (such as image classification or time series forecasting), such a correlation emerges from the relationship among data points over such domains. As a result, machine learning models can be either used in their seminal form to tackle spatiotemporal modelling tasks (by, e.g., extracting tabular features from data) or, instead, specialised into archetypes capable of supporting the modelling requirements stemming from such tasks (invariance to spatial transformations of the input or the characterization of long-term correlations over sequential data). Furthermore, continued advances in massively parallel computing and the explosion of non-relational databases containing information of assorted nature (e.g., image, video, audio, text) have spurred research efforts towards the derivation of neural network models of ever-growing modelling complexity, capable of efficiently discovering relevant predictors from highly dimensional data and endowing mechanisms to meet the requirements mentioned previously. Advances over the past 2 decades have blossomed into what is now known as deep learning (LeCun et al. 2015), which crystallizes in two main neural architectures: convolutional neural networks (CNNs (O’Shea and Nash 2015)) and recurrent neural networks (RNNs (Sherstinsky 2020)). Figure 9 illustrates two typical applications of these deep learning architectures in the context of EEs.

Fig. 9 Examples of typical use cases related to extreme atmospheric events that can be tackled by deep learning models: a CNNs; b RNNs

When the correlation lies in the spatial domain, any model should be made invariant with respect to transformations of the input data that should not affect the prediction. This is the case of translational invariance in image classification, by which visual features relevant to the target to be predicted should retain their predictive importance no matter where they are located in the image. The way the human visual cortex operates to satisfy this requisite was the inspiration behind the design of CNNs, which, in their seminal form, comprise a series of hierarchically arranged neural processing layers. Layers closer to the input contain several convolutional neurons (also referred to as convolutional filters or kernels), which extract features from the input data by performing a convolution between the data themselves and the weights at their core. A CNN for complex modelling tasks may stack several convolutional layers, one after another, so that each layer processes through its filters the output produced by the preceding layer. Some further processing layers can be placed in between convolutional ones, such as pooling layers, which serve to create information bottlenecks that help distil more high-level information while drastically reducing the number of parameters. After the convolutional part of the network, additional layers may be added depending on the application. For instance, in image classification a fully connected multi-layer perceptron is often attached to the end of a CNN to map its output to the target variables to be predicted. Analogously to MLPs, the trainable parameters (weights and biases) of the CNN, including the weights of the convolutional kernels, can be learned by backpropagating error gradients through the network. Since gradients can also be computed for these special neural processing units, their weight values can be adjusted by means of different stochastic gradient descent solvers.
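The convolution and pooling operations described above can be illustrated with a small, dependency-free Python sketch. It is only meant to make the idea of translational invariance concrete. The kernel is hand-set rather than learned, the 8×8 test images are made up, and the global max over the pooled map is used here as a stand-in for the layers that would follow in a full CNN.

```python
def conv2d(image, kernel):
    """'Valid' 2-D cross-correlation, the operation applied by a convolutional filter."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def relu(fmap):
    """Element-wise rectifier: negative filter responses are set to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling; incomplete border windows are dropped."""
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]

def global_max(fmap):
    """Collapse a feature map to its strongest response."""
    return max(max(row) for row in fmap)

def place_bar(height, width, top, left):
    """Blank image with a 3-pixel vertical bar whose top pixel is at (top, left)."""
    img = [[0.0] * width for _ in range(height)]
    for i in range(3):
        img[top + i][left] = 1.0
    return img

# Hypothetical hand-set kernel that responds to a dark-to-bright vertical edge.
vertical_edge = [[-1.0, 1.0],
                 [-1.0, 1.0]]

img_a = place_bar(8, 8, top=1, left=2)   # same pattern ...
img_b = place_bar(8, 8, top=4, left=5)   # ... at a different location

feat_a = max_pool(relu(conv2d(img_a, vertical_edge)))
feat_b = max_pool(relu(conv2d(img_b, vertical_edge)))
score_a = global_max(feat_a)
score_b = global_max(feat_b)
```

The two pooled feature maps differ, because the filter response moves with the bar, but the strongest response is the same in both cases. The feature is therefore detected regardless of where it appears in the image.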

Beyond their benefits in terms of spatial invariance, learnable convolutional layers in CNNs provide several other advantages. First, the fact that gradients can be propagated allows for a massively parallel iterative update of their weights and biases, paving the way for implementations deployable on graphical processing units (GPU) and tensor processing units (TPU). Another advantage of CNNs is the hierarchy of visual features learned by the network, which becomes progressively more specialized for the task at hand as more convolutional layers are stacked on top of each other. This offers a more structured interpretability of the knowledge captured by the layers, which can be disentangled by using deconvolutional filters or local explainability techniques (Zhang and Zhu 2018). But perhaps most interestingly, coarse visual features modeled in the first convolutional layers (edges, primitive shapes, etc.) learned on one task can be useful for others. Such tasks could leverage this general-purpose learned knowledge by importing pretrained weights and biases of such layers into their CNN architectures, so that the requirements in terms of learnable parameters or annotated data can be reduced. This simple yet effective knowledge exchange mechanism is referred to as transfer learning (Zhuang et al. 2020; Weiss et al. 2016) and has helped the adoption of CNNs in environments with scarcely annotated data or limited computational resources.

Sophisticated CNN architectures nowadays constitute the state-of-the-art for image and video classification modelling tasks, incorporating new ideas that boost even further their performance and/or efficiency. This is the case of capsule networks (Hinton et al. 2011), attention mechanisms (Vaswani et al. 2017), or patch-based learning in visual transformers (Han et al. 2020). When it comes to efficiency, the inner working of spiking neural networks (Grüning and Bohte 2014) has been investigated to alleviate the consumption of computing resources of these models. It is worth noting that the number of trainable parameters in CNNs may amount up to several tens of millions in very deep models, leading to problematically long training times, large storage requirements, and energy consumption footprints (Anthony et al. 2020). Finally, an important area of research is on the development of interpretability techniques for CNNs, which aim to dissect the knowledge captured by the layers of an already trained CNN (Arrieta et al. 2020). The result of this dissection, which can take many forms (e.g., attribution maps, counterfactual explanations, or simplified rule sets) is offered as an interpretable interface for the user to understand how and why the CNN provides its output. We will later elaborate on the plethora of possibilities of explanation techniques for CNNs used in EEs modelling and characterization tasks.

Different from CNNs, RNNs are built for modelling relationships in sequential data, including text and time series. Modelling such correlations requires that the network be capable of storing and maintaining information (memory) across its neural processing steps, so that long-term relationships over the sequence can be exploited effectively when solving modelling tasks. In RNNs, this is accomplished by formulating a recurrent form of a neural processing unit, in which part of the output of the neuron is fed back to its input to realize a sort of neural memory. This recurrent formulation endows the neuron with the possibility to learn and store information about the past that is relevant to the problem under consideration. For instance, this property of RNNs is key in time series forecasting, where the future time steps to be predicted can be affected by data occurring far back in time. When RNNs are used for this task, the memory conferred to the neurons makes it possible to model correlations over the sequence at different time scales. As with the convolutional filters in a CNN, the parameters controlling how much of the output of a neuron is fed back to its input or stored in the hidden state vector can be learned via gradient backpropagation. The history of RNNs dates back to the work by Jordan (1997) and Elman (1990). Thereafter, the well-known long short-term memory networks (LSTM (Hochreiter and Schmidhuber 1997)) and the more recently proposed gated recurrent units (GRU (Cho et al. 2014)) became the standard in recurrent neural computation. LSTMs rely on several trainable parameters (gates) to control which parts of the sequence flow into the neuron by releasing or retaining information inside the hidden state vectors of neurons. GRU networks can be regarded as a variant of LSTMs featuring small architectural modifications that reduce the number of trainable parameters. 
In both cases, recurrent neural processing units can be arranged in a hierarchical structure comprising several stacked layers, in such a way that correlations are captured at different scales and levels of granularity. Several RNN approaches have been proposed in the literature over the years to overcome the drawbacks of the training process of these models. Attention mechanisms, for instance (also applied in other types of deep networks such as CNNs), make networks focus on certain parts of the input when predicting the output, discarding information that is not relevant for that specific input. Similarly, bidirectional RNNs aim at considering future steps of the sequence in the output of the neuron (Schuster and Paliwal 1997). Recurrent networks that do not hinge on gradient backpropagation have also been developed in recent years, with reservoir computing and particularly echo state networks (Lukoševičius and Jaeger 2009; Gallicchio and Micheli 2017) being at the frontline. Finally, recent studies have emphasized that specialized CNNs for sequence modelling such as Temporal Convolutional Networks (TCN (Lea et al. 2017)) demonstrate longer and more effectively trained memory capabilities over diverse tasks and datasets, showcasing the potential of convolutional architectures to also address problems over sequential data.
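The feedback mechanism that gives a recurrent neuron its memory can be made concrete with a minimal Python sketch of a single Elman-style unit, \(h_t = \tanh (w_{in} x_t + w_{rec} h_{t-1} + b)\). The weights below are hand-set, hypothetical values chosen only for illustration. In practice they would be learned by backpropagation through time, and LSTM or GRU cells add gates on top of this basic recurrence.

```python
import math

def rnn_forward(inputs, w_in, w_rec, bias=0.0, h0=0.0):
    """Run a single-unit Elman-style recurrent neuron over a sequence.

    At each step the previous output h is fed back through w_rec:
        h_t = tanh(w_in * x_t + w_rec * h_{t-1} + bias)
    """
    h, states = h0, []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h + bias)
        states.append(h)
    return states

# A single impulse at t = 0 followed by silence.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]

# With feedback, the neuron keeps a decaying trace of the impulse.
with_memory = rnn_forward(impulse, w_in=1.0, w_rec=0.9)

# Without feedback (w_rec = 0), the neuron forgets it immediately.
without_memory = rnn_forward(impulse, w_in=1.0, w_rec=0.0)
```

With `w_rec = 0.9` the hidden state stays clearly above zero for all five steps and decays gradually, whereas with `w_rec = 0.0` it drops to zero as soon as the input vanishes. The recurrent weight thus controls how long past inputs continue to influence the output.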

3 Review of existing literature

This section critically analyzes and discusses the existing literature related to ML in atmospheric EEs. The methodology applied has been the following: we perform a large number of search queries in well-known scientific publication databases, including Google Scholar, Scopus, and Web of Science. We systematically introduce a specific set of query strings in order to discover published works related to ML in atmospheric EE. We have used the term ML together with extreme atmospheric events, plus extreme rainfall, flood prediction, heatwaves prediction, extreme temperature prediction, droughts prediction, convective systems, tropical cyclones prediction, hail and hailstorms, extreme wind gusts, or low-visibility prediction, among many other terms linked to atmospheric EE. Once all results were retrieved from the aforementioned databases, we removed duplicates and performed an exhaustive analysis and discussion on a paper-by-paper basis, towards ascertaining their alignment with the topic under study. This systematic review process gave rise to the review and analysis that we present in the subsequent sections.

Figure 10 summarizes the hierarchical categorization of the state-of-the-art methods for atmospheric EE problems. We classify the works first according to the atmospheric event they predict, and then according to the type of ML method they involve. Some works are included in several boxes since they apply several ML methods to EE prediction problems.

Fig. 10 Summary of recent works dealing with ML, DL, and related techniques in atmospheric extreme event problems

3.1 Extreme rainfall and floods

Destructive extreme precipitation events and flooding episodes are a real threat to human settlements in different parts of the world (Madsen et al. 2014; Berghuijs et al. 2017). Extensive research on the monitoring, prediction and analysis of these events has been carried out in the literature. We analyze here those works dealing with ML techniques. Note that a first review on ML for flood prediction, covering the state of the art up to 2018, can be found in Mosavi et al. (2018). In Moon et al. (2019), an ML-based early warning system for short-term heavy rainfall is proposed for Korea. The system is formulated as a binary classification problem, in which a logistic regression is applied to predictive variables from meteorological data obtained from automatic weather stations and previously preprocessed with a principal component analysis algorithm. A comparison against early warning systems formed by alternative classifiers is carried out. A large number of meteorological variables measured at different locations feed the classifiers in real time, in order to improve the performance of the classification output. In Diez-Sierra and del Jesus (2020), a number of ML methods (SVM, k-nearest neighbours, RF, k-means clustering and neural networks) are applied to a problem of long-term rainfall prediction, using the atmospheric synoptic patterns as predictive variables. Neural networks are reported as the most accurate method but, surprisingly, the work reports the generalized linear model with gamma-distributed errors as the best method to predict the extremes of the series, improving on the performance of the ML approaches. Note that supervised and unsupervised (k-means) methods are tested together, and depending on the method, a classification or regression problem is considered, which is an unusual procedure in the application of ML techniques. 
Results are discussed using rain gauge measurements from Tenerife (Canary Islands, Spain) as ground truth. In Schlef et al. (2019), a self-organizing map is used to obtain clusters of synoptic situations leading to extreme floods across the USA. Then the flood characteristics of each synoptic situation are analyzed, identifying four primary categories of circulation patterns with different potential flood hazard. This methodology also allows identifying regions where extreme floods occur outside the normal flood season, and other regions where multiple extreme flood events occur within a single year, mainly due to tropical cyclones.

In Nayak and Ghosh (2013), a support vector machine is applied to short-term prediction of extreme precipitation in Mumbai, India. The prediction time horizon has been set in this case between 6 and 48 h. The predictive variables consist of mesoscale and synoptic scale weather patterns. The work identifies specific weather patterns for extreme precipitation events, finding out that they are different for nighttime precipitation or daytime extreme precipitation events. The SVM is then used to obtain extreme rainfall classification and prediction.

In Vandal et al. (2019), a problem of statistical downscaling of extreme precipitation from GCMs is tackled with ML algorithms. Five ML methods are compared in this task: ordinary least squares, elastic-net, support vector machine, multi-task sparse structure learning (MSSL) and autoencoder neural networks. Experiments with data from the Northeastern United States suggest that the direct application of ML techniques does not improve on the results of simpler statistical methods in the downscaling of extreme precipitation events.

In Grazzini et al. (2020), the classification of extreme precipitation events in northern-central Italy is carried out by means of K-means clustering and an RF algorithm. The study reports the importance of the integrated water vapour transport variable for the correct detection of extreme precipitation events in this region. This work has been complemented with a second study for the same zone, where the authors investigate the connection between precipitation extremes and Rossby wave packets (Grazzini et al. 2021). In Jahangir et al. (2019), an ANN algorithm is applied to the prediction of discharge values and the spatial modelling of floods in the Kan River Basin, Iran. Similarly, in Yeditha et al. (2020), different ML models (mainly neural networks), preceded by a wavelet-based data treatment, are applied to forecast extreme precipitation from satellite measurements. The proposed approach has been tested on the prediction of floods in the Vamsadhara river basin, India.

In Hosseini et al. (2020), a problem of flash flood forecasting with ML algorithms is tackled. The paper analyzes an ensemble of boosted generalized linear models, random forest, and Bayesian generalized linear model algorithms. A pre-processing step for reducing the number of input variables with a Simulated Annealing algorithm is carried out. These approaches are tested on the prediction of flash floods in the North of Iran. In Hu and Ayyub (2019), a Gradient Boosting Tree algorithm is applied to perform projections of precipitation intensity over short-duration events, using outputs from GCMs. The algorithm performance has been tested on 25 years of observational data across the USA. In Bui et al. (2019), an approach for flash flood susceptibility modelling is proposed. The approach combines tree-based ensembles (LogitBoost, Bagging and AdaBoost algorithms) with a pre-processing feature selection step based on a fuzzy-rule method and a Genetic Algorithm. The performance of the systems was tested on data from Lao Cai Province (Northeast Vietnam). In Choi et al. (2018), different ML classification techniques such as decision trees, bagging, RF or boosting have been applied to the prediction of heavy rain damages in Seoul (South Korea). The work uses data on the occurrence of heavy rain damages in the city from 1994 to 2015, obtaining accurate results especially with the boosting technique. In Yang et al. (2023), an RF approach was applied to a problem of monthly extreme precipitation prediction from meteorological variables in Southern China. Data from 99 measuring stations near the Yangtze River are considered in this problem. The intrinsic RF feature importance is used to describe the physical mechanisms of extreme precipitation. In Pirone et al. (2023), a short-term precipitation prediction based on ML algorithms (ANNs) is proposed. 
The model uses cumulative rainfall fields from different stations in Italy as inputs for the neural network, with the aim of predicting rainfall intervals and the corresponding probability of occurrence. In Lin et al. (2023), an ensemble method based on RF, eXtreme Gradient Boosting (XGB) and ANNs is proposed to identify the key variables contributing to monthly extreme precipitation intensity and frequency in six different regions over the United States. In Vitanza et al. (2023), the Affinity Propagation algorithm, an ML-based clustering algorithm, was applied to the analysis of extreme rainfall areas in Sicily, Italy. This approach does not require the number of clusters to be determined or estimated before running the algorithm, and it works based on the concept of “message passing” between data points. In this case, it was applied to a large, high-frequency dataset collected in the zone of study from 2009 to 2021, confirming the presence of recent anomalous rainfall events in eastern Sicily.

DL-based approaches have recently been applied to flood prediction, and they are expected to become predominant in the years to come. In Shi (2020), convolutional neural networks (CNN) are used to carry out a smart dynamical downscaling of extreme convective precipitation from Global Climate Models (GCM). This work shows that when trained with data for three subtropical/tropical regions, CNNs are able to retain between 92 and 98% of extreme precipitation events. In Moishin et al. (2021), a CNN combined with an LSTM network has been introduced to forecast the future occurrence of flood events. The performance of this deep learning approach has been tested on nine different rainfall datasets of floods that occurred in Fiji. In Xie et al. (2021), a problem of short-term intensive rainfall prediction was tackled with deep learning approaches. ECMWF forecast data and ground observation station data were taken into account, and K-means, generative adversarial nets and deep belief networks were applied to obtain the prediction as a classification model. Experiments on data from the Fujian Province (southeastern China) in the period 2015–2018 showed a good performance of the proposed prediction approaches, improving on the results of LSTM and Stacked Sparse AE networks. In Manna and Anitha (2023), the integration of Rough Set on Fuzzy Approximation Space (RSFAS) with a deep learning (DL) technique is proposed for a problem of precipitation level prediction in India. The idea is that RSFAS handles the uncertainty of the prediction, and the DL technique (an LSTM network) solves the associated classification and prediction problem. In Badrinath et al. (2023), a CNN is proposed to capture complex spatial patterns of precipitation, trying to identify and reduce biases affecting predictions of the dynamical model. The method is specifically based on a modified U-Net CNN, used to postprocess daily accumulated precipitation over the United States West Coast. In Folino et al. 
(2023), an ensemble of deep neural networks is proposed for a problem of precipitation prediction in Italy, using heterogeneous data sources such as rain gauge measurements, radar and geostationary satellites. In Choudhary and Ghosh (2023), different types of DL networks such as RNN and LSTM have been applied to model monthly rainfall intensity and other climatic variables, such as temperature, in Jodhpur, India. The study shows that the LSTM obtains the best prediction results in this particular problem. In Chen et al. (2023), a DL model called weighted U-Net (WU-Net) is proposed for the problem of extreme precipitation prediction in China. This approach incorporates sample weights from different precipitation events to improve the forecasts of other intensive precipitation events over China. In Barnes et al. (2023), an approach combining ECMWF SEAS5 seasonal forecasts with CNNs is proposed to improve the forecasting of total monthly regional rainfall across Great Britain. An explainable analysis of the synoptic situations leading to specific CNN results is carried out.

Finally, in close connection with ML approaches, Complex Networks (CN) have also been used to analyze problems of extreme precipitation. In Boers et al. (2019), the teleconnections of extreme events over the world are studied, using the CN paradigm over high-resolution satellite data. The CN methodology confirms Rossby waves as the physical mechanism behind global teleconnection patterns in extreme precipitation events.

3.1.1 Analysis

As a final note on the application of ML models to EEs related to rainfall and floods, we have found ML approaches in very different applications, including short-term and long-term detection and prediction problems, tackled with different ML frameworks (classification and regression) and considering very different prediction (or detection) time horizons. Also remarkable are the different ways in which many of these approaches introduce the physics of the problem. In some cases, mainly in short-term prediction problems, the reviewed works feed ML algorithms with real-time meteorological variables, as in Moon et al. (2019). In other cases, the ML models extract information from synoptic patterns, mainly in problems of long-term rainfall and flood prediction (Diez-Sierra and del Jesus 2020; Schlef et al. 2019). In other cases, the outputs of GCMs are processed with ML approaches in order to improve the prediction of heavy precipitation events (Shi 2020; Vandal et al. 2019; Hu and Ayyub 2019). Other ML approaches rely on specific variables from reanalysis data but include variables with physical meaning, such as sensitivity to flow conditions and other indicators of thermodynamic conditions for modelling extreme precipitation events (Grazzini et al. 2020). A final group of reviewed works relies only on measurements or datasets, without any specific consideration of the physics of the problem, especially when DL has been applied (Moishin et al. 2021; Xie et al. 2021), but also with shallow ML approaches (Choi et al. 2018). In these last cases, the works analyzed seem to focus on the ability of ML approaches to extract information and obtain accurate predictions, evaluated with different metrics and compared against other ML approaches, with very few references to the physical processes causing the EE. 
In recent years, the number of DL-based approaches has increased considerably, and it is expected that in the near future DL techniques will dominate the research on extreme precipitation prediction (Chase et al. 2023). Finally, the work in Boers et al. (2019) analyzes extreme precipitation events from the CN paradigm, generating networks which take into account the physics of the problem and the relationships among the different variables involved, including the analysis of teleconnections. This introduces a novel paradigm in the study and analysis of extreme precipitation, which may be hybridized with ML techniques in the near future.

3.2 Heatwaves and extreme temperatures

Extreme temperatures (Barriopedro et al. 2011; Pfleiderer and Coumou 2018), heatwaves (Chapman et al. 2019; Barriopedro et al. 2023) and, in the last decades, mega-heatwaves (Bador et al. 2017; Sánchez-Benítez et al. 2018) are among the extreme atmospheric events potentially most dangerous for people, especially the elderly (Díaz et al. 2002a, b) and with deep societal impact. The detection, prediction and attribution of heatwaves and extreme temperatures is, therefore, a hot topic in atmospheric EEs research (Wang et al. 2017), including the study of natural causes such as circulation patterns (Shi et al. 2018) or anthropogenic contribution (Zwiers et al. 2011). ML methods have been applied to study these and other aspects of extreme temperatures and heatwaves (Cifuentes et al. 2020).

3.2.1 Heatwaves

In Pasini et al. (2017), neural computation is used in a problem of attribution of heatwaves. The study considers the last 160 years, where the attribution to anthropogenic forcings is obtained for the last 50 years, whereas in the period 1910-1975 the main driver is solar irradiation. The study also clarifies the role of aerosols and the Atlantic Multidecadal Oscillation in decadal temperature variability.

In Park and Kim (2018), multivariate adaptive regression splines are used to set appropriate heatwave thresholds, in order to improve early warning systems for these events. The work uses daily data of emergency patients diagnosed with heatstroke and also information on 19 meteorological variables obtained for the years 2011 to 2016. The results obtained show that the combination of heat illness data and average daytime temperature (from noon to 6 PM) can be used as an alternative threshold for heatwave characterization. In Chattopadhyay et al. (2020), a hybrid approach combining the Analog prediction method (search of analogue synoptic situations in the past) with deep neural networks (capsule neural networks, CapsNets) is proposed to predict heatwaves and cold spells. The proposed CapsNets outperformed other deep approaches such as CNN and alternative prediction algorithms such as logistic regression techniques. Finally, in a recent work (Weirich-Benet et al. 2023) the performance of linear regressors and RF algorithms in a problem of subseasonal heatwave prediction is discussed. Different inputs (drivers) are previously chosen by using a correlation-based analysis.

3.2.2 Extreme temperatures

One of the first applications of ML techniques to extreme temperature prediction was Abdel-Aal and Elhadidy (1995), where different artificial neural network models are applied to a problem of daily maximum temperature prediction in Dhahran, Saudi Arabia. In this case, daily data for 18 weather parameters are considered as input variables to predict the maximum temperature on a given day, with prediction time horizons of up to 3 days in advance. In Paniagua-Tineo et al. (2011), an SVR algorithm is used to forecast daily maximum air temperature with a 24 h prediction time horizon. The prediction system relies on a number of input variables such as air temperature, precipitation, relative humidity and air pressure. It also considers the synoptic situation of the day in order to improve its results. The performance of the SVR algorithm has been successfully evaluated with data from a number of European measurement stations. In De and Debnath (2009), the prediction of the maximum (and minimum) air temperature in the summer monsoon season is carried out by using a multi-layer perceptron (MLP) neural network. The mean temperature of previous months in the period of analysis is considered as input for the system. Data from the Indian Institute of Tropical Meteorology belonging to the years 1901–2003 are considered. In Chithra et al. (2015), neural networks are applied to a problem of monthly mean maximum and minimum temperature prediction in the Chaliyar river basin, India. The objective is to evaluate the impact of climate change on the accuracy of the predictions obtained by neural networks. In Ahmed et al. (2020), different ML approaches, such as MLP, SVM, relevance vector machine (RVM) and K-nearest neighbour (KNN), are proposed to develop multi-model ensembles from global climate models. The objective is to obtain annual predictions of monsoon and winter precipitation, maximum temperature and minimum temperature over Pakistan. 
The results show that KNN- and RVM-based multi-model ensembles have better skill than those developed with MLP and SVM. In Peng et al. (2020), an MLP and a natural gradient boosting algorithm (NGBoost) are applied to improve the prediction skill of the 2-m maximum air temperature, with lead times from 1 to 35 days. The ML prediction approaches have shown better results than the ensemble model output statistics (EMOS) method (which was selected as the benchmark for comparison) in 90% of the cases analyzed. In Oettli et al. (2022), a number of ML algorithms such as neural networks, SVMs, RF, gradient boosting and regression trees have been applied to the prediction of surface air temperature two months in advance, with input data two months in advance from SINTEX-F2, a dynamical prediction system. The dynamical prediction system includes the physics of the problem, while the ML algorithms improve the results by means of statistical downscaling. The performance of these approaches has been tested in Tokyo (Japan), obtaining excellent prediction results. In Gómez-Orellana et al. (2023), a problem of long-term air temperature prediction with eXplainable Artificial Intelligence (XAI) algorithms is tackled. Specifically, artificial neural networks trained with evolutionary algorithms are tested on this problem. This XAI model architecture has been applied to long-term air temperature prediction at different sub-regions of the south of the Iberian Peninsula, with good performance results.

Very recently, DL approaches have been applied to long-term extreme temperature prediction problems, such as in Nandi et al. (2022), where an approach called Attention-based Long-term Temperature Forecasting Network is proposed. This approach uses an Encoder-Decoder system similar to that shown in Sect. 2.1.1. The Encoder encodes the relative dependencies of the auto-regressive time series into an attention tensor (dimensionality reduction) which is used by the Decoder to produce the prediction. The Encoder is augmented to incorporate a convolution block to recognize the seasonal patterns associated with extreme temperatures. The model was evaluated in real data from five different cities around the world. In Fister et al. (2023), different DL algorithms have been tested in a problem of extreme air temperature forecasting. Different DL prediction approaches have been tested, including a Convolutional Neural Network (CNN) with video-to-image translation, several ML approaches including Lasso regression, Decision Trees and Random Forest, and finally a CNN with pre-processing step using Recurrence Plots, which convert time series into images. Good prediction skills have been obtained for two cases of extreme temperature in Paris and Córdoba, Spain.
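The conversion of a time series into an image by means of a recurrence plot, mentioned above as a pre-processing step for a CNN, can be sketched as follows. This is a simplified version under stated assumptions: it compares scalar values directly (no phase-space embedding), it uses a fixed threshold `eps`, and the function name and the temperature values are hypothetical; it is not the implementation of Fister et al. (2023).

```python
def recurrence_plot(series, eps):
    """Binary recurrence matrix: entry (i, j) is 1 when the values at
    times i and j differ by at most eps, and 0 otherwise."""
    n = len(series)
    return [
        [1 if abs(series[i] - series[j]) <= eps else 0 for j in range(n)]
        for i in range(n)
    ]

# Toy daily maximum temperatures in degrees Celsius (hypothetical values).
temps = [20.0, 21.0, 25.0, 20.5]
rp = recurrence_plot(temps, eps=1.0)
for row in rp:
    print(row)
```

The resulting matrix is symmetric with ones on the diagonal, and it can be passed to an image model as a single-channel picture of the temporal structure of the series.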

3.2.3 Analysis

The works reviewed in this subsection reveal that few studies deal with heatwave prediction using ML approaches; only a few specific works on the application of ML techniques to heatwave estimation have been found in the recent literature. In Park and Kim (2018), data from meteorological variables and emergency patients are used to obtain a characterization of heatwaves. A second approach addresses heatwave prediction with ML (Chattopadhyay et al. 2020). Here, ML algorithms (DL networks in this case) are merged with the analog method, which introduces the physics of the problem, in order to predict heatwaves. A recent paper (Weirich-Benet et al. 2023) discusses how linear regression and RF can be successfully used in a problem of heatwave prediction.

There are many more works on ML algorithms for extreme temperature prediction problems. Artificial neural networks and statistical ML approaches are the main algorithms applied in the literature to tackle these problems. It is interesting to see that in these works the inclusion of physics is not as relevant as in the works dealing with ML algorithms for rainfall and flood prediction. The reason is that air temperature is in general an easier variable to predict than rainfall, for which the inclusion of the atmospheric state and dynamics is key to obtaining good results. Synoptic situations (considered in Paniagua-Tineo et al. (2011)) seem to improve the results of ML algorithms in the prediction of extreme temperatures. In the rest of the articles reviewed, the prediction is based on existing records of previous temperatures. The application of ML approaches produces good results in this case in weekly or monthly temperature predictions, where the variation of the extreme temperatures is small.

3.3 Droughts

Droughts are extreme events, stochastic in nature, with a deep impact on society, specifically on water supplies, agriculture, and hydroelectric power production, and associated with forest fires and even forced migrations (Spinoni et al. 2019; García-Herrera et al. 2019). Drought early warning systems provide important information about predicted drought hazards. In many cases, these systems rely on ML and DL algorithms.

In Sutanto et al. (2019), an RF algorithm is used to forecast drought impacts, by relating forecasted hydro-meteorological drought indices to previously reported drought impacts. The proposed ML-based model is able to forecast drought impacts with prediction time horizons of some months ahead. In Khan et al. (2020), different ML classification techniques are applied to develop drought prediction models over Pakistan. They include SVM, MLP and KNN algorithms. Meteorological variables from reanalysis are considered as inputs, whereas the objective variable considers three categories of droughts: moderate, severe, and extreme in different cropping seasons. These classes were estimated using the Standardized Precipitation Evapotranspiration Index (SPEI; Vicente-Serrano et al. 2010), in order to train and test the proposed ML classifiers. In Rhee and Im (2017), a problem of high-resolution spatial drought forecasting is tackled in Korea from remote sensing and climate index inputs. The performance of different regression tree algorithms, RF and extremely randomized trees is compared. In Park et al. (2016), different ML algorithms such as RF, boosted regression trees, and Cubist are applied to model meteorological and agricultural droughts from 16 input drought factors obtained from satellite measurements. The SPI and crop data are used as objective variables to model the droughts. RF has been reported as the best-performing algorithm on data from arid zones of the United States. In Rahmati et al. (2020), drought hazard is tackled with different ML models: classification and regression trees (CART), boosted regression trees (BRT), RF, multivariate adaptive regression splines (MARS), flexible discriminant analysis (FDA) and SVM. Several hydro-environmental datasets are used to calculate the relative departure of soil moisture (RDSM), and this index is used as the objective variable, whereas the inputs are eight environmental factors considered as potential predictors of drought.
Experiments in the southeast part of Queensland, Australia, are carried out to evaluate the performance of the different ML methods proposed. In Feng et al. (2019), three ML algorithms (RF, SVM and MLPs) are used to evaluate whether remotely-sensed drought factors (satellite measurements) are good estimators for drought event prediction in south-eastern Australia. RF is again the ML regression technique that obtains the best results in this problem, outperforming SVM and MLPs. In Belayneh and Adamowski (2013), short-term drought prediction in the Awash River Basin (Ethiopia) is considered, by means of SPI prediction. Three ML methods are evaluated for this problem: MLP, SVM and MLP with a previous step of wavelet signal decomposition. The coupled wavelet-MLP algorithm showed the best results in SPI prediction with prediction time horizons of 1 month and 3 months. New results and further analysis on the same problem were reported in Belayneh et al. (2016). In Belayneh et al. (2014), a long-term drought prediction problem in the Awash River Basin is considered by means of MLPs and SVMs, enhanced with wavelet transforms. The SPI at 12 and 24 months (SPI 12 and SPI 24) is predicted by means of the ML methods. Comparison with ARIMA methods for time series prediction shows a better performance of the ML techniques. The same data from the Awash River Basin are used in Belayneh et al. (2016) to test advanced versions of ML algorithms in the same problem of drought prediction. Coupled versions of ML algorithms with wavelet transforms are considered, such as wavelet transforms with bootstrap and boosting ensembles together with MLP and SVR models. These coupled models show a better performance than the MLP and SVR algorithms on their own. In Roodposhti et al. (2017), a problem of drought sensitivity mapping based on the SPI and the enhanced vegetation index (EVI) is tackled, by using one-class SVMs.
Data from both synoptic stations and satellite data are combined in this study in the Iranian province of Kermanshah. In Piri et al. (2023), different ML approaches based on ANNs and SVRs with evolutionary-based feature selection mechanisms are proposed to predict different meteorological drought indices for different measurement stations in Iran. In Mokhtari and Akhoondzadeh (2021), ML algorithms such as ANN, SVR, DT or RF are applied to a problem of drought prediction for monthly periods, using inputs derived from the active and passive sensors of different satellite sensors. In Deo and Şahin (2015), the performance of the ELM algorithm is evaluated in a problem of Effective Drought Index prediction in eastern Australia. Predictive variables composed of meteorological variables and climate indices are considered. The ELM approach outperformed the results of different neural network models. In Aghelpour et al. (2020), different ML approaches are evaluated in the problem of forecasting the precipitation joint deficit index (JDI) and the multivariate standardized precipitation index (MSPI), both of them related to severe droughts. Different ML methods are considered, such as group method of data handling (GMDH), generalized regression neural network (GRNN), least squared support vector machine (LSSVM), adaptive neuro-fuzzy inference system (ANFIS) and ANFIS optimized with meta-heuristics algorithms. Experiments in data from 10 measuring stations in Iran are considered. The GMDH method is reported as the most accurate algorithm. In Zhang et al. (2019), artificial neural networks and XGB algorithms with feature selection by means of a cross-correlation function and a distributed lag nonlinear model (DLNM) are considered in a problem of drought prediction. Data from 32 stations from 1961 to 2016 in Shaanxi Province, China, are used. 
The results show that the XGB approach outperforms neural networks and that the DLNM works better than the cross-correlation function in the selection of the best features for this prediction problem. In Dikshit et al. (2020), MLP and SVR algorithms are tested in a problem of drought prediction in New South Wales, Australia. The SPEI at 1, 3, 6, and 12 months is used as the objective variable. The results obtained suggest that the MLP outperforms the SVR. The results also rule out a real impact of sea temperature and climate indices on the droughts in New South Wales. In Richman and Leslie (2018), a feature selection problem (FSP) is considered for attribution of the 2015–2017 Cape Town drought with ML algorithms. Wrapper algorithms for the FSP are considered, in which an SVM is used as the classification algorithm and different evolutionary algorithms search for the best set of features (drought drivers) for predicting the cool season precipitation in the years of the drought. In Pande et al. (2023), different SVM versions were tested in a problem of drought prediction in the upper Godavari River basin, India. The SPI was used as the objective variable to predict future droughts in the zone. In Li et al. (2021), the role of the antecedent SST fluctuation pattern (ASFP) as a drought driver is analyzed by using ML techniques such as SVR, RF and ELM. The SPEI is used as the objective variable to be predicted at different river basins such as those of the Colorado, Danube, Orange, and Pearl Rivers. The results showed that the ASFP-ELM model can effectively predict the space-time evolution of drought events, outperforming the rest of the ML algorithms considered. In Prodhan et al. (2022), RF and gradient boosting machine algorithms are applied to characterize future drought metrics and their impact on crops. The magnitude, intensity, and duration of future droughts are characterized by means of the SPEI drought index using data from CMIP6 (Coupled Model Intercomparison Project Phase 6) climate models.
Experimental results on Southern Asia, including countries such as Afghanistan, Pakistan, and India are analyzed.
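Several of the works above couple a wavelet decomposition of the drought index series with an ML regressor (for example, the wavelet-MLP models for SPI prediction). A minimal sketch of the decomposition step is given below. It assumes a one-level Haar transform, which is chosen here only because it is the simplest wavelet; the reviewed papers may use other wavelet families and decomposition levels, and the function names and SPI values are hypothetical.

```python
import math

def haar_step(signal):
    """One level of the Haar discrete wavelet transform.

    Returns (approximation, detail) coefficients; the input length must be even.
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Rebuild the original signal from one level of Haar coefficients."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out

# Toy monthly SPI values (hypothetical).
spi = [-1.2, -0.8, 0.4, 0.6]
approx, detail = haar_step(spi)
reconstructed = haar_inverse(approx, detail)
```

In a coupled model, the approximation and detail sub-series (rather than the raw index) would be supplied as inputs to the MLP or SVR, so that slow and fast variations are presented to the regressor separately.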

Very recently, DL algorithms have also been applied to different problems in drought prediction. In Gyaneshwar et al. (2023), a review of the most important DL algorithms with application in drought prediction is presented. The work also includes a number of ML approaches for drought prediction. In Mokhtar et al. (2021), four ML and DL methods (RF, XGB, CNNs and LSTMs) were considered in a problem of SPEI estimation in the Qinghai-Tibet Plateau. Meteorological variables and climate indices are considered as predictive variables. In Abbes et al. (2023), a DL-based approach for drought forecasting that combines Long Short-Term Memory (LSTM) networks and the Multi-Resolution Analysis Wavelet Transform is proposed. Experiments on data from the Sarab region (Iran), based on the prediction of the Standardized Precipitation Evapotranspiration Index (SPEI), showed a good performance of this DL-based approach. In Kaur and Sood (2020), different ML and DL approaches such as ANN, ANN optimized with a genetic algorithm and deep neural networks, all hybridized with an SVR algorithm, are tested in a problem of drought prediction. The comparison of their performance shows that the deep neural network was the best-performing approach. In Vo et al. (2023), a hybrid model combining DL (LSTM networks) with a climate model is proposed for drought prediction. The proposed hybrid DL-based systems were tested on real data from South Korea. In Danandeh Mehr et al. (2022), a hybrid intelligent DL-based model for drought prediction, formed by the combination of CNN and LSTM networks, was proposed. This approach was tested in a drought prediction problem with multi-temporal drought indices (SPEI-3 and SPEI-6) as objectives, in the Ankara region, Turkey.

In close connection with drought forecasting, evaporation prediction has been tackled in some cases. For instance, Yaseen et al. (2020) evaluate ML approaches for evaporation prediction in arid regions of Iraq. Four different ML models are considered, including classification trees, a cascade correlation neural network, gene expression programming (GEP), and an SVM algorithm. Another recent work dealing with alternative prediction problems related to drought forecasting is Tufaner and Özbeyaz (2020), where the Palmer Drought Severity Index (PDSI) is predicted by using different ML algorithms. SVM, MLP and decision trees have been applied to this problem, and their results compared to a linear regression algorithm used as a baseline technique. Results in a problem of PDSI prediction in Anatolia (Turkey) have shown that the MLP obtains the best results. Finally, Adikari et al. (2021) evaluate the performance of three different ML algorithms (convolutional neural networks (CNN), long short-term memory networks (LSTM), and wavelet decomposition functions combined with the adaptive neuro-fuzzy inference system (WANFIS)) in two different problems of flood and drought forecasting. The results obtained reveal that the CNN is the best of the compared approaches for flood forecasting, while WANFIS outperforms the other two algorithms in drought forecasting.

3.3.1 Analysis

The review of articles about ML techniques for drought and related problems has shown a large number of ML algorithms applied to drought prediction and analysis. Ensemble methods such as RF seem to be strong approaches for prediction problems related to drought, though other algorithms such as neural networks or statistical learning approaches (SVMs) have also proven to be strong alternatives. DL-based algorithms have also been successfully applied to different drought prediction cases, mainly in the last few years. The inclusion of the physics is, in the majority of cases, treated by considering climate indices among the predictive (input) variables of the problems, though some works such as Dikshit et al. (2020) have found that climate indices used as predictive variables do not improve the performance of ML algorithms in specific problems of drought prediction. In Vo et al. (2023), a hybrid approach which directly involves a DL algorithm and a climate model is proposed for drought prediction. In general, processes related to atmospheric dynamics seem to dominate this phenomenon, so the inclusion of climate indices as inputs for ML algorithms seems a reasonable choice in order to capture the physics of the problem. Regarding the objective variables for defining the problem, the majority of the works analyzed used precipitation indices such as SPI or SPEI as drought indicators.

3.4 Severe weather

EEs related to severe weather have also been studied and analyzed with ML methods in the last few years. We have divided this subsection into four parts: ML methods in convective systems studies, tropical cyclones, hailstorms, and extreme winds and gusts.

3.4.1 Convective systems

There are different works focused on the study of convective clouds and systems formation and related events with ML approaches (Xiu et al. 2016; McGovern et al. 2023).

In Tebbi and Haddad (2016), a problem of convective cloud classification in northern Algeria is tackled by combining ANN and SVM on high-resolution satellite images. The proposed system works in two steps. First, the system detects rainy areas in cloud systems, and second, it delineates convective cells from stratiform ones. In Sahoo and Bhaskaran (2019), a problem of storm surge and coastal flood prediction with artificial neural networks is tackled. The work is focused on Odisha state (India), trying to simulate the effects of the tide caused by the super cyclone of 1999. Comparison with the ADCIRC prediction model (Luettich et al. 1992) shows that the ML-based model is able to obtain significant results in the prediction of the storm surge and associated flood of the Odisha event. In Guijo-Rubio et al. (2020), a problem of classification of convective situations over Madrid-Barajas airport is tackled with neuro-evolutionary techniques (neural networks trained with evolutionary computation techniques). The problem is considered a multi-class classification problem, highly imbalanced (there are far fewer convective situations than clear days). However, the neuro-evolutionary approaches are able to obtain an accurate performance in the identification of days with convective cloud formation at Madrid airport. A similar problem is tackled in Guijo-Rubio et al. (2020) by considering ordinal regression techniques instead of classification. Another study is presented in Jergensen et al. (2020), where a problem of thunderstorm classification is tackled with different ML approaches, such as logistic regression algorithms, RF, gradient-boosted forests and SVMs. The problem has been formulated as a multi-class classification problem, in which the gradient-boosted forest algorithm obtained the best classification results. In Hill et al. (2020), the RF algorithm is evaluated in problems related to convective systems.
The study includes different EEs from convective systems such as the presence of tornadoes, large hail (over 1 inch) or induced wind gusts over 58 mph. A large number of predictive variables are considered in this study, including different atmospheric fields such as 10-m winds, surface temperature and specific humidity, precipitable water, accumulated precipitation, and wind shear from the surface at different pressure levels or mean sea level pressure, among others. The RF algorithm was able to obtain relationships between predictive atmospheric fields and observations according to the community’s physical understanding of severe weather forecasting. Dealing with a similar idea, McGovern et al. (2017) evaluates the performance of RF and Gradient Boosted Regression Trees in a problem of prediction skill for multiple types of high-impact events related to convective systems, such as severe wind, hail or heavy rain, with discussion on the impact of this severe weather in renewable energy or aviation turbulence. In Flora et al. (2021), three ML approaches RF, gradient-boosted trees, and logistic regression algorithms have been proposed to predict whether ensemble storm tracks will produce a tornado, severe hail, and/or severe wind report. The paper describes postprocessing using the ML algorithms of the ensemble output from the National Oceanic and Atmospheric Administration Warn-on-Forecast (WoF) project. The results obtained have shown that the ML-based postprocessing of WoF data improves short-term, storm-scale severe weather probabilistic guidance.

In Stubenrauch et al. (2023), ML techniques are used to improve the construction of an accurate 3D description of upper tropospheric cloud systems, in order to study the relation between convection and cirrus anvils. For this, different ANN models are trained on collocated radar-lidar data to obtain estimations of cloud top height, cloud vertical extent and cloud layering. ML methods are also used to estimate rain intensity classification in upper tropospheric cloud systems. In Shamekh et al. (2023), using a ML approach based on neural networks, it is shown that it is possible to discover the role of the organization of clouds on precipitation, and then include this information to improve precipitation prediction in climate models.

Finally, DL-based approaches have also been tested in prediction problems related to severe convective systems, such as in Zhou et al. (2019), where a CNN is introduced for severe convective weather prediction, including heavy rain, hail, convective gusts, and thunderstorms. The predictive variables are obtained from a numerical weather model (Global Forecasting System), and the performance of the CNN is compared to that of traditional methods and human expert evaluation of the data. The results showed that the CNN outperformed previous algorithms and human experts, but with some flaws such as too many false alarms in predicting hail and convective gusts. In Sobash et al. (2023), different DL-based approaches (DNN, CNN and CNN-Gaussian mixtures) were used to probabilistically classify CAM storms into one of three different modes: supercells, quasi-linear convective systems, and disorganized convection. The storm mode classification is very useful to provide information about the hazard types of different convective systems.

3.4.2 Tropical cyclones

Other EEs associated with severe weather are tropical cyclones (TC). In addition to their associated extreme gusts, they always come with other severe weather events such as heavy rain, hail, or thunderstorms, on many occasions leading to catastrophic events such as floods (Chen et al. 2020), storm surges (Xie et al. 2023), landslides, etc. There is a very recent comprehensive review on ML approaches in TC forecasting (Chen et al. 2020). That article covers previous works on ML for TC up to 2020. There have been some works dealing with topics related to ML for TC after that review paper. For example, there is some recent work dealing with ML for TC prediction and characterization, such as Baki et al. (2021), where multivariate adaptive regression splines (MARS) have been applied to obtain the optimal values of the WRF mesoscale model parameterizations for TC prediction in the Bay of Bengal. In Tan et al. (2021), a gradient boosting decision tree model has been proposed for TC track forecasting in the Western North Pacific. A comparison with climatology and persistence is carried out to evaluate the performance of the proposed ML technique in this problem. In Sun et al. (2021), ensemble methods optimized by ML approaches such as Lasso optimization or Ridge regression are proposed to improve preseason prediction of Atlantic hurricane activity. In Pillay and Fitchett (2021), an analysis of the initialization variables affecting TC formation is carried out. RF algorithms are proposed to analyze the importance of each climate variable considered. The RF models are also used to predict intensification magnitudes of the TC based on the state of the input variables.

In Kar and Banerjee (2021), different ML algorithms have been applied to a problem of cloud intensity classification in TC over the Bay of Bengal and the Arabian Sea. Five ML classifiers have been proposed for this problem: Naïve Bayes, SVM, logistic model tree, random tree, and RF. The RF algorithm showed the best performance over the rest of the tested classifiers for this problem. In Kim et al. (2021), a decision-tree algorithm has been proposed for a problem of TC maximum lifetime intensity. The algorithm predicts the probability that a TC reaches a maximum intensity larger than 70 knots. Accurate results are obtained with classification rates over 90% in the considered test set. There have been some works dealing with the estimation of the precipitation produced by TC using ML techniques. In Zhu and Aguilera (2021), the RF method is applied to a problem of prediction of the precipitation associated with TC in Eastern Mexico. In Ngo et al. (2021), a hybrid Quantum PSO algorithm and a Credal Decision Tree (CDT) ensemble have been proposed for spatial prediction of the flash floods in TC. Experiments are carried out in the northwestern mountainous area of Vietnam. Satellite data from Sentinel-1 C-band SAR images are considered in this case to model the objective function. Finally, there are some recent works dealing with ML applications for evaluating the impacts of TC. In Nethery et al. (2021), ML algorithms, mainly Bayesian methods, are used to estimate health problems caused by TC. In Wendler-Bosco and Nicholson (2021), the economic impact of TC is analyzed by means of ML approaches, and in Zhang et al. (2021), the impact of typhoon Lekima on different Chinese forests is evaluated by means of RF over Landsat 8 OLI images. In Meng et al. 
(2023), different gradient boosting approaches have been proposed for probabilistic forecasting of TC intensity from different predictive variables such as sea surface temperature data, satellite brightness temperature data, and data from other models and satellite-derived variables. Finally, in Ascenso et al. (2023), an ML framework based on evolutionary computation techniques (genetic algorithms; Del Ser et al. 2019) is applied to the optimization of TC genesis indices. This approach is shown to obtain an index which captures the spatial and interannual variability of tropical cyclone genesis.

As in the case of other EE applications, DL-based algorithms have been widely used for TC prediction, mainly in the last few years. In Asthana et al. (2021), a CNN was used to predict Atlantic hurricane activity from reanalysis data. Accurate prediction results are reported, in comparison with alternative state-of-the-art models. In Farmanifard et al. (2023), a problem of TC trajectory prediction is tackled with a DL algorithm, formed by a hybrid MLP-LSTM approach. This approach was evaluated using the North Atlantic Ocean TC dataset, with input data such as wind speed, wind direction, and air pressure in the zone of study. Another work dealing with TC trajectory prediction was presented in Wang et al. (2023), where DL approaches (RNN, LSTM, and GRU) were applied to predict TC trajectories in the northwestern Pacific in the reanalysis period. In Zhuo and Tan (2023), DL algorithms were applied to a problem of TC size estimation from infrared imagery data in the Western North Pacific. The DL algorithms developed were then applied to a homogeneous satellite database to reconstruct a new historical dataset of TC sizes in the zone. In Chen et al. (2023), a study on rapid intensification of TC with DL-based algorithms (LSTM networks) is carried out. The results show that the LSTM network is able to improve the intensity and rapid intensification prediction performance for Western Pacific TC by using information from satellite images.

3.4.3 Hailstorms

Hail is an atmospheric EE which causes important economic problems in many countries, mostly in agriculture through crop losses. Though it is not a frequent EE (return periods of severe hailstorms have been set around 20 years, depending on the zone, according to different studies (Fraile et al. 2003)), there are some works on prediction and characterization of this EE, including the use of ML techniques in recent years. Note, however, that prediction of hailfalls is a difficult task, due to the local spatial character of this EE and its short duration, which means that prediction approaches should be developed separately for specific geographic areas.

One of the first works dealing with a prediction problem of hailfalls is López et al. (2007), in which the problem is tackled as a binary classification task (hail/no-hail). A logistic regression was then applied, obtaining a probability of detection (POD) of 0.87 with a false alarm ratio (FAR) of 0.18. After this initial work on hailstorm prediction, some more sophisticated ML methods were introduced. In Gagne et al. (2015), a hybrid approach mixing NWM with ML algorithms is proposed for a problem of hailfall forecasting. The NWM identifies potential hailstorms, and different ML algorithms, mainly RF and gradient-boosting trees, are used to predict hail occurrence. Observed hailstorms are used to obtain the ground truth values for this problem.
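The two verification scores quoted above for the binary hail/no-hail task follow from the contingency table of forecasts against observations. A minimal sketch using the standard definitions is given below; the function names and the counts in the example are hypothetical and are not taken from López et al. (2007).

```python
def probability_of_detection(hits, misses):
    """POD: fraction of observed events that were correctly forecast."""
    return hits / (hits + misses)

def false_alarm_ratio(hits, false_alarms):
    """FAR: fraction of forecast events that did not occur."""
    return false_alarms / (hits + false_alarms)

# Hypothetical contingency-table counts for a hail/no-hail classifier.
hits, misses, false_alarms = 40, 10, 10
pod = probability_of_detection(hits, misses)
far = false_alarm_ratio(hits, false_alarms)
print(f"POD = {pod:.2f}, FAR = {far:.2f}")
```

A good forecast has a POD close to 1 and a FAR close to 0; the two scores are usually reported together because either can be improved trivially at the expense of the other.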

RF approaches have been recently applied to problems of hailfall prediction. In Gagne et al. (2017), a storm-based probabilistic hail forecasting is proposed, including an RF algorithm in the system. The prediction starts with an identification and tracking algorithm based on radar grid data and a convection-allowing model. Different parameters for characterizing the storm are then obtained and passed to the RF algorithm which has been previously trained with data from observed hailstorms. The RF algorithm uses this information to predict the probability of a storm producing hail, and also provide the hail size estimation. In Czernecki et al. (2019), a RF algorithm has been proposed for a problem of large hail prediction. Different predictive variables such as radar reflectivity, EUCLID lightning detection data, and convective indices from the ERA5 reanalysis are considered. The objective variables are obtained from observational data of large hail reports from Poland in the period 2008–2017. Also dealing with hail prediction using a RF algorithm, Yao et al. (2020) used hail observation data from 41 meteorological stations in the Shandong Peninsula, China, in the period 1998-2013 to train the algorithm. Different thermal factors and variables such as lifted index, Showalter stability index, and total index are used as predictive variables of hailfalls in this work. Another example of the use of RF in hail prediction is Burke et al. (2020), in which different observational datasets were used to train and test the RF approach, such as the maximum estimated size of hail (MESH), and the multi-radar multi-sensor (MRMS) product.

Finally, some recent works have applied DL approaches to problems of hail prediction. In Pullman et al. (2019), a DL network has been applied to a problem of hailstorm detection. GOES satellite imagery and MERRA-2 reanalysis data are used as predictive variables in this case. In Gagne et al. (2019), a CNN is applied to the problem of predicting the probability of severe hail (larger than 2.5 mm) in the next hour. Data for this study have been obtained from the NCAR convection-allowing ensemble in May 2016. In Leinonen et al. (2023), a DL-based approach is presented for a problem of thunderstorm prediction, using multiple data sources such as weather radar, lightning detection, satellite visible/infrared imagery, numerical weather prediction, and digital elevation models. The DL model is able to predict lightning, heavy hail and precipitation probabilistically on a reduced spatial resolution (about 1 km) and with prediction time horizons between 5 min and 1 h. In Kolios (2023), a DNN model for hail detection is proposed. The input data consist exclusively of satellite (Meteosat) multispectral infrared (IR) imagery. The DNN model was trained using numerous cases of hail events, as recorded in the European Severe Weather Database.

3.4.4 Extreme winds and gusts

Extreme wind gusts (EWG) are associated with severe weather. They can have catastrophic effects on crops and buildings and also have an impact on renewable energy facilities such as wind farms. A first review of techniques for wind gust (WG) prediction, including NWM and ML approaches, has been presented in Sheridan (2018). In Sallis et al. (2011), several ML algorithms (logistic regression, MLPs, C4.5 classification trees and CART) are tested in a problem of WG prediction at Kumeu, New Zealand. In Shanmuganathan and Sallis (2014), a similar problem was tackled, also in New Zealand. In this case, the study evaluates the performance of classification trees, MLPs and Self-Organizing Maps (SOM), using in-situ measurements and data acquired between 2008 and 2012 at the Kumeu site. In Lagerquist et al. (2017), a problem of extreme wind prediction in the surroundings of storm cells in the USA is addressed. The problem consists of estimating the probability of extreme winds over 50 kt (25.7 m/s) in zones close to storm cells, and it is formulated as a binary classification problem. The predictive variables considered in this case are based on radar measurements, storm motion and shape, and atmospheric soundings in the near-storm environment. Several ML models have been tested, including logistic regression, RF, MLPs and gradient boosting tree ensembles. In Wang et al. (2020), an ensemble model for WG prediction is presented. The proposed ensemble includes RF, a long short-term memory (LSTM) algorithm and Gaussian processes for regression. A comparison against each model on its own, persistence and a gradient-boosted decision tree showed the good performance of the ensemble method. Also dealing with ensemble models, in Schulz and Lerch (2021), a comprehensive review and comparison of eight ML-based ensemble methods for WG forecasting is carried out. 
The proposed methods are tested on 6 years of data from a high-resolution ensemble prediction system of the German weather service. In Spassiani and Mason (2021), a SOM is proposed to analyze the meteorological origin of WG in Australia. The SOM is used to classify the meteorological origin of WG into convective (from thunderstorms) and non-convective (synoptic) origin, with different subclasses in each case.

In Arul et al. (2022), an RF approach is applied to the identification of extreme wind field characteristics and associated wind-induced load effects on structures, via the detection of thunderstorms. The idea is to use large databases of continuous wind speed data sampled at high frequency, together with the shapelet transform, to identify individual attributes distinctive of extreme wind events. Experiments are carried out using real data from 14 Mediterranean ports, including sites in Italy, Spain and France.

In Peláez-Rodríguez et al. (2022), a hierarchical classification-regression ML approach is proposed for a problem of extreme wind prediction. The approach starts with the application of clustering algorithms and different balancing techniques to increase the significance of clusters with poorly represented wind gust data. Each sample is then classified into its corresponding cluster and, once the cluster is determined, a final regression level provides the prediction of the wind speed value. This approach has shown excellent results when enough data are available to train all the ML algorithms involved in the prediction system.
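
The three levels of such a hierarchical scheme (grouping, classification, per-cluster regression) can be illustrated with a minimal sketch. It is not the implementation of Peláez-Rodríguez et al. (2022): a threshold on the target stands in for the clustering step, a nearest-centroid rule stands in for the classifier, a one-variable linear fit stands in for each regressor, the balancing step is omitted, and the data are synthetic. All function names and the value of `gust_threshold` are assumptions made for illustration.

```python
import statistics

def fit_line(xs, ys):
    """Least-squares fit of y = a * x + b for a single predictor."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx

def train_hierarchy(xs, ys, gust_threshold):
    """Return per-cluster centroids (classifier) and linear models (regressors)."""
    # Level 1: group samples; a threshold on the target stands in for clustering.
    labels = [int(y >= gust_threshold) for y in ys]
    centroids, models = {}, {}
    for cluster in (0, 1):
        cx = [x for x, lab in zip(xs, labels) if lab == cluster]
        cy = [y for y, lab in zip(ys, labels) if lab == cluster]
        centroids[cluster] = statistics.fmean(cx)  # level 2: classifier parameters
        models[cluster] = fit_line(cx, cy)         # level 3: one regressor per cluster
    return centroids, models

def predict_hierarchy(x, centroids, models):
    """Classify x into the nearest cluster, then apply that cluster's regressor."""
    cluster = min(centroids, key=lambda c: abs(x - centroids[c]))
    a, b = models[cluster]
    return cluster, a * x + b

# Synthetic data (illustrative only): the response steepens in the extreme regime.
xs = [i / 10 for i in range(101)]
ys = [2 * x if x < 8 else 5 * x - 24 for x in xs]

centroids, models = train_hierarchy(xs, ys, gust_threshold=16.0)
print(predict_hierarchy(2.0, centroids, models))  # ordinary regime
print(predict_hierarchy(9.5, centroids, models))  # extreme regime
```

In this toy setting a sample at x = 7 is assigned to the extreme cluster although its target lies below the threshold, which illustrates how classification errors propagate to the regression level.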

In Chkeir et al. (2023), a DL approach based on an LSTM network is applied to a problem of extreme rain and wind speed nowcasting in the area of Malpensa airport, by merging different datasets from sensors in the local area of the airport. The results showed a probability of detection of extreme wind speed higher than 90%, with false alarms lower than 2% in this particular problem.
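
Detection and false-alarm scores of this kind are usually derived from a binary contingency table. The following sketch shows the standard definitions on synthetic data. The summary above does not state whether "false alarms" refers to the false alarm ratio or the false alarm rate, so both are computed; the function names and the example series are assumptions.

```python
def contingency_table(observed, predicted):
    """Count hits, misses, false alarms and correct negatives for binary events."""
    pairs = list(zip(observed, predicted))
    hits = sum(1 for o, p in pairs if o and p)
    misses = sum(1 for o, p in pairs if o and not p)
    false_alarms = sum(1 for o, p in pairs if not o and p)
    correct_negatives = sum(1 for o, p in pairs if not o and not p)
    return hits, misses, false_alarms, correct_negatives

def probability_of_detection(hits, misses):
    """Fraction of observed events that were predicted."""
    return hits / (hits + misses)

def false_alarm_ratio(hits, false_alarms):
    """Fraction of predicted events that did not occur."""
    return false_alarms / (hits + false_alarms)

def false_alarm_rate(false_alarms, correct_negatives):
    """Fraction of observed non-events that were predicted as events."""
    return false_alarms / (false_alarms + correct_negatives)

# Synthetic example: 1 marks an extreme wind event, 0 a non-event.
observed = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

h, m, fa, cn = contingency_table(observed, predicted)
print(probability_of_detection(h, m))
print(false_alarm_ratio(h, fa))
print(false_alarm_rate(fa, cn))
```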

3.4.5 Analysis

The large majority of EEs related to severe weather are meteorological events in which thermodynamic processes of the atmosphere play a central role. Depending on the EE considered, the return period can be very long, as for damaging hailstorms, though other EEs classified as severe weather are much more frequent. The physics of these EEs is usually taken into account in ML through NWM (the ML algorithms are applied to the output of NWM), as in Gagne et al. (2015), which is the most effective way to consider the thermodynamic processes that characterize these EEs, together with in-situ measurements such as radar reflectivity or convective indices (Gagne et al. 2017; Czernecki et al. 2019). However, note that we have classified as severe weather different meteorological events, each with specific peculiarities. For example, convective systems and hailstorms are related, quite local events in which thermodynamics and the atmospheric state play an important role that is very difficult to include as predictive variables in ML approaches. In extreme winds and gusts, however, the dynamics of the atmosphere may be of significant importance to describe the phenomenon, and thus the synoptic situation provides information which may be exploited by ML algorithms (Spassiani and Mason 2021), in addition to other local atmospheric variables describing convective systems. It is also relevant that, in recent years, the number of DL-based techniques applied to severe weather EEs has increased considerably, indicating the research line that future applications to severe weather EEs are likely to follow.

3.5 Fog and extreme low-visibility

Low-visibility EEs, usually associated with fog formation (Gultepe et al. 2007) or turbidity in the atmosphere due to pollution, severely affect transportation facilities such as airports (Cornejo-Bueno et al. 2020; Guerreiro et al. 2020) and roads (Peng et al. 2018; Wu et al. 2018). ML algorithms have been successfully applied in recent years to many fog and low-visibility prediction problems.

In Marzban et al. (2007), a hybrid approach involving MLPs and NWM (mesoscale model) is proposed for a problem of ceiling and visibility prediction in the USA. A total of 20 meteorological variables are considered as inputs for the MLP, obtaining a good visibility prediction at 39 measurement stations in the north-west of the USA. In Fabbian et al. (2007), MLPs were tested in a problem of fog event prediction at Canberra International Airport (Australia), from meteorological observations. Data from the Australian Bureau of Meteorology were used to train and test the neural networks, obtaining promising results. In Miao et al. (2012), a fog prediction system formed by fuzzy logic-based predictors was proposed and analyzed at Perth Airport (Australia). The fuzzy logic predictor worked on the outputs of a mesoscale numerical model (LAPS125), with the objective of refining the predictions obtained by the numerical model. This fog prediction model was operational at the airport, and its outcomes were combined with those of two other fog forecasting methods by means of a majority voting approach.

In Colabone et al. (2015), the performance of MLPs with a back-propagation training procedure in a fog event prediction problem at Academia da Força Aérea (Brazil) is analyzed. In Boneh et al. (2015), a Bayesian network is applied to a fog prediction problem at Melbourne Airport. In this case, the problem is tackled with a prediction time horizon of 8 h, and 34 years of data have been used to train the network. This fog prediction system has obtained better results than previous systems, becoming operational for fog prediction at Melbourne Airport. In Bartoková et al. (2015), a decision tree for short-term fog prediction in Dubai is presented. The decision tree is able to improve the results of mesoscale models such as WRF for prediction time horizons of up to 6 h. In Cornejo-Bueno et al. (2017), different ML regression techniques have been tested on a fog prediction problem at Valladolid airport, Spain. In this case, radiation-type fog events are the most common in the zone, so the prediction problem is restricted to winter months. The authors reported successful results in event prediction using support vector regression algorithms and extreme learning machines. In Zhu et al. (2017), a deep neural network has been applied to a problem of low-visibility prediction at Urumqi airport, China. Meteorological variables measured at the airport between 2007 and 2016 are used to feed the deep neural network. In Durán-Rosal et al. (2018), evolutionary neural networks are considered for a problem of fog event classification from meteorological input variables. Several types of evolutionary neural networks are considered, by selecting different basic neuron types (sigmoidal, product and radial). A multi-objective training procedure is applied, obtaining good results in the fog event classification problem. In Guijo-Rubio et al. (2018), a problem of low-visibility events due to fog is tackled by applying ordinal classification methods. 
Three classes were considered (fog, mist and no-fog), and different ordinal classifiers were successfully tested in this problem of fog event prediction. In Dietz et al. (2019), decision tree models and tree-based ensembles with boosting are applied to a problem of very short-term prediction of low-visibility procedure states at Vienna airport, Austria. The work shows that for prediction time horizons under 1 h, the current low-visibility state (persistence), cloud ceiling, and horizontal visibility are the most important variables to take into account. For longer prediction time horizons, visibility information from the airport’s surroundings and meteorological variables become relevant.

In Bari and Ouagabi (2020), different ML algorithms (tree-based ensembles, feed-forward neural networks and generalized linear methods) have been applied to the output of an NWM (mesoscale model, WRF), for a problem of low-visibility prediction in Northern Morocco. In Li et al. (2020), a decision tree algorithm (C4.5 approach) has been applied to a problem of low-visibility prediction in the city of Nanjing. The work has shown that, in this case, the variables related to humidity and particle concentrations (relative humidity, PM10 and PM2.5) are the most important factors to obtain accurate predictions of visibility at Nanjing. In Yu et al. (2021), a hybrid approach combining extreme gradient boosting and NWM has been applied to a problem of visibility prediction in Shanghai, China. A large number of predictive variables are considered, such as air pollutant concentrations, meteorological observations, aerosol optical depth data and satellite images. The proposed hybrid approach provides a more accurate visibility forecast for prediction time horizons of 24 and 48 h than LGBM algorithms and NWM on their own.

In Cornejo-Bueno et al. (2020), the persistence and ML prediction of low-visibility events are studied at Valladolid airport, Spain. The performance of binary classifiers is evaluated in a problem of radiation fog prediction in winter. In Cornejo-Bueno et al. (2021), a problem of low-visibility event prediction due to orographic forcing is analyzed with ML regressors at Lugo, Northwestern Spain. The work includes the statistical analysis of the low-visibility events in this zone. In Castillo-Botón et al. (2022), a thorough comparison of several ML algorithms in fog prediction problems is carried out. Both classification and regression techniques are analyzed, including balancing techniques and data augmentation methods to improve the performance of ML in fog event prediction.

In close connection with low-visibility events, in this case due to dust storms, in Ebrahimi-Khusfi et al. (2021) the number of dusty days is predicted with ML techniques in Northern Iran. SVR, RF and Stochastic Gradient boosting are the ML algorithms successfully applied to this problem. In Ding et al. (2022), the prediction of hourly low-visibility events is tackled at 47 Chinese airports, by means of different ML approaches such as MLP, RF, regression trees (CART) and KNN, among others. The results show important differences in performance across airports and across seasons (better performance in the cold season than in the warm season).

Finally, DL-based techniques have gained importance recently. In Miao et al. (2020), a long short-term memory (LSTM) neural network has been applied to a problem of fog forecasting in the Anhui province, China. A comparison with K-Nearest Neighbours, AdaBoost and CNN algorithms has shown that the LSTM network obtains better results. In Ortega et al. (2023), the performance of several DL models for visibility forecasting using climatological time series data is evaluated. Different DL models are considered, such as deep neural networks, CNNs and LSTMs. Results on data from two weather stations in Florida (USA) show a good performance of the DL algorithms. In Peláez-Rodríguez et al. (2023), several DL ensembles are discussed for a problem of low-visibility event prediction in Northern Spain (orographic fog). Recurrent neural networks, LSTM networks, Gated Recurrent Units and CNNs are the DL approaches considered in this ensemble approach. The ensemble performed better than each algorithm on its own, and it also outperformed the alternative ML approaches it was compared with in all cases. In Wang et al. (2022), a deep learning model combining PCA and a deep belief network (DBN) is proposed for a problem of low-visibility event prediction. This approach was able to improve the results obtained by different ML and DL alternatives. In Zang et al. (2023), an RNN model is applied to a problem of low-visibility event prediction in Southern China. Comparisons with other DL-based algorithms, including CNNs, have shown a good performance of this DL-based method.

3.5.1 Analysis

ML analysis of fog events has been intense in the last few years. Fog formation may follow different physical mechanisms (Gultepe et al. 2007). For example, radiation fog, typical of inland areas, usually occurs in winter under anticyclonic conditions, when clear skies and atmospheric stability allow the nocturnal radiative cooling required to saturate the air (Román-Cascón et al. 2012). On the other hand, advection fog occurs when moist, warm air passes over a colder surface and is cooled from below, producing an immediate condensation of water. This kind of fog is very common at sea when moist and unstable warm air moves over cooler waters. If the moist warm air moves up a hill or slope, it undergoes an adiabatic expansion which cools it as it rises, allowing the moisture to condense and thus producing fog, usually called orographic or hill fog. Note that the dissipation mechanisms and persistence of these fog events also differ depending on the formation process (Cornejo-Bueno et al. 2020; Salcedo-Sanz et al. 2021). The inclusion of physics in ML approaches should take into account these formation and dissipation mechanisms, depending on the type of fog event considered. The best way of doing so is to use as inputs meteorological variables related to fog formation or dissipation, as has been done in the majority of cases. Also, some works have used NWM as a previous step before the application of ML algorithms, as in Marzban et al. (2007); Bari and Ouagabi (2020); Yu et al. (2021). Finally, note that the application of DL-based approaches has been very notable in recent years, with different works discussing DL techniques and DL-based ensembles for visibility prediction problems (Ortega et al. 2023; Peláez-Rodríguez et al. 2023).

3.6 Final discussion

As reviewed in previous sections, a large number of ML algorithms have been applied to a wide class of problems in EE detection, prediction and attribution. EE problems at different spatiotemporal scales have been tackled with ML algorithms. In some cases, long-term physical processes related to atmospheric dynamics seem to be predominant (heatwaves, extreme temperatures, droughts and floods in some cases), while in other cases, local short-term processes associated with thermodynamics are the predominant factor of the problem (convective systems, flash floods or extreme fog events).

Fig. 11 Monthly averaged temperature in Paris, between 1950 and 2021. Mega-heatwave of 2003 is highlighted in the time series

We have broadly detected three types of approaches using ML in the literature reviewed, across all the EE problems considered in this work. First, there are articles in which ML algorithms have been applied raw, i.e. without any reference to the physics of the problem. Usually, these works propose approaches based on time series of measured values, or involve signal processing techniques such as series decomposition, wavelets, etc. In general, these approaches have been compared exclusively against alternative methods fully based on ML or against autoregressive approaches such as ARIMA, and they offer little discussion of the physical reasons for the good or bad performance of the algorithms. A second type of approach comprises those works which try to take into account the physics of the problem through the input variables considered in the ML methods. Depending on the problem, certain input variables may capture physical aspects such as atmospheric dynamics (synoptic situations, Rossby waves, climate indices and other related variables) or thermodynamic processes (convective or stability indices and other related variables, usually from reanalysis data, satellites or direct measurements). Finally, the third type of works reviewed in this section are those ML approaches which are hybridized with physical or numerical models that consider the physics of the problem, or which are coupled with physical models in order to improve their outputs. Different versions including hybridization/coupling with numerical models such as WRF, Analogue-based algorithms, and other NWM have been reviewed in this section. In general, these latter hybrid approaches compared favourably with physical models and also with other ML approaches. 
In some cases, future projections based on CMIP6 models have been carried out from ML approaches, in attribution-related problems.

It is also remarkable that different problem encodings and frameworks have been used in the EE problems reviewed. Classification and regression frameworks have been used, depending on the specific EE, at very different spatiotemporal scales, from local to synoptic and global scales, and at short-term and long-term temporal scales. The number of input variables in ML algorithms is an important issue in many of the approaches reviewed. In many cases, FS mechanisms are needed in order to improve the results of ML algorithms. In general, the articles reviewed reported successful ML applications to EEs, but the comparison with alternative approaches can be biased. For example, approaches in which physical processes are not taken into account in the ML are not usually compared with alternatives that include physical models, but only with other ML methods. In those works in which ML methods have been hybridized with NWM to include the physics of the EE, an improvement over the NWM has been reported. In many cases, this ML hybridization with NWM is focused on downscaling, where ML algorithms are used to improve the spatial resolution of NWM.

Finally, we have detected a clear increase of DL-based techniques in recent years, in all kinds of EE detection, prediction and attribution problems. This trend has become much more pronounced from 2020 onwards, and currently (2023) the large majority of works on EEs deal with DL-based techniques. The trend seems set to continue, owing to the better results obtained with DL techniques and to their flexibility and ease of use with spatiotemporal data, which suit problems in atmospheric EEs better than traditional ML approaches.

4 Case study: summer temperature prediction with ML and DL approaches. Results and open problems

ML approaches devoted to characterising and predicting heatwaves and extreme temperatures have been discussed earlier in this paper (Sect. 3.2). In this case study, based on reanalysis data for France, we present and discuss different problem formulations, together with some results and issues related to summer temperature prediction, where heatwave signals can be detected. A final subsection provides an outlook, a summary of findings and open problems arising from this case study.

Fig. 12 Anomaly of the monthly averaged temperature in August 2003 with respect to the averaged temperature from 1950 to 2002

Fig. 13 Synoptic regular grid similar to the one considered in this case study

4.1 August mean temperature prediction in France based on ML approaches and synoptic predictive variables from reanalysis

In this first problem definition, the prediction of August mean temperature using ML approaches is addressed. To give a first definition of the problem, a specific case of August mean temperature prediction in central France is considered, where there have been extremely severe summer heatwaves in the last 20 years (García-Herrera et al. 2010; Ouzeau et al. 2016; Barriopedro et al. 2011). Let T(t) be an objective time series of air temperature (2 m temperature, for instance, or any other similar air temperature variable), obtained at a given point or averaged over a set of known points. In our case, T(t) stands for the mean temperature of a summer month (August) at the location of interest. Air temperature from ERA5 reanalysis data (Hersbach et al. 2020) has been considered in this case, as previous works confirm that reanalysis data can be successfully used in the prediction of extreme temperatures (You et al. 2013). Fig. 11 shows the objective August mean temperature (2 m temperature) in the Paris area (France) from 1950 to 2021. Note that in some cases it is possible to spot a heatwave signal in T(t), such as the mega-heatwave of August 2003 in Europe (Fig. 12).

Let \(V(t',\textbf{x})\) be the set of predictive variables, usually defined on a regular spatial grid \(\textbf{x}\), over time. Note that we write \(t'\) since it may not coincide with the time t in T(t). In this problem, we consider a synoptic regular grid (Fig. 13) covering France, on which we define a number of predictive variables to estimate T(t), also obtained from the ERA5 reanalysis (Hersbach et al. 2020). Table 1 shows the predictive and target variables considered in this work.

Table 1 Predictive and target variables considered in this case study, as obtained by ECMWF (2022)

We consider the problem of predicting the mean temperature of August T(t) (regression problem), by using the values of the predictive variables \(V(t',\textbf{x})\) in the previous months (\(t'\) stands for the months of June/July of the same year). This approach is similar to that in Oettli et al. (2022), but focused on the summer temperature. Different ML and DL techniques among those described in Sect. 2 are considered to tackle this problem. Specifically, RF, DT, MLP, SVR, LSTM networks, and different dimensionality reduction techniques have been evaluated. We have also included a Linear Regression (LR) approach for comparison purposes.

Several research questions arise here. For instance, we want to assess whether the variables in \(V(t',\textbf{x})\) contain enough information to obtain a good quality prediction of T(t) with ML approaches. Regarding extreme values, we aim to know whether the models are able to provide a good quality prediction of extreme temperature values with a prediction time horizon of one month. The problem of obtaining the best set of features (dimensionality reduction) for the ML algorithms also arises here. In order to address these research questions, we show different results and discuss different open problems found when dealing with this case study.

4.2 Experimental results and research issues

We have structured the results in several subsections, which discuss results using input variables from a single reanalysis node (local approach), results using several reanalysis nodes (synoptic approach), and issues regarding the prediction problems, mainly the small number of training samples available and how to mitigate it by adding new training samples with oversampling approaches. In addition, a feature selection method is shown and a DL approach is studied.

Fig. 14 Mean August temperature prediction problem considered. Reanalysis nodes for predictive (blue) and objective variables (red)

Table 2 MAE and MSE of ML regressors considered in the temperature prediction problem by reanalysis spatial diversity

4.2.1 Input variables from one reanalysis node

A simple regression problem is addressed first. In this case, a single reanalysis node is considered for extracting the predictor variables. The target node is located, as mentioned above, in France. Different predictor variables are taken from this same point, with the aim of predicting the target (August temperature). Figure 14 shows the considered node, in red. In order to tackle the problem, we first define a training and test partition of the data. The available data cover the period from 1950 to 2021. The period 1950–2002 is considered for training, whilst the period 2003–2021 is used as the test set to evaluate the results. Note that annual data are considered, so only 53 samples are available for training the algorithms, and 19 test samples are available to evaluate the skill of the models. Table 2 (first column) shows the MAE obtained by the different ML algorithms for this simple first case, and Fig. 15 details the predictions obtained by each ML algorithm. As can be seen, the predictions obtained by the ML algorithms are in general not fully accurate. The different skills shown by the ML models should be highlighted. MLP is the worst approach in this case, with a MAE of 2.45, followed by DT, with a MAE of 1.88. In this problem, LR, RF and SVR perform better than MLP and DT, with MAE values of 1.34, 1.43 and 1.52, respectively. These results suggest that the training database is not large enough, and that more data are needed to reach a better performance of the models.
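
The chronological partition and the error metrics used in this comparison can be sketched as follows. This is an illustrative sketch only: the function names are assumptions, and no reanalysis data are read.

```python
def chronological_split(years, last_train_year):
    """Return index lists for the training and test periods."""
    train_idx = [i for i, y in enumerate(years) if y <= last_train_year]
    test_idx = [i for i, y in enumerate(years) if y > last_train_year]
    return train_idx, test_idx

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

years = list(range(1950, 2022))  # one August sample per year
train_idx, test_idx = chronological_split(years, last_train_year=2002)
print(len(train_idx), len(test_idx))  # 53 19

# Toy values (not results from this study) to show the metric calls.
print(mae([20.0, 21.0, 25.0], [21.0, 21.0, 23.0]))
print(mse([20.0, 21.0, 25.0], [21.0, 21.0, 23.0]))
```

Splitting by year rather than at random keeps all test samples strictly later than the training period, which matches the forecasting setting described above.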

Fig. 15 Temperature prediction considering variables from a single reanalysis node

4.2.2 Exploiting spatial diversity of reanalysis data to improve ML accuracy

The question that arises at this point is, can we generate additional training samples with the aim of improving the skill of the prediction model? A simple strategy is shown in this section. It allows for increasing the number of training samples by exploiting the spatial diversity of the reanalysis data.

Let us return to the problem tackled above, with a single reanalysis node and 5 predictive variables. If we consider a local approach, note that there are a large number of reanalysis nodes in the neighbourhood of the selected one. In Fig. 14, we have marked a set of neighbouring reanalysis nodes in blue (81 nodes) around the red point. It is important to note that all the predictive (input) and objective (T) variables are available at all the points considered. Since we are in a local approach, we can assume a similar behaviour of the variables across the selected grid, so that the samples from the surrounding grid points can be added to the training data set. Thus, an oversampling approach is introduced (Torgo et al. 2015), by exploiting the spatial diversity of the reanalysis in a local approach. In this particular case, we finally obtain 4293 training samples (\(81 \times 53\)) instead of the initial 53 samples.
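
The sample count can be reproduced with a short sketch. It assumes, for illustration only, that the 81 nodes form a 9 × 9 block centred on the target node (the text only states the number of nodes), and it replaces the reanalysis fields with a hypothetical synthetic generator, `synthetic_node_sample`.

```python
import random

N_VARS = 5                   # predictive variables per node
YEARS = range(1950, 2003)    # 53 training years
# Assumed layout: a 9 x 9 block of nodes centred on the target node.
OFFSETS = [(di, dj) for di in range(-4, 5) for dj in range(-4, 5)]

def synthetic_node_sample(year, di, dj, rng):
    """Hypothetical stand-in for reading predictors and target at one node."""
    features = [rng.gauss(0.0, 1.0) for _ in range(N_VARS)]
    target = 20.0 + 0.5 * features[0] + 0.05 * (di + dj) + rng.gauss(0.0, 0.3)
    return features, target

def build_training_set(years, offsets, seed=0):
    """Stack one (features, target) sample per year and per node."""
    rng = random.Random(seed)
    X, y = [], []
    for year in years:
        for di, dj in offsets:
            features, target = synthetic_node_sample(year, di, dj, rng)
            X.append(features)
            y.append(target)
    return X, y

X_single, y_single = build_training_set(YEARS, [(0, 0)])
X_over, y_over = build_training_set(YEARS, OFFSETS)
print(len(X_single), len(X_over))  # 53 4293
```

Each neighbouring node contributes its own predictors and its own target, so the number of input variables stays at 5 while the number of training samples grows by a factor of 81.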

Table 3 and Fig. 16 show the new results when reanalysis spatial diversity is used to generate oversampling. As shown, the prediction capability of several ML models improves. In this scenario, the largest improvement is for DT (MAE 1.88 \(\rightarrow \) MAE 1.02), which achieves an accurate prediction. The performance of the MLP model also improves with the oversampling approach (from MAE 2.45 to MAE 1.54), as does that of the SVR (from MAE 1.52 to MAE 1.36). The LR and RF do not improve their results when oversampling by reanalysis diversity is considered, but the deterioration in performance is not very pronounced.

In this way, it is shown that the oversampling approach based on reanalysis spatial diversity can improve the performance of most of the ML regressors in the temperature prediction problem considered.

4.2.3 Extension to several input reanalysis nodes

Let us consider a second problem, with several reanalysis nodes to carry out the prediction of the heatwaves. We show a case with four reanalysis nodes in Fig. 17.

Table 3 MAE and MSE of ML regressors considered in the mean August temperature prediction problem in France, with and without oversampling by reanalysis spatial diversity

Note that, in this case, we consider 5 variables per reanalysis node, so each additional node adds 5 predictive variables to the data set. Thus, a total of 20 predictive (input) variables are now considered in the problem, still with 53 training samples. Increasing the number of input variables with just 53 training samples is not expected to lead to better results. Table 4 shows the results obtained with all the ML algorithms considered. As can be seen, in general terms the prediction skill of the models for T(t) is no better than in the case in which a single reanalysis point is considered.

Fig. 16 Temperature prediction considering oversampling by reanalysis spatial diversity (one initial reanalysis node)

Oversampling can also be introduced when several reanalysis nodes are considered. For that purpose, the spatial diversity of the ERA5 data is again exploited. The diversification points are shown in Fig. 17. Note that, for each reanalysis node, we can generate diversity by randomly selecting one of its neighbour nodes. In this way we exploit the fact that neighbouring reanalysis nodes provide similar predictive variables or target values, and we can generate a large number of new training samples. Table 5 shows the results obtained by including oversampling by reanalysis spatial diversification. As can be seen, the LR improves its result considerably, whereas the rest of the ML algorithms are only slightly affected by diversification in this case, obtaining slightly worse results in general. Figure 18 shows the results obtained in the test set, which are, as can be seen, worse than those obtained by considering a single reanalysis node with oversampling.

4.2.4 ML-based oversampling and undersampling approaches

In ML, an oversampling procedure consists of increasing the number of observations by generating new data samples, in order to improve the performance of the training algorithms. In classification problems, oversampling techniques are commonly used with unbalanced or small data sets. There are different oversampling techniques. For classification tasks, the most commonly used is the SMOTE algorithm (Chawla et al. 2002). It creates new samples taking into account the statistics of existing ones, diminishing the risk of creating samples in “wrong” areas. For regression problems, similar techniques exist, such as the SMOGN algorithm (Branco et al. 2017).
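
The interpolation idea behind this family of methods can be sketched in a few lines. The sketch below is a simplified SMOTE-style interpolation adapted to regression and should not be read as an implementation of SMOGN, which additionally relies on a relevance function and on Gaussian noise. The function name, the parameters and the data are assumptions made for illustration.

```python
import math
import random

def interpolate_oversample(X, y, n_new, k=3, seed=0):
    """Create n_new synthetic samples between a sample and one of its k nearest neighbours."""
    rng = random.Random(seed)
    X_new, y_new = [], []
    for _ in range(n_new):
        i = rng.randrange(len(X))
        # k nearest neighbours of sample i (excluding itself), Euclidean distance.
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )[:k]
        j = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        X_new.append([a + lam * (b - a) for a, b in zip(X[i], X[j])])
        # The target is interpolated with the same weight as the features.
        y_new.append(y[i] + lam * (y[j] - y[i]))
    return X_new, y_new

# Synthetic data, for illustration only.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
y = [10.0, 11.0, 12.0, 30.0, 31.0]

X_syn, y_syn = interpolate_oversample(X, y, n_new=20)
print(len(X_syn), min(y_syn), max(y_syn))
```

Because every synthetic sample is a convex combination of two existing neighbours, new samples stay close to regions already covered by the data, which is the sense in which the risk of creating samples in “wrong” areas is reduced.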

In contrast, undersampling methods decrease the number of samples. This technique is commonly used in problems with unbalanced data sets, with the aim of reducing the size of the majority class.
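As a minimal sketch of undersampling, the snippet below randomly discards samples until every class has as many samples as the smallest one. Random selection is only one of several possible undersampling strategies, and the function name `random_undersample` is hypothetical.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original ordering of the retained samples
    return X[keep], y[keep]
```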

Fig. 17 Temperature prediction problem with input variables from 4 reanalysis nodes

In order to test ML-based oversampling approaches, we address a case in which four reanalysis nodes are considered (Fig. 19). The SMOGN algorithm is used to generate ML-based oversampling in the problem. Tables 6 and 7 show the results without and with SMOGN oversampling. As can be seen, the performance of all tested regression models improves when oversampling is considered, except for LR, for which the results are worse with oversampling (Fig. 20).

Table 4 MAE and MSE from ML regressors considered in the temperature prediction problem with four input reanalysis nodes
Fig. 18 General case with diversification

The first DL-based approach consists of a combination of two different models. The first one is a VAE model. As explained above, this type of DL model learns from historical data, using unlabelled data. In our case, the model is fed with the variables that may drive the event under study (extreme temperature): the geopotential height at 500 hPa (\(Z_{500}\)), the sea surface temperature (sst) and the \(t_{2m}\). Thus, the input data is composed of three channels, one per variable. The variables, periods and regions to be used are design choices in their own right; in this scenario, we focus just on the model, not on the selection of the variables, regions and lag times. Once the VAE model is trained, its encoder can be used to encode the input data. The intermediate representation of the data in the VAE may have a lower dimension than the original data, so the dimensionality of the data is reduced. This latent space can be used as the input of the second model, in this case an MLP (Fig. 21). The MLP makes the temperature prediction, using the latent space as input data and the labelled target data (temperature) for training. It is important to note that two separate training processes are carried out, since the MLP is not trained until the VAE has been trained. This approach achieves significant results, which are comparable with persistence (operator \(x(t)=x(t-T)\)) and climatology (operator \(x(t)=\frac{1}{N}\sum _{j=1}^N x(t-j)\)) of the zone. Figure 22 shows the results obtained by the VAE-MLP compared to persistence and climatology. As can be seen, the proposed hybrid VAE-MLP improves on both persistence and climatology in this problem, obtaining more accurate results with respect to the ground truth (average weekly temperature). The differences between the VAE-MLP and persistence are important, and the improvement is more significant for longer prediction time horizons, as expected. The comparison with the climatology of the zone shows smaller differences. In general, the VAE-MLP improves on climatology for the shorter prediction time horizons (1 and 2 weeks in advance); however, for predictions 3 and 4 weeks in advance, the performance of the VAE-MLP is very similar to that of climatology, though still better than persistence.
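The two reference operators used in this comparison, persistence \(x(t)=x(t-T)\) and climatology \(x(t)=\frac{1}{N}\sum _{j=1}^N x(t-j)\), can be sketched in a few lines, together with the MAE and MSE scores reported in the tables. The function names are hypothetical, and the sketch assumes that the index \(j\) simply counts steps of the series `x` (how the series is built, e.g. weekly values or the same week in previous years, is left open here).

```python
import numpy as np

def persistence_forecast(x, T):
    """x(t) = x(t - T): returns (predictions, matching ground truth)."""
    x = np.asarray(x, dtype=float)
    return x[:-T], x[T:]

def climatology_forecast(x, N):
    """x(t) = mean of the N previous values: returns (predictions, ground truth)."""
    x = np.asarray(x, dtype=float)
    preds = np.array([x[t - N:t].mean() for t in range(N, len(x))])
    return preds, x[N:]

def mae(pred, truth):
    return float(np.mean(np.abs(pred - truth)))

def mse(pred, truth):
    return float(np.mean((pred - truth) ** 2))

# toy series, only to show how the baselines are scored
x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(mae(*persistence_forecast(x, T=1)))   # 1.0
print(mae(*climatology_forecast(x, N=2)))   # 1.5
```

Any candidate model, such as the VAE-MLP, would be scored on the same target values so that the errors are directly comparable.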

Table 5 MAE and MSE of ML regressors considered in the mean August temperature prediction problem in France, with and without oversampling by reanalysis spatial diversity
Fig. 19 Mean August temperature prediction problem in France, with input variables from 4 reanalysis nodes in the case in which SMOGN method is considered

Table 6 MAE and MSE of ML regressors considered in the temperature prediction problem without and with SMOGN technique applied on the database by reanalysis spatial diversity
Table 7 MAE and MSE of ML regressors considered in the mean August temperature prediction problem in France, without and with SMOGN technique applied on the database by reanalysis spatial diversity

There are other alternatives for feature selection. For example, the wrapper approach (see Sect. 2.1) can be applied to estimate weekly temperature in France, using an evolutionary algorithm for the search process together with a fast-training ML approach (ELM), as described in Sect. 2.1. This scheme can be improved by first applying a spatial clustering to the problem. In this way, the evolutionary algorithm must select a variable from each cluster, which adds a further dimensionality reduction to the process. Figure 23 shows an example of this in the problem of temperature prediction in France. The coloured squares represent the different zones from which the algorithm must select variables. In this way it is possible to restrict the zones to different sizes (synoptic, global), so that they describe dynamic processes of different temporal scales.
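The wrapper scheme with previous clustering can be illustrated with the following sketch. It is a deliberately simplified stand-in, not the algorithm used in the case study: the search is a basic (1+1) evolutionary strategy that mutates one cluster at a time, the ELM is a minimal single-hidden-layer version solved with a pseudo-inverse, and all names (`elm_fit_predict`, `fitness`, `wrapper_select`) and parameter values are hypothetical.

```python
import numpy as np

def elm_fit_predict(X_tr, y_tr, X_va, n_hidden=20, seed=0):
    """Minimal ELM: fixed random hidden layer, output weights by least squares."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X_tr.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H_tr = np.tanh(X_tr @ W + b)
    beta = np.linalg.pinv(H_tr) @ y_tr
    return np.tanh(X_va @ W + b) @ beta

def fitness(sel, X_tr, y_tr, X_va, y_va):
    """Validation MSE of the ELM trained on the selected features only."""
    pred = elm_fit_predict(X_tr[:, sel], y_tr, X_va[:, sel])
    return float(np.mean((pred - y_va) ** 2))

def wrapper_select(X_tr, y_tr, X_va, y_va, clusters, n_iter=200, seed=0):
    """Select one feature per cluster with a (1+1) evolutionary search.

    clusters : list of 1-D integer arrays, the feature indices of each spatial cluster
    """
    rng = np.random.default_rng(seed)
    best = np.array([rng.choice(c) for c in clusters])
    best_fit = fitness(best, X_tr, y_tr, X_va, y_va)
    for _ in range(n_iter):
        cand = best.copy()
        g = rng.integers(len(clusters))      # cluster to mutate
        cand[g] = rng.choice(clusters[g])    # new feature from that cluster
        f = fitness(cand, X_tr, y_tr, X_va, y_va)
        if f <= best_fit:                    # keep the candidate if not worse
            best, best_fit = cand, f
    return best, best_fit
```

Restricting the search to one variable per cluster fixes the dimension of the ELM input to the number of clusters, which is what keeps each fitness evaluation cheap.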

4.3 Case study outlook, findings, and open problems

In this case study, we have discussed the application of ML algorithms to a problem of mean temperature prediction in August from reanalysis data in France. We have defined the problem in this way in order to extend it to the prediction of heatwaves in France when smaller spatiotemporal scales are considered. In fact, even at a monthly scale, a heatwave signal can be detected in the August mean temperature in some cases of mega-heatwaves, such as that of 2003 in France (García-Herrera et al. 2010). In addition, we have observed the following issues from the application of the ML algorithms:

  • Since the problem definition involves annual samples (temperature in August), the training set has a very small number of samples. This point, combined with the fact that we have a large grid with a large number of predictive variables on it, makes the training of the ML algorithms an extremely hard task.

  • The results obtained in a first problem considering a single reanalysis node are far from accurate, due to the small number of training samples.

  • In order to improve the performance of the ML approaches, we propose to exploit the spatial diversity of the reanalysis data considered. First, we consider a fully local approach, including in the training set a number of reanalysis nodes neighbouring the target node. This oversampling approach generates new training samples, which allows a better training of the ML algorithms and improves the results obtained in the prediction of August mean temperature.

  • In a second attempt, we consider several reanalysis nodes to make the prediction, with oversampling by exploiting local diversity in each node. The prediction obtained in these two cases by the ML algorithms is poorer than in the previous cases, since the number of predictive variables is increased, and many more training samples would be necessary to improve the results.

  • We have also shown the performance of ML algorithms in this problem when ML-based oversampling with the SMOGN algorithm is applied. SMOGN is especially suited for regression problems. In this problem of August mean temperature prediction in France, SMOGN works well, producing oversampling which improves the performance of all ML algorithms except LR versus the case without oversampling.

  • Thus, we have shown that considering oversampling to expand the training set is a good option in this prediction problem with scarce data. We have proposed a novel oversampling approach that exploits the spatial diversity of reanalysis data, and we have shown that ML-based oversampling also works in the problem.

  • Finally, we have commented on the possible application of dimensionality reduction techniques, using a hybrid DL-based approach and a wrapper feature selection approach. We have shown that the hybrid DL-based algorithm formed by a VAE and an MLP improves on the persistence and climatology of the zone for prediction time horizons of up to 2 weeks, and performs similarly to climatology for prediction time horizons of 3 and 4 weeks. We have also outlined the introduction of a wrapper ML approach for feature selection in the problem, with a further dimensionality reduction using a previous spatial clustering. This approach gives the possibility of choosing different spatial scales for the predictive variables in the problem.

Fig. 20 Temperature prediction by applying SMOGN oversampling method

Fig. 21 VAE Structure and MLP

Fig. 22 Error DL model (VAE + MLP) a prediction time-horizon 1 week; b prediction time-horizon 2 weeks; c prediction time-horizon 3 weeks; and d prediction time-horizon 4 weeks

Fig. 23 Example of wrapper feature selection with previous spatial clustering for the problem of temperature prediction in France

There are several open problems in the prediction of annual mean air temperature from reanalysis data. We summarize them in the following points:

  • We have tackled local and synoptic versions of the problem from reanalysis data, with predictive variables going back just one month. However, it is known that heatwave detection (signal in mean monthly temperature) may have different drivers, some of them related to climate indices, which points to a global definition of the variables, with a time horizon for these predictive variables going back several months. In this global definition of the problem, managing the huge number of features involved will be extremely important to obtain significant results. Also, generating enough training samples for the ML algorithms is again a challenging aspect of the global version of the problem.

  • Note that there are different possible definitions of this prediction problem, depending on the data considered. We have shown an example with monthly temperature data, but quarterly, weekly or even daily time resolution can be chosen and will also contain heatwave signals. It is also possible to define the problem directly in terms of heatwave indices proposed in the past (Awasthi et al. 2022; Nairn and Fawcett 2015), which include other variables in addition to temperature.

  • In close relationship with the previous point, note that the problem can be tackled as a regression or a classification problem. We have shown here a regression version, where the direct prediction of T(t) is tackled. In a classification problem, the target is mapped as \(T(t) \rightarrow s[n]\), where \(s[n] \in \{0,1\}\) if we consider a binary classification problem (heatwave signal detected/no heatwave signal). This problem can be extended to a larger number of classes.

  • We have shown that the problem cannot be successfully tackled without considering the physics of the phenomenon. In other words, the ML approaches must be coupled with the physics of extreme temperatures, which acts at different levels and involves different physical aspects of the problem, such as atmospheric dynamics and thermodynamic processes, in order to improve the quality of the prediction.

  • There are also open problems related to the ML algorithms. As previously mentioned, it is clear that feature selection (see Sect. 2.1) is key for ML approaches to obtain significant results in the different versions of the problem. Due to the huge number of features involved in the problem, it is probable that wrapper methods on their own do not lead to good results, and a first feature discarding process based on filter approaches is needed. Other possible solutions, such as using clustering approaches to reduce the number of features in some specific zones, can also provide good results when the number of features is huge, such as in the global approach to the problem. We have outlined this possibility in the experimental section of the case study.

  • Deep learning (DL) approaches could be used to tackle the problem without taking special care of the huge number of features involved. We have shown a possible DL approach using a VAE hybridized with an MLP, but other DL schemes are possible. Specifically, in this approach, DL algorithms could be useful to exploit global information and obtain an accurate prediction of heatwaves. Issues related to DL training, such as the number of training samples and the significance of the results obtained, are the counterpart of this possible approach with DL algorithms.
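The mapping \(T(t) \rightarrow s[n]\) mentioned in the list above, from a regression target to class labels, can be sketched as a simple thresholding step. Defining the heatwave signal through a quantile of the temperature series is an assumption made here only for illustration; the function names and the default quantiles are hypothetical.

```python
import numpy as np

def heatwave_labels(T, q=0.9):
    """Binary labels s[n] in {0, 1}: 1 when T exceeds its q-quantile (assumed criterion)."""
    T = np.asarray(T, dtype=float)
    thr = np.quantile(T, q)
    return (T > thr).astype(int), thr

def multiclass_labels(T, quantiles=(0.5, 0.9)):
    """Extension to several classes: class index grows with temperature."""
    T = np.asarray(T, dtype=float)
    return np.digitize(T, np.quantile(T, quantiles))
```

In practice, the thresholds would be estimated on the training period only and then applied to the test years, so that no information from the test set leaks into the labels.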

5 Conclusions and perspectives for future research

5.1 Conclusions

In this paper, we have carried out a review of ML methods in the analysis, characterization, prediction and attribution of extreme atmospheric events (EEs). It is currently a hot topic, since EEs are increasing in the current situation of climate change, causing important damage to humans and ecosystems. After a brief review of the main ML approaches which have been previously applied to EE-related problems, we have carried out a comprehensive and critical analysis of this topic in the literature, including the main EEs, such as extreme rainfall and floods, heatwaves and extreme temperatures, droughts, fog and low-visibility events, and different topics related to severe weather (convective systems, tropical cyclones, hailstorms and extreme winds).

We have shown the application of several ML methods to a case study related to mean summer temperature prediction in France, from reanalysis (ERA5) data. We have shown the main issues related to this problem using ML, including the small number of samples available to train the ML approaches, the huge number of input variables and the different possible problem definitions (regression or classification tasks, prediction time horizon considered, etc.). We have also shown that the inclusion of the physics is a key point in order to obtain good results for this problem, so it is necessary to couple the ML algorithms with some physical information about the problem.

Note that the issues associated with the case study considered in this paper can be extrapolated to other similar problems in extreme atmospheric events, which share a similar data structure and scarcity of events and data. We have also given some solutions to these issues for the case study considered, such as including oversampling techniques from reanalysis diversity, or even using different reanalysis data or global climate models to generate new training samples for the ML algorithms. These proposed solutions can also be applied to other problems related to extreme atmospheric events.

5.2 Perspectives

We also discuss here some final lessons learned, open problems and research directions that are currently available for dealing with EEs using ML algorithms, such as the use of XAI techniques, improving the attribution of EEs with ML techniques, and improving the study of concurrent and compound events, where the lack of data to train the ML algorithms is even more pressing.

  • One of the main issues when dealing with ML approaches to EE prediction problems is the databases. Given the rarity of EEs, there are very few long-enough databases which provide reliable data for EE-related studies. Even reanalysis data, with more than 70 years of data worldwide with high spatial accuracy, may not be enough for some problems (the case study presented before is a good example of this). In these cases, oversampling may be of great help to improve the performance of ML algorithms. Note that just by considering two different reanalysis data sets (ERA5 and ERA20C, for example (Salcedo-Sanz et al. 2020)), we can double the number of samples in the training set, by considering the output of each reanalysis at the same nodes. This opens the possibility of using climate models (with different parameterizations) to multiply the number of training samples available. Another interesting possibility is the application of different oversampling techniques to increase the number of training samples in a given database. In the case of reanalysis-type data, or data defined on a regular grid, oversampling can be carried out in a natural way by considering neighbour nodes, or with tailor-made techniques depending on the specific problem considered. Yet, the use of model-based data (either reanalysis or climate models’ simulations) could potentially limit the ability of ML methods to learn relationships beyond the ones already implemented in the model. Moreover, training an ML algorithm on model-based data could lead to overestimated performance when tested against observational data, as model-based data do not perfectly reproduce the real climatic conditions due to modelling errors and assumptions (Hoffmann et al. 2020; Matsuoka 2022). We, therefore, advocate making the most of observational data as they represent a richer ground truth, although they are sometimes characterized by low data quality and missing values. Here, however, ML can also contribute with advanced methodologies to reconstruct missing climate information (Kadow et al. 2020).

  • Another niche of opportunity in the characterization of EEs is the use of explainable AI techniques to gain an informed understanding of the correlations modelled by ML models (Arrieta et al. 2020). Indeed, a large fraction of the ML models used nowadays in this area rely on complex structures and processing units (e.g. deep neural networks) that achieve unrivalled levels of performance at the cost of opaque training and inference processes. This contrasts with other models which, by virtue of their transparent internal structure or the way they are trained, elicit interpretable information about what features are relevant for the target at hand (e.g. tree-based bagging ensembles or linear regression). When this interpretability is not provided off-the-shelf, explanations can be generated ad hoc for already trained models, producing, as a result, visualizations, quantitative scores of predictive relevance or alternative what-if hypotheses for the model’s input, to mention a few (Montavon et al. 2018). This growing concern with explanatory techniques for ML models has spawned a whole area of research known as eXplainable Artificial Intelligence (XAI), which has become a topic of central importance in applied machine learning in almost any discipline. Such techniques have only recently started to be explored for extreme event prediction, from 2022 onwards. This is the case of van Straaten et al. (2022), where XAI was used to verify that an ML model trained to predict high summer temperatures from multiple predictors at different time scales agrees with the theoretical understanding of the underlying physical processes. However, several challenges remain that, in our view, should bring together the efforts of the community in the years to come. Among them, we highlight two distinct research directions:

    1. The need to step beyond correlation-based ML towards data-based causal inference (Peters et al. 2017). Since the goal of decision-making is to avoid, or at least minimize the consequences of, extreme events, data-based models should guarantee the actionability of the model’s input to steer the predicted output in one direction or another. Such interventional tools are being actively investigated nowadays in the context of ML, but models ensuring input-output causality are still far from mature (see Runge et al. (2019) and references therein), because they often require several assumptions (e.g. Gaussian distributions) that might be violated by the processes associated with EEs. At the same time, fewer assumptions are required for identifying the absence of a causal link (Runge 2018), which makes findings of non-causality already quite robust in determining when a cause-effect physical mechanism is unlikely to exist. We expect data-based causal inference to become increasingly attractive for supporting the trustworthiness of black-box ML models (Reichstein et al. 2019).

    2. The inherent uncertainty of the physical world and the atmosphere propagates to the output of the models devised to characterize the extreme phenomena occurring therein. Accordingly, a remarkable corpus of literature has striven to quantify the confidence of a model in its output, considering both the modelling (epistemic) uncertainty and the irreducible (aleatoric) uncertainty. While confidence analysis is a well-established area in ML research, the combination of confidence and explainability in a single framework is still to be seen. Indeed, explanations of uncertain models make no practical sense, nor do models that are certain about their predictions but do not explain what they model in the data at hand. Given the variability and incompleteness of atmospheric data and the large epistemic uncertainty of deep learning models, this area can undoubtedly benefit from advances such as evidential DL, variational neural networks or model-agnostic techniques such as conformal prediction. Confidence estimations provided by these techniques should be considered when furnishing explanations.

    On a summarizing note, we advocate for a focus of the research community steered towards modelling aspects that complement the derivation of more models and performance comparison studies. In other words, we advocate for ML approaches at the end of a pipeline driven by physics. In this review, we have shown very different examples in which the application of ML techniques without including the physical basis of the problem does not lead to relevant results in the majority of cases.

  • Improving attribution of EEs using ML. Few works deal with the attribution of EEs using ML techniques. In this review, we have discussed some of them, for specific events of heatwaves (Pasini et al. 2017; Zaninelli et al. 2023) and droughts (Richman and Leslie 2018, 2020). There are some recent works dealing with ML in general climate attribution problems (Mamalakis et al. 2022; Trifunov et al. 2021), and also on the specific attribution of forced climate change signals over atmospheric fields such as global temperature or precipitation (Barnes et al. 2019, 2020; Hartigan et al. 2020a, b). In Callaghan et al. (2021), a large study on the attribution of climate impacts with ML methods was carried out. However, it is necessary to extend these works to better cope with the attribution of EEs by using ML approaches. The application of novel ML/DL approaches specifically to attribution problems is another line to follow in the years to come. The study of causal inference with ML (Schölkopf 2022) is also a topic closely related to attribution, in which there are some recent works focused on extreme atmospheric events (Nethery et al. 2021; Liu et al. 2021).

  • ML for concurrent and compound EEs. The concept of concurrent event refers to (atmospheric) EEs of different types occurring within a specific temporal lag, either in different locations or at the same one. This concept can also be used for extremes of the same type occurring in two locations within a specific period (Toreti et al. 2019). On the other hand, compound events refer to the concomitant (within a given temporal lag) occurrence of events (extremes or not) with severe and harmful consequences of socio-economic relevance. Concurrent events can thus be seen as a subset of compound events. Although work on concurrent and compound events has been intense in recent years (Bresch et al. 2018; Zscheischler et al. 2020; White et al. 2021; Markonis et al. 2021), the application of ML techniques to the prediction or attribution of concurrent or compound events has been minor. There is a very recent work discussing ML techniques applicable to compound events together with statistical and numerical techniques (Zhang et al. 2021), and some white papers and technical reports on the topic (Feng et al. 2021), but in general the application of ML to this topic is an open problem. The most important issue with ML approaches in concurrent and compound EEs is the lack of available data to study these types of situations. There have been some attempts to generate databases for concurrent and compound events (Feng et al. 2020), but in general further efforts are needed to strengthen this topic, so that ML methods can be successfully applied in this area.
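Returning to the uncertainty quantification direction discussed in the perspectives above, the model-agnostic conformal prediction technique mentioned there can be illustrated with a minimal split-conformal sketch for regression. It assumes that a held-out calibration set with predictions from an already trained model is available and that calibration and new samples are exchangeable; the function name `split_conformal_interval` and its arguments are hypothetical.

```python
import numpy as np

def split_conformal_interval(y_cal, pred_cal, pred_new, alpha=0.1):
    """Prediction intervals with target coverage 1 - alpha from calibration residuals."""
    # nonconformity scores: absolute residuals on the calibration set
    scores = np.sort(np.abs(np.asarray(y_cal, dtype=float) - np.asarray(pred_cal, dtype=float)))
    n = scores.size
    # rank of the calibration score used as interval half-width
    k = int(np.ceil((n + 1) * (1 - alpha)))
    # with too few calibration samples the interval is unbounded
    q = scores[k - 1] if k <= n else np.inf
    pred_new = np.asarray(pred_new, dtype=float)
    return pred_new - q, pred_new + q
```

With very small calibration sets, as in the case study with annual samples, the interval can become infinite, which is itself an honest statement of how little the available data constrain the prediction.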