1 Introduction

Extreme precipitation events can cause significant damage, especially in densely populated regions (Jonkman 2005). One example occurred in July 2011, when a cloudburst hit the Danish capital, Copenhagen, leading to widespread flooding and substantial harm to society, the environment, and the economy (Ziersen et al. 2017). Extreme precipitation events are caused by specific atmospheric processes and can be correlated with both local and large-scale meteorological drivers. Knowledge about these mechanisms improves our understanding of the local physical climate and our ability to infer under what conditions extreme precipitation events occur, and thus enables better predictions of occurrences in the future.

Numerous studies have related extreme rainfall to meteorological drivers. The North Atlantic Oscillation (NAO) has been linked to the variability of Mediterranean precipitation (Xoplaki et al. 2004), to the occurrence and intensity of extreme precipitation events over northeast Spain (Vicente‐Serrano et al. 2009), and to changes in European winter precipitation and other extremes (Scaife et al. 2008). The East Atlantic Pattern (EA) has been related to convective anomalies in the tropical Atlantic (Maidens et al. 2021) and to spatial and temporal changes in the frequency of extreme rainfall events over Denmark (Gregersen et al. 2013a). Extreme precipitation events have been characterized using Convective Available Potential Energy (CAPE) and dew-point temperature in the eastern United States (Lepore et al. 2015) and the South-Central Andes (Ramezani Ziarani et al. 2019). Furthermore, previous research has related extreme rainfall events to humidity-related variables over the Mediterranean (Hertig et al. 2014; Hertig and Jacobeit 2013) and to sea surface temperature (SST) in tropical regions (Dittus et al. 2018).

Many different methods can be used to develop models that explore these relationships. Classic statistical models such as linear (Li and Wang 2018) or logistic (Chan et al. 2018) regression (LR) are commonly applied, depending on whether the outcomes are continuous or binary. Regression models are based on theory and explicit assumptions and benefit from domain knowledge for model specification, providing a clear framework for understanding the relationships between explanatory variables (Hastie et al. 2009).

Recently, the use of machine learning (ML) algorithms has become more widespread as an alternative approach for classification and prediction. ML is a subfield of artificial intelligence based on non-linear algorithms that adapt to and learn from data (Mitchell 1997). These algorithms can process vast amounts of multidimensional data, such as reanalysis, satellite, or radar data. ML has been used to predict extreme rainfall intensities (Davenport and Diffenbaugh 2021; Lee et al. 2012) and for rain/no-rain classification (Liu et al. 2001; Shi 2020).

The differences between ML and classic regression have been extensively explored in the literature (Breiman 2001a). For example, ML automatically includes non-linear associations and interaction terms, whereas for regression methods, they must be specified by the user (Boulesteix and Schmid 2014). Because of this adaptability, ML is claimed to offer superior predictive performance relative to traditional statistical modeling and to handle a greater number of explanatory variables better (Deo and Nallamothu 2016). However, as a downside of this flexibility, ML algorithms tend to overfit the data used for training, which must be compensated for by penalizing the complexity of the identified models. In scenarios where data are limited, feature engineering and feature selection become even more important when using ML models to ensure good performance and mitigate the risk of overfitting (Chen et al. 2020; Guth and Sapsis 2019).

In this study, the term “prediction” refers specifically to the outcomes generated by statistical and ML models, rather than a forecast of the future state of a variable (as in a weather forecast).

Although many studies in other fields (e.g., the health sciences) compare the performance of classic statistical models to different ML algorithms, there are only a few within the climate sciences. Wei et al. (2020) showed that a decision tree performs better than LR for extreme rainfall event classification. Meyer et al. (2016) found that a Neural Network (NNET) is a more suitable algorithm for satellite-based rainfall retrievals than Random Forest (RF) and Support Vector Machine (SVM), but LR was not part of the comparison. Lastly, Moon et al. (2019) suggested using LR instead of ML models as an effective early warning system for very short-term heavy rainfall in South Korea.

Even the most sophisticated machine learning algorithms rely on the quality of the input data; low-quality input data compromises the reliability of the results (Budach et al. 2022). Extreme events are very localized, so choosing a study area densely monitored with rain gauges, such as Copenhagen (Thomassen et al. 2022), is crucial for accurately capturing these events.

The novelty of the present work is two-fold: (i) To identify relationships between meteorological explanatory variables selected based on a priori domain knowledge and extreme events in the densely monitored Copenhagen area. This includes considering both variables previously linked to extreme precipitation events in various regions around the globe and local variables that capture the unique characteristics of the Copenhagen area. By examining the importance of these two types of variables, we aim for our findings to provide valuable insights into the specific drivers of extreme precipitation occurrences in this region and to serve as a foundation for extending our analysis to other regions. (ii) To systematically compare the explanatory performance of classification models developed using traditional LR and three ML models and to assess the similarities and differences of the most influential explanatory variables between models. The hypothesis is that traditional LR would result in the lowest performance. In summary, we address the following research questions:

  1. Which meteorological drivers can explain local extreme rainfall events in a densely monitored region?

  2. What is the relative importance of the drivers across different statistical models?

  3. Do ML models lead to improved performance compared to traditional statistical modeling?

2 Data and study area

2.1 Precipitation

Extreme precipitation events are often very localized (< 10 km horizontal scale), and a dense network of rain gauges is therefore needed to record extreme events when they occur. At the same time, linking explanatory variables to the occurrence of extreme precipitation events in a robust way requires long time series of observations. In this study, the precipitation data set consists of hourly observations for the period 1979–2020 from 15 gauges located in the Copenhagen area. The geographical locations of the stations appear in Fig. 1.

Fig. 1 Overview of the geographical data domains of the explanatory variables and the location of the rain gauge stations in the subplot. There are a total of fifteen stations in an area of 525.50 km2 (12–12.65° E and 55.57–55.84° N)

The gauges are part of a national network run by the Water Pollution Committee of The Society of Danish Engineers and have a measurement resolution of 0.2 mm (Gregersen et al. 2013a). The gauges have an average uptime of 95.8%, and data has been quality controlled by the Danish Meteorological Institute (DMI). This study only considers the five months of May to September each year, which define the main season of convective rainfall extremes in Denmark (Pedersen et al. 2012).

2.2 Extreme precipitation explanatory variables

This study employs explanatory variables that relate to the physics of convective rainfall events and have been identified in previous research as influential factors for extreme precipitation events. Local predictors that capture the unique characteristics of the study area are also investigated. The meteorological variables chosen from various sources and used as explanatory variables over different spatial domains (Fig. 1) are:

  1. Observed surface air temperature (SAT) data for 1979–2020 from one station in the middle of the island of Zealand. The temporal resolution changes over time, from 3-hourly observations in the first years to hourly observations in at least the last 20 years. The choice of this variable is motivated by the experience of local meteorologists that extreme precipitation events occur on days with high inland afternoon temperatures, when convection may be released. The data were quality controlled by DMI. More information about the observation protocols can be found in the supplementary material.

  2. The daily-mean sea surface temperature (SST) for the North Sea and Baltic Sea from the Copernicus Marine Service Information (DMI 2015) for the period 1982–2020. This is motivated by the idea that SSTs could influence convection if the air mass passes over the sea.

  3. The remaining explanatory variables were extracted from ERA5, the fifth-generation global reanalysis product of the European Centre for Medium-Range Weather Forecasts (ECMWF) (Hersbach et al. 2020), for the period 1979–2020. The data are generated at hourly time steps with a spatial resolution of roughly 30 km × 30 km. The selected explanatory variables are:

    • Geopotential height at 500 hPa (Z500), which determines flow strength and direction at the 500 hPa level, usually considered the steering level for meso-scale weather systems.

    • Convective available potential energy (CAPE) as a measure of atmospheric instability and the potential for convection due to the combination of vertical temperature gradients and humidity (further details in supplementary material).

    • Total Column Water (TCW) as a measure of the maximum amount of water available for precipitation in case of strong convection (further details in supplementary material).

3 Methodology

3.1 Preprocessing data

3.1.1 Definition of an extreme precipitation event

To develop a classification model for predicting extreme precipitation events, class labels first need to be generated. To ensure the independence of events at each station, an 11-h dry period is required between events (Thomassen et al. 2022). For each station, the most extreme events based on the maximum hourly intensity are sampled using a Peak Over Threshold (POT) method with type II censoring (Coles 2001), resulting in 3 events per year on average (Gregersen et al. 2013b). The date of an event is defined as the date when the maximum intensity was observed. Any day with an extreme recorded in at least one station is labeled as an extreme day. This leads to, on average, thirteen days with extremes per year for the region.
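To make the labeling procedure concrete, the sketch below separates events by an 11-h dry spell and keeps the largest peaks for one station. It is a minimal Python illustration, assuming hourly rainfall is available as a pandas Series per station; the variable names and the zero-intensity wet-hour criterion are simplifying assumptions, not the code used in the study.

```python
import pandas as pd

def label_extreme_days(hourly, events_per_year=3, dry_gap_hours=11):
    """Label extreme days for one station from hourly rainfall (a sketch).

    `hourly` is assumed to be a pandas Series of hourly intensities (mm/h)
    indexed by timestamp; names and thresholds are illustrative only.
    """
    wet = hourly[hourly > 0]
    # Start a new event whenever the dry spell since the previous wet hour
    # exceeds the required independence gap (11 h).
    new_event = wet.index.to_series().diff() > pd.Timedelta(hours=dry_gap_hours)
    event_id = new_event.cumsum()
    # Characterize each event by its maximum hourly intensity and its timing.
    peaks = wet.groupby(event_id).agg(["max", "idxmax"])
    peaks.columns = ["peak_intensity", "peak_time"]
    # Peak-over-threshold sampling with a fixed number of exceedances:
    # keep the n_years * events_per_year largest peaks.
    n_years = hourly.index.year.nunique()
    selected = peaks.nlargest(n_years * events_per_year, "peak_intensity")
    return pd.DatetimeIndex(selected["peak_time"].dt.normalize().unique())

# A day is labeled extreme for the region if any of the 15 stations records
# an extreme on that day, e.g.:
# extreme_days = set().union(*(label_extreme_days(s) for s in station_series))
```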

3.1.2 Explanatory variables

The purpose of including Z500 in this analysis is to examine whether specific patterns in the large-scale, regional atmospheric flow fields are associated with extreme precipitation events in the case area. The daily climatology of Z500 is determined in each grid point by first applying a 5-day centered moving window and then calculating the simple average of all values corresponding to each day of the year. Calculating the climatology with a 5-day smoothing window is a common technique in climatological studies to reduce the impact of day-to-day variability and reveal long-term patterns. Anomalies are then derived by subtracting the daily climatology from each corresponding day in the selected period, which leads to a regional 2D field of Z500 anomalies for each day. Including the Z500 anomaly values from all grid cells in the 2D field would drastically increase the number of input predictors for the models. It is therefore common practice in studies like this one to perform Principal Component Analysis (PCA) on the Z500 anomalies as a dimensionality reduction technique (Mastrantonas et al. 2021; Merino et al. 2019; Storch and Zwiers 1984). This way of processing Z500 fields is also referred to as EOF (Empirical Orthogonal Functions) analysis in climatology (Sun and Wang 2018; Yang et al. 2013). PCA represents the data in a lower-dimensional space, with significantly fewer variables than the original dataset, while ensuring independence among these variables. We weight the data by the square root of the cosine of the latitude to provide equal-area weighting in the covariance matrix. We retain the leading Principal Components (PCs) that together explain at least 90% of the total variability in the anomaly field, and combine this criterion with a scree plot as a graphical validation of the 90% threshold. During exploratory analysis, we also analyzed larger domains than shown in Fig. 1 (i.e., the North Atlantic), but this only resulted in a weaker ability to explain occurrences of extreme events. Moreover, we followed the domain size recommendations by Chen and Wang (2014) to optimize the effectiveness of PCA.
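As an illustration of this preprocessing chain, the sketch below computes the smoothed daily climatology, the anomalies, the latitude weighting, and the 90% variance truncation with NumPy and scikit-learn. The study itself uses the eofs package (Sect. 3.6); the array shapes and names here are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

def z500_pcs(z500, lats, dayofyear, var_target=0.90):
    """Daily Z500 anomalies -> leading PCs (a sketch, not the study's code).

    z500: array (n_days, n_lat, n_lon); lats: array (n_lat,) in degrees;
    dayofyear: array (n_days,) giving the calendar day of each field.
    """
    z500 = np.asarray(z500, dtype=float)
    n_days, n_lat, n_lon = z500.shape
    # 5-day centered moving window applied to the raw fields.
    smoothed = np.empty_like(z500)
    for t in range(n_days):
        smoothed[t] = z500[max(0, t - 2):min(n_days, t + 3)].mean(axis=0)
    # Daily climatology: average the smoothed fields for each day of the year.
    clim = np.zeros((366, n_lat, n_lon))
    for d in np.unique(dayofyear):
        clim[d - 1] = smoothed[dayofyear == d].mean(axis=0)
    anom = z500 - clim[dayofyear - 1]
    # Equal-area weighting by the square root of the cosine of latitude.
    weights = np.sqrt(np.cos(np.deg2rad(lats)))[None, :, None]
    flat = (anom * weights).reshape(n_days, -1)
    # Keep the leading PCs explaining at least `var_target` of the variance.
    pca = PCA().fit(flat)
    n_pc = int(np.searchsorted(np.cumsum(pca.explained_variance_ratio_), var_target)) + 1
    return pca.transform(flat)[:, :n_pc], pca.explained_variance_ratio_[:n_pc]
```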

We then performed K-means clustering on the retained PCs, as suggested by Mastrantonas et al. (2021), but the performance of the classification models was not improved compared to simply including the retained principal components, so for all analyses hereafter we use the retained PCs. See the Supplementary Material for details.

For CAPE and TCW, we are interested in the daily means at the location of the case study, and therefore the data were bilinearly interpolated from the original ERA5 grid cells to the exact geographical position of Copenhagen. For SAT, we extracted the daily maximum (SATmax), the daily difference between the maximum and the minimum (SATamplitude), and the daily maximum of the previous day (SATlag1). Lastly, for SST, the mean, the maximum, and the previous day's maximum value of the North Sea and Baltic Sea (Fig. 1) were tested as individual explanatory variables and in combination. An overview of the preprocessing methods can be found in Table 1.
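A compact sketch of how these daily features could be assembled with pandas is shown below; the input series names, the resampling choices, and the seasonal filter are illustrative assumptions rather than the study's actual pipeline.

```python
import pandas as pd

def build_daily_features(sat_hourly, sst_daily_max, cape_daily, tcw_daily):
    """Assemble the daily explanatory variables (a sketch; names are assumed)."""
    sat_day = sat_hourly.resample("D")
    daily = pd.DataFrame({
        "SATmax": sat_day.max(),                        # daily maximum SAT
        "SATamplitude": sat_day.max() - sat_day.min(),  # diurnal SAT range
        "CAPE": cape_daily,                             # daily mean at Copenhagen
        "TCW": tcw_daily,                               # daily mean at Copenhagen
    })
    daily["SATlag1"] = daily["SATmax"].shift(1)         # previous day's maximum SAT
    daily["SSTlag1"] = sst_daily_max.shift(1)           # previous day's maximum SST
    # Restrict to the convective season, May-September.
    return daily[daily.index.month.isin([5, 6, 7, 8, 9])].dropna()
```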

Table 1 Overview of the preprocessing methods for each explanatory variable

3.2 Classification models

We use four algorithms that can classify a discrete outcome based on continuous input: LR as a traditional statistical modeling approach and RF, NNET, and SVM as ML algorithms. All ML algorithms have one or more hyperparameters that control how well the model fits the data, and the optimal values for these parameters can vary from dataset to dataset. We do not impose pre-determined interactions on the ML models to allow them to leverage their inherent capabilities in handling variable interactions. This approach enables fair comparisons and allows each model to utilize variables and their interactions in the manner that best complements its underlying algorithm.

3.2.1 Logistic regression models

LR analysis is the most frequently used modeling approach for analyzing dichotomous response variables (i.e., occurrence or non-occurrence of an extreme event). It belongs to the family of generalized linear models (GLM). In a GLM, the three building blocks are (Lindsey 2000): a random component, a systematic component, and a link function. In LR, the random component \(Y\) follows a binomial distribution and can be represented by the model:

$$G\left( {E\left( Y \right)} \right) = X\beta^{T} + e$$
(1)

where \(G\) is a link function, \(X\) is the design matrix of n systematic components (explanatory variables), \(\beta^{T}\) is a vector of coefficients, and \(e\) is the residual vector term.

In this study, we use the logit link function (Cox et al. 1999):

$$G\left( \pi \right) = \ln \left( {\frac{\pi }{1 - \pi }} \right)$$
(2)

which monotonically maps (0, 1) to (−∞, +∞), where \(\pi\) is the probability of an extreme event and \(\frac{\pi }{ 1-\pi }\) is the corresponding odds of an event being extreme. The estimate of \(\pi\) is then given by:

$$\pi = \frac{{\exp \left( {X\beta^{T} } \right)}}{{1 + \exp \left( {X\beta^{T} } \right)}}$$
(3)

The coefficients \(\beta_{i}\), \(i=1,\dots ,p\), are estimated using maximum likelihood. The significance of the individual coefficients is assessed with the Wald statistic, which is simply the ratio:

$$Z = \frac{{\hat{\beta }_{i} }}{{s.e.\left( {\hat{\beta }_{i} } \right)}}$$
(4)

which can be used to test if \(\beta_{i} = 0\). The standard normal distribution is used to determine the p value of the test, and the confidence intervals are given by:

$$\hat{\beta }_{i} \pm z_{1 - a/2} s.e.\left( {\hat{\beta }_{i} } \right)$$
(5)

In addition to estimating the probability of an extreme event, LR also provides a measure of association between the response variable and the explanatory variables in the form of odds ratios. The odds ratio represents the change in the odds of the response for a one-unit increase in the explanatory variable, holding all other variables constant. In LR, the odds ratio is calculated as the exponentiated coefficient of the explanatory variable. An odds ratio with a value equal to one indicates no association between the explanatory variable and the response, while a value higher/lower than one indicates a positive/negative association.
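To make the odds-ratio interpretation concrete, the sketch below fits a logistic regression and exponentiates the coefficients and their confidence intervals. It is a minimal Python illustration using statsmodels; the study fits its LR models in R (Sect. 3.6), and the data objects are assumed to exist.

```python
import numpy as np
import statsmodels.api as sm

def fit_logistic(X, y):
    """Fit an LR model and report odds ratios (an illustrative sketch).

    X: DataFrame of explanatory variables; y: 0/1 series of extreme days.
    """
    Xc = sm.add_constant(X)               # design matrix with an intercept
    fit = sm.Logit(y, Xc).fit(disp=0)     # maximum likelihood estimation
    odds_ratios = np.exp(fit.params)      # exponentiated coefficients
    odds_ci = np.exp(fit.conf_int())      # 95% CI on the odds-ratio scale
    return fit, odds_ratios, odds_ci

# Reading the output: an odds ratio of, say, 1.3 for CAPE would mean that a
# one-unit increase in CAPE multiplies the odds of an extreme day by 1.3,
# holding the other explanatory variables constant.
```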

Understanding the strength of the relationships between the explanatory variables in a regression model is crucial since it may affect the reliability of the model's estimated coefficients. The Variance Inflation Factor (VIF) (O’brien 2007) is one way to assess the strength of these relationships. It is a statistical measure that quantifies the degree to which the variance of the estimated regression coefficient \(\beta_{i}\) increases due to the presence of multicollinearity among the explanatory variables in the model. The calculation of the VIF involves regressing each explanatory variable on all the other variables in the model, and the VIF for the ith explanatory variable is determined as:

$$VIF_{i} = \frac{1}{{1 - R_{i}^{2} }}$$
(6)

where \(R_{i}^{2}\) is the R2-value obtained by regressing the ith explanatory variable on the remaining variables. A VIF value of 1 indicates no collinearity, while values greater than 1 indicate an increasing degree of multicollinearity.
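A minimal sketch of this diagnostic, assuming the explanatory variables are columns of a pandas DataFrame, is given below using statsmodels.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X):
    """VIF for each explanatory variable in the design matrix X (a sketch)."""
    Xc = sm.add_constant(X)   # include an intercept so the VIFs are not inflated
    return pd.Series(
        [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
        index=X.columns, name="VIF",
    )
```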

Since we are analyzing time series data, it is important to examine the possibility of serial correlation in the regression residuals, i.e., deviations of the residuals from the white-noise behavior expected in LR. Serial correlation does not prevent precise predictions of the response within the model’s scope, but the standard errors of the estimated odds ratios will, in general, be underestimated. To evaluate the impact of serial correlation, we examine the significance of the regression odds ratios for the model fitted to all data versus a thinned data set that sub-samples every third and every fifth day, respectively. Thinning the data this way reduces the serial correlation in the response and, therefore, likely also in the regression residuals. By doing so, we aim to examine the robustness of our results to different levels of serial correlation. The choice of this thinning scheme is based on the partial autocorrelation function (see Supplementary Material).

Lastly, we experimented with modeling the explanatory variables with restricted cubic splines (Gauthier et al. 2020) to address potential nonlinearity, but the use of splines did not lead to any significant increase in the performance of the final model.

3.2.2 Random forest

Since its introduction in 2001 (Breiman 2001b), the RF supervised machine learning algorithm for classification has seen a significant increase in popularity. RF is an ensemble learning method consisting of multiple decision trees that results in reduced variance compared to single decision trees (James et al. 2021). Each tree is fitted to a different bootstrap sample of the original dataset (bagging) to increase the diversity among the decision trees. To further decorrelate the decision trees, RF randomly re-samples the explanatory variables at each split. In contrast to LR, RF is not restricted by the assumption of independence between explanatory variables, captures interactions between explanatory variables without them being specified, and can model complicated non-linear effects. One of the primary hyperparameters is the number of trees, and it is often regarded as best practice to grow as many trees as computationally possible.

We use the Gini impurity (Nembrini et al. 2018) to determine how the explanatory variables of the dataset should optimally split nodes when training a decision tree:

$$Gini = 1 - \mathop \sum \limits_{r = 1}^{2} \left( {\pi_{r} } \right)^{2}$$
(7)

where r indexes the two classes (extreme/non-extreme) and \(\pi_{r}\) is the proportion of class r in the node.
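For illustration, Eq. (7) applied to a single candidate node reduces to the small function below; the search over candidate splits is handled internally by the RF implementation.

```python
def gini_impurity(labels):
    """Gini impurity of a node containing binary labels, as in Eq. (7) (a sketch)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_extreme = sum(labels) / n   # proportion of extreme days in the node
    return 1.0 - (p_extreme ** 2 + (1.0 - p_extreme) ** 2)

# A candidate split is scored by the weighted Gini impurity of the two child
# nodes it creates; the split with the largest impurity decrease is chosen.
```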

3.2.3 Support vector machine

An SVM is a supervised ML model for classification (Cristianini and Shawe-Taylor 2000). The objective of the SVM is to find a hyperplane in the n-dimensional space (n = the number of explanatory variables) that distinctly separates the two classes. The hyperplane is defined as the set of points x satisfying:

$$w \cdot {\text{x}} + {\text{b}} = 0$$
(8)

The vector w and scalar \(b\) for the best hyperplane are determined by an optimization procedure that maximizes the margin between two classes in the n-dimensional space.

We use a soft margin technique that allows for a number of misclassified cases. This number is controlled by a hyperparameter (cost), which imposes a penalty on the model for making an error. Moreover, since the classification problem is non-linear, we use a kernel function that returns the dot product of the transformed vectors in the higher-dimensional space. We use the Gaussian radial basis function

$$G\left( {x_{i} ,x_{j} } \right) = \exp \left( { - \frac{{\left| {\left| {x_{i} - x_{j} } \right|} \right|^{2} }}{{2\sigma^{2} }}} \right)$$
(9)

where \(\sigma\) is the bandwidth of the kernel function hyperparameter and \({x}_{i}\) and \({x}_{j}\) represent two different data points from the dataset.

The SVM model performs poorly on imbalanced datasets (Fernández 2018), so we use an extension of the algorithm to increase its performance: we create weights for the data samples such that each sample is weighted according to the size of its corresponding class (extreme/non-extreme). Samples from the larger class are assigned smaller weights and vice versa.
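A minimal sketch of such a class-weighted RBF SVM in scikit-learn is shown below, where class_weight="balanced" applies weights inversely proportional to class frequency. The study's models are fitted with the R package caret (Sect. 3.6), and the cost and bandwidth values here are placeholders to be tuned.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# class_weight="balanced" down-weights the abundant non-extreme days, as
# described above. C (the cost) and gamma (related to the kernel bandwidth
# sigma) are placeholder values that would be tuned by cross-validation.
svm_model = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=1.0, gamma="scale",
        class_weight="balanced", probability=True),
)
# svm_model.fit(X_train, y_train); p = svm_model.predict_proba(X_test)[:, 1]
```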

3.2.4 Feed-Forward neural network

NNET is a family of ML algorithms that use one or more layers of nodes (also known as neurons), coupled by non-linear functions with adjustable parameters (weights), to map inputs to outputs (Bishop 2006). Network training aims to find a set of weights that minimizes the difference between the NNET output and the training labels. The loss function in classification models is usually the cross-entropy (Eq. 4.90 in Bishop 2006). The loss function is minimized using one of the gradient descent techniques, whereby a step is made in the direction of steepest descent (the negative of the gradient), with the size of the step regulated by the learning rate. However, even though NNETs may approximate any smooth non-linear function and allow for collinearity among the explanatory variables, they are challenging to interpret and can be computationally expensive to train.

3.3 Model tuning and validation

The data set is divided into training/validation data (80%) and testing data (20%). Within the training/validation subset, a tenfold cross-validation technique is applied to select the best model (Kohavi 1995). Training data are used for model training and fitting internal model parameters, while validation data are used for tuning model hyperparameters and variable selection. In the tenfold cross-validation process, the training/validation data are randomly partitioned into ten folds containing approximately equal numbers of years. Partitioning by whole years ensures that all subsets contain samples with the physical properties required for generating extreme precipitation events. At each iteration, one of the folds is chosen as the validation set, while the remaining folds are used as the training set. The model is fitted to the training set, and the performance metric is computed on the validation set. Finally, the performances are averaged across the ten iterations, and for each algorithm, the model that performs best across all iterations is selected as the final model. This final model is then evaluated on the independent testing data, which was initially set aside.
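The sketch below illustrates this tuning scheme in Python, using calendar years as cross-validation groups and ROC AUC as the selection metric. The study performs this step with the R package caret (Sect. 3.6); the random forest estimator and the parameter grid shown here are placeholders standing in for any of the four models.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

def tune_model(X_train, y_train, years_train):
    """Ten-fold cross-validation with folds built from whole calendar years
    (a sketch; the estimator and grid are placeholders)."""
    search = GridSearchCV(
        RandomForestClassifier(n_estimators=700, random_state=0),
        param_grid={"max_features": [2, 3, 4]},   # placeholder hyperparameter grid
        scoring="roc_auc",                        # selection metric
        cv=GroupKFold(n_splits=10),               # folds defined by year groups
    )
    search.fit(X_train, y_train, groups=years_train)
    return search.best_estimator_

# The best model from the search is then evaluated once on the 20% of the
# data that was held out as an independent test set.
```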

3.4 Performance metrics

Given the rarity of extreme events, the data are characterized by a very low fraction of days with extremes. The dataset is therefore imbalanced, so using a traditional performance metric such as accuracy \((ACC)\) could be misleading, since it can be maximized by simply predicting the majority class and thus ignoring the minority class (Liu et al. 2009). \(ACC\) is defined as:

$${{ACC}} = { }\frac{{{{TP}} + {{TN}}}}{{{{TP}} + {{TN}} + {{FP}} + {{FN}}}}$$
(10)

where \(TP\) = True Positives, \(TN\) = True Negatives, \(FP\) = False Positives, and \(FN\) = False Negatives, which are the four possible outcomes of binary predictions.

The receiver operating characteristic curve (ROC) (Mason and Graham 2002) is a commonly used validation tool for classification problems and consists of a two-dimensional graph that shows the true positive rate \((TPR)\):

$${\text{TPR}} = { }\frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}$$
(11)

versus the false positive rate \((FPR)\):

$${\text{FPR}} = 1 - { }\frac{{{\text{TN}}}}{{{\text{TN}} + {\text{FP}}}}$$
(12)

for all possible probability thresholds.

A random model will produce the diagonal line as its ROC curve, and a perfect model will have a ROC curve composed of the left and upper boundary lines. The "steepness" of ROC curves is hence important, since maximizing the benefits \((TPR)\) and minimizing the costs \((FPR)\) is ideal. ROC AUC measures the area underneath the entire ROC curve and is a measure of the ability of a model to distinguish between the two classes. It summarizes performance across all potential classification thresholds and varies between zero and one. The higher the ROC AUC, the better the model is at making classifications. The Mann–Whitney U test (Wilks 2011) is applied to estimate whether the model performs statistically better than a random model in terms of ROC AUC. The DeLong test (DeLong et al. 1988) is used to make pairwise comparisons of ROC AUC between LR and the ML models. Statistical significance is defined at \(\alpha\) = 0.05.

On the other hand, the area under the Precision-Recall (PR) curve (Saito and Rehmsmeier 2015) summarizes the trade-off between precision and \(TPR\) at all possible probability thresholds, taking into account the imbalance in the dataset. Precision is the fraction of \(TP\) predictions among the positive predictions made by the model:

$$Precision = \frac{TP}{{TP + FP}}$$
(13)

In imbalanced datasets, it is possible for a classifier to achieve a high ROC AUC by making a large number of false positive predictions, especially when the positive class is rare. In this case, the PR AUC will be much lower, reflecting the low quality of the positive predictions. Therefore, it is important to consider both the ROC AUC and the PR AUC when evaluating the performance of a binary classifier in imbalanced datasets, as they capture different aspects of the classifier's behavior (Davis and Goadrich 2006).

To assess if the models are well-calibrated, we use the Brier score. The score is given by:

$$Brier\,score = \frac{1}{n}\mathop \sum \limits_{t = 1}^{n} \left( {y_{t} - o_{t} } \right)^{2}$$
(14)

where \({y}_{t}\) is the predicted probability, \({o}_{t}\) is the observed outcome, and n is the number of observations. The Brier score also has shortcomings for imbalanced datasets (Benedetti 2010), but it can be used as a relative measure for model comparison.

We will use ROC AUC as the primary measure of model performance, and the secondary measures of performance are the PR AUC, the Brier score, and \(ACC\) together with \(TPR\) and \(FPR\) at the optimal threshold, which is the threshold that maximizes the sum of \(TPR\) and \(1- FPR\).
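The sketch below gathers these metrics in a single evaluation function using scikit-learn; it is an illustrative implementation that assumes predicted extreme-day probabilities are available, not the study's actual R code.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, auc, brier_score_loss,
                             precision_recall_curve, roc_auc_score, roc_curve)

def evaluate(y_true, p_pred):
    """Compute the performance measures used in the study (a sketch)."""
    fpr, tpr, thresholds = roc_curve(y_true, p_pred)
    precision, recall, _ = precision_recall_curve(y_true, p_pred)
    # Optimal threshold: the one maximizing TPR + (1 - FPR).
    best = int(np.argmax(tpr - fpr))
    y_hat = (p_pred >= thresholds[best]).astype(int)
    return {
        "ROC AUC": roc_auc_score(y_true, p_pred),
        "PR AUC": auc(recall, precision),
        "Brier": brier_score_loss(y_true, p_pred),
        "ACC": accuracy_score(y_true, y_hat),
        "TPR": tpr[best],
        "FPR": fpr[best],
        "threshold": thresholds[best],
    }
```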

In this study, variable selection for the ML models is done in the same way as for the regression model. This decision was made to establish a consistent and comparable framework for evaluating the different modeling approaches. By using the same set of variables across all models, we can directly compare the performance of the ML models against the LR model, highlighting the potential benefits of ML methods in terms of capturing complex interactions.

3.5 Variable importance

The contribution of the different explanatory variables to the overall performance was quantified for all models. For LR, we study the importance of each explanatory variable in explaining extreme events by using the difference in deviance between the full model (Pawitan 2014) and a model without the explanatory variable whose importance we want to assess. The deviance test statistic, \({D}^{*}\), is given as:

$$D^{*} = - 2(\ell (\beta^{0} ) - \ell \left( \beta \right))$$
(15)

where \(\ell(\beta^{0})\) is the log-likelihood of the reduced model, and \(\ell(\beta)\) is the log-likelihood of the full model. This test statistic has a chi-square distribution with 1 degree of freedom. If the null hypothesis is rejected, there is evidence that the left-out explanatory variable contributes significantly to the prediction of the outcome.

For RF and SVM, the variable importance is the decrease in ROC AUC after permutation of the variable series (Breiman 2001b). The idea is that if the values of an important variable are permuted, keeping all other variables the same, the performance will degrade. An explanatory variable is important if permuting its values decreases the model ROC AUC relative to the other variables, and unimportant if permuting its values leaves the ROC AUC almost unchanged. There can be cases where permuting a variable with very little explanatory power causes an increase in ROC AUC due to random noise; this results in negative importance scores, which are treated as equivalent to zero importance. Since the model includes only continuous explanatory variables, there is no bias in the permutation importance measure (Strobl et al. 2007). We repeat the same process five times to increase the stability of our estimates.
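An equivalent computation is sketched below with scikit-learn's permutation_importance, assuming a fitted model and held-out data are available; the study itself obtains these scores with the R package vip (Sect. 3.6).

```python
from sklearn.inspection import permutation_importance

def permutation_roc_importance(fitted_model, X_eval, y_eval):
    """Permutation importance scored by the drop in ROC AUC (a sketch)."""
    result = permutation_importance(
        fitted_model, X_eval, y_eval,
        scoring="roc_auc", n_repeats=5, random_state=0,  # five repeats, as in the study
    )
    # Negative mean importances are treated as equivalent to zero importance.
    return result.importances_mean.clip(min=0)
```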

Finally, for NNET, the variable importance is assessed by the “weights” method (Gevrey et al. 2003). The method entails decomposing the hidden-layer connection weights of each output neuron into components associated with each input neuron.

In addition to the previously described methods for assessing variable importance, we acknowledge that, even though these methods are widely used, they might not capture complex interactions and dependencies among the variables in the model. To overcome these limitations, we have used the SHAP (SHapley Additive exPlanations) method (Lundberg and Lee 2017) as a complementary analysis. SHAP leverages game theory principles to identify the relative importance of each explanatory variable, considering all possible combinations of variables. This approach allows us to capture both linear and non-linear effects, as well as interactions between variables, providing a more accurate and interpretable measure of variable importance. SHAP values offer insights not only into the global importance of variables but also into how variables contribute to the prediction of each individual instance. The global importance can be obtained as the average of the absolute SHAP values for each explanatory variable. Moreover, the SHAP method is model-agnostic, as it can be applied to a wide range of models, including LR, RF, SVM, and NNET. This property enables us to assess variable importance consistently across different models, improving the comparability and generalizability of our results.
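A minimal model-agnostic sketch with the Python shap package is given below; the study uses the R package kernelshap (Sect. 3.6), and the fitted model and data frames here are assumed inputs.

```python
import numpy as np
import shap

def shap_global_importance(fitted_model, X_background, X_explain):
    """Kernel-SHAP values and global importance for one model (a sketch)."""
    def predict_extreme_prob(X):
        # Probability of the positive (extreme-day) class.
        return fitted_model.predict_proba(X)[:, 1]

    # Summarize the background data to keep kernel SHAP tractable.
    background = shap.sample(X_background, 100)
    explainer = shap.KernelExplainer(predict_extreme_prob, background)
    shap_values = explainer.shap_values(X_explain)     # (n_samples, n_features)
    # Global importance: mean absolute SHAP value per explanatory variable.
    return shap_values, np.abs(shap_values).mean(axis=0)
```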

3.6 Computation

The Principal Component Analysis was conducted with the Python package eofs (Dawson 2016). All the classification models were developed using the R package caret (Kuhn 2008). We used the R package vip (Greenwell and Boehmke 2020) to quantify the variable importance of RF and SVM, and the R package kernelshap (Mayer and Watson 2023) to compute the SHAP values. ROC AUC curve comparisons were conducted using the R package pROC (Robin et al. 2011).

4 Results

4.1 Resolving geopotential height into principal components

Figure 2 shows the eigenvalues for each PC. The scree curve flattens out after nine PCs, and the first nine PCs account for 90% of the total variability of Z500. Therefore, we regard the first nine PCs as adequate to describe the spatiotemporal variation of the geopotential height.

Fig. 2 Scree plot showing the variance explained by Principal Component Analysis on Z500, with the cumulative variance on the secondary y-axis

Figure 3 shows the spatial patterns associated with, and the proportion of the overall variance explained by, each of the PCs. The first two PCs, which collectively describe 45% of the variance, exhibit spatial structures that are very similar to well-known circulation patterns over Europe. PC1 corresponds to the summer North Atlantic Oscillation (SNAO) described by Folland et al. (2009). The SNAO is a dipole pattern with nodes of opposite polarity over central Scandinavia and over Greenland. The positive (negative) phase of the SNAO results in warm (cool) and dry (wet) summers in Scandinavia. PC2 corresponds to the East Atlantic pattern (EA), which, unlike the SNAO, exists throughout the entire year (Wulff et al. 2017). Here, one node is located west of England, and the other (of opposite polarity) over Eastern Europe and the Mediterranean.

Fig. 3 The first nine principal components of the daily anomalies of geopotential height at 500 hPa

4.2 Variable selection

For variable selection, we used the data from 1982 to 2020 to have a complete record of all explanatory variables identified in Sect. 2.2. We first investigate dependencies between the explanatory variables, and then we fit and train LR and ML models on different combinations of the explanatory variables. For SAT, the daily maximum (SATmax) gave the best results, and for SST, the previous day's maximum over the North Sea and Baltic Sea. If CAPE and SATmax are both in the model, their interaction is added to the LR models as an explanatory variable. It is important to note that during the variable selection process for the ML models, we found that the chosen explanatory variables are the same as those employed by LR. Therefore, this section presents the variable selection results for the LR model only. For the ML models, see the supplementary material.

It can be seen from Fig. 4 that all the VIFs except those of SATmax, SST, and TCW are less than two, which suggests a weak correlation between a given explanatory variable and the other variables in the model. To better understand why SATmax, SST, and TCW have higher VIFs, we calculated the matrix of correlations between all possible pairs of explanatory variables. The correlation matrix shows that SATmax is moderately correlated with SST and TCW, and SST is moderately correlated with TCW, which explains their higher VIF values. We also reproduce the positive correlation between PC1 and SATmax reported by Folland et al. (2009). PC2 is a pattern of strong northwesterly flow over Denmark at 500 hPa, and therefore the negative correlation between PC2 and both CAPE and TCW means that unstable and moist air comes with strong southeasterly flow, which is in accordance with domain knowledge (e.g., Solantie et al. 2006). Furthermore, PC3 has positive correlations with SATmax and TCW. This is not straightforward to explain, since PC3 represents a northeasterly flow with its maximum intensity north of Denmark.

Fig. 4 Correlation matrix of the explanatory variables (lower right corner) and their VIFs (left)

Table 2 shows the performance metrics for each explanatory variable independently and for the combinations of variables with the best performance. The Brier score is close to zero and almost the same for all combinations of explanatory variables, which means that the models are equally well-calibrated. In the ROC curve analysis, the mean AUC for all models except those based on SATmax or SST alone is above 0.70, which shows that models based on CAPE, PC Z500, or TCW have good discriminative ability. The Mann–Whitney U test also suggests skillful models with p < 0.05 (not shown). However, these single-variable models have FPR values around 0.29, which for an imbalanced dataset like the one used in our analysis means many false positives. This also shows up as a poorer PR curve in Fig. 5.

Table 2 Performance metrics for LR models with different (combinations of) explanatory variables for 1982–2020. Data are displayed as mean (Cross-validation values of the first quartile/third quartile)
Fig. 5 ROC curves (a) for the models with the highest ROC AUC and the corresponding PR curves (b) for the period 1982–2020. The dashed lines represent the baseline. The colored bands indicate the 95% CI estimated with stratified bootstrapping

The model combining CAPE, PC Z500, and TCW as explanatory variables outperforms the single-variable models. Adding SATmax as an explanatory variable increases the PR AUC only a little, but because its interaction with CAPE is statistically significant (Table 3), we choose to include this variable in the model. On the other hand, adding SST provides no additional performance improvement. It can be seen from Fig. 5 that the models “PC Z500 & CAPE & SATmax & TCW” and “PC Z500 & CAPE & SATmax & TCW & SST” have almost the same PR AUC (the highest). Therefore, PC Z500, CAPE, SATmax, and TCW are considered the most important explanatory variables, and we use them in the further evaluation.

Table 3 The LR odds ratios and the corresponding confidence intervals in parenthesis for all data, and thinned to every third and every fifth day, respectively

Following the model evaluation procedures of Sect. 3.2.1, CAPE, TCW, PC1, PC3, PC4, PC6, and the interaction of CAPE and SATmax are found to be statistically significant explanatory variables for the models trained with all data and with the thinned data (Table 3). We can conclude that for these explanatory variables, the serial correlation of the residuals does not affect the confidence intervals to the extent that inference results are at risk of being misinterpreted. The variables CAPE and TCW stand out as having a positive influence on the probability of extreme rainfall; the odds ratios of SATmax are larger than unity, meaning that high SATmax favors extreme precipitation, although this odds ratio is not robust across the subsampled datasets. These findings are in accordance with the arguments we gave in Sect. 2.2 for including these explanatory variables.

The interaction of CAPE and SATmax seems to have a (slightly) negative impact on the probability of an extreme event. In this case, a significant odds ratio of 0.99 for the interaction term suggests that the effect of CAPE on the probability of an extreme event becomes weaker as SATmax increases. Therefore, even though the odds ratios of CAPE and SATmax are both larger than unity, the interaction term decreases the overall effect of CAPE and SATmax on the probability of an extreme event. The interaction term is best understood by examining how the odds ratio of CAPE in the LR changes as the values of SATmax change (Fig. 6).

Fig. 6 CAPE odds ratio as a function of SATmax. The solid line is the central estimate; dashed lines are the 95% confidence interval calculated using the Delta Method (Oehlert 1992). The trend indicates that the overall effect of CAPE on extremes decreases with increasing values of SATmax

Turning to the PCs of Z500, the odds ratio of PC1 is smaller than unity and the odds ratio of PC3 is larger than unity; by comparing with Fig. 3, we can interpret this as easterly flows increasing the probability of an extreme precipitation event. Similarly, the odds ratio of PC4 being below unity can be interpreted as southeasterly flow increasing the probability of extreme precipitation events. This is in agreement with local experience.

PC6 has a pattern that implies a stagnant atmosphere over Denmark, so it is surprising that it has a significant odds ratio. That said, the exact shape of higher-order PCs is usually sensitive to the data. This effect is not included in the uncertainty intervals in Table 3, which, therefore, may be underestimated.

4.3 Variable importance

Regarding hyperparameter tuning for RF, an ensemble of 700 trees was found to be sufficient for achieving good performance, with negligible benefits for larger forests. The optimal hyperparameters for each ML algorithm can be found in the supplementary material.

CAPE and TCW obtained the highest importance across all models, with CAPE ranked first by RF and NNET, and TCW by LR and SVM. The explanatory variables PC1 and PC6 were consistently rated among the top six variables by the models. One difference was that PC2 was among the top three variables in the NNET model, but it made only a minor contribution in LR, RF, and SVM. The most notable difference was observed for SATmax, which played no role in LR and a minor role in the RF model, while it was the fourth most important explanatory variable for predicting extreme events in the NNET and SVM models (Fig. 7).

Fig. 7 Importance of explanatory variables in predicting extreme precipitation events: the Type II chi-square deviance test statistic for LR, permutation-based performance decrease for RF and SVM, and Gevrey importance for NNET. The dashed line marks the statistical significance threshold (p values ≤ 0.05). If a particular explanatory variable is absent from the plot, it was considered unimportant by the model

Regarding the SHAP values (Fig. 8), CAPE and TCW also obtained the highest importance across all models, with CAPE again ranked first by RF and NNET, and TCW by LR and SVM. PC1 was among the top four variables for all models. SATmax has a larger importance for NNET and SVM than for LR and RF, which is in accordance with Fig. 7. In Fig. 8, SATmax is ranked higher for all models except NNET compared to its ranking in Fig. 7. This difference suggests that SATmax is more important in predicting extreme precipitation events when considering the SHAP values. Figure 8 (right) shows the beeswarm plots, which not only order the explanatory variables by their effect on the prediction but also provide insights into how higher and lower values of each variable affect the outcome. Each dot represents an observation. It can be seen that for all models, higher values of CAPE and TCW have a positive effect on the prediction, while higher values of PC1 have a negative effect. In general, the results for all models in Fig. 8 (right) agree with the LR model coefficients of the statistically significant explanatory variables that are robust across the subsampled datasets (Table 3).

Fig. 8 Left: variable importance, evaluated using the mean absolute SHAP values, for all models. Right: beeswarm plots of the SHAP values of the explanatory variables for all models. Variables are sorted by their mean absolute SHAP value in descending order, with the most important variables at the top. Each dot represents an observation in the study

The SHAP interaction values for CAPE and SATmax across all models demonstrate that the importance of CAPE on extreme events decreases as SATmax increases (see supplementary material). This observation is consistent with the findings presented in Fig. 6 where the relationship between the interaction and extreme events was specifically examined within the LR model.

4.4 LR versus ML

In contrast to LR and NNET, the ROC AUC of RF and SVM decreased when applied to the test set (Table 4) compared to the training set, but all models still showed high performance (ROC AUC greater than 0.80). In the test dataset, the algorithms with the greatest ROC AUC were RF and NNET (AUC = 0.87) (Fig. 9), followed by LR (AUC = 0.86) and SVM (AUC = 0.80). However, the DeLong test results in Table 4 indicated significant differences only between the ROC AUC of SVM and that of the other models. For accuracy at the optimal threshold, LR and RF were the best-performing algorithms for predicting extreme events (ACC = 0.75 and 0.73), and they had the smallest FPR (FPR = 0.26 and 0.28), but NNET had the highest TPR (TPR = 0.90, versus 0.87 and 0.83 for RF and LR). RF had the best area under the PR curve (PR AUC = 0.39), followed by LR (PR AUC = 0.38) and NNET (PR AUC = 0.37). The Brier score did not show any difference among the models. SVM had the worst performance in terms of all metrics.

Table 4 ROC AUC (95% CI) performance comparison of the four models applied to the training and test sets. 95% CI is computed with 2000 stratified bootstrap replicates
Fig. 9 Performance metrics of four models predicting the outcomes of extreme precipitation events

When non-linear effects were incorporated into LR via restricted cubic splines, the model overfitted more, and the performance did not increase compared to traditional LR (see supplementary material).

Figure 10 highlights that the best-performing models give similar predictions. The LR, RF, and NNET predictions are strongly correlated (LR-RF, 0.84; LR-NNET, 0.92; RF-NNET, 0.92). On the other hand, the correlations of the LR, RF, and NNET predictions with those of SVM are weak. The predictions for the observed extreme events alone follow the same pattern: high correlations among LR, RF, and NNET, and low correlations with SVM. The distribution of predicted probabilities for non-extreme events is, as expected, concentrated at low probabilities in all models. In contrast, the distribution of predicted probabilities for extreme event occurrences deviates from the ideal case of a distribution concentrated at high probabilities.

Fig. 10 Correlation coefficients for all data, extremes, and non-extremes, and scatter plots of the probabilistic model predictions between the different models. On the diagonal are the density plots of probabilistic predictions of extremes/non-extremes for each model

5 Discussion

5.1 Physical interpretation and importance of the explanatory variables

We found that CAPE, TCW, some PCs of Z500, and the interaction term of CAPE and SATmax are significant variables for explaining the occurrence of extreme precipitation events in the Copenhagen area. As deduced theoretically, CAPE and TCW are the most important variables. Larger values of CAPE mean a larger amount of potential energy to be released by convection, and we expect this to have a positive effect on the probability of an event. Our analysis confirms this, since the confidence interval of the odds ratio of CAPE is entirely above unity. TCW likewise has a confidence interval for its odds ratio that is entirely above unity, as expected, since a larger TCW should mean a larger probability of extreme precipitation. The fact that these conclusions for CAPE and TCW are stable under subsampling, and that these explanatory variables were found to be the most important by both established and more modern variable importance metrics for the LR and ML models, underlines the robustness of this conclusion. The ranks of these explanatory variables may differ (e.g., CAPE ranks first for NNET and RF, and second for LR and SVM) because each algorithm has a different approach to modeling the relationship between the explanatory variables and the response.

Furthermore, some of the Z500 PCs have odds ratios different from unity. A closer examination of these reveals that easterly and southeasterly flow favors extreme events, even on days where values of CAPE and TCW are low. We interpret this as already active convective systems being advected into the area. This is, however, a less dominant mechanism, since the PCs have much lower variable importance than CAPE and TCW.

We regarded SST as an explanatory variable for extreme precipitation events. However, after incorporating TCW, we found that SST had no significant impact. This is probably because SST variability is too small to influence the formation of extreme precipitation events, and because SST and TCW are correlated. On the other hand, SATmax improved the PR AUC slightly, but the different models disagree on its importance.

The odds ratio for the interaction term of CAPE and SATmax is less than unity, which is surprising. We would have expected that the combination of large CAPE and high SATmax would favor the release of extreme events, but this conflicts with our results. The fact that the ML models learn the same relationship consolidates this finding.

It is interesting to note that the most important explanatory variables (CAPE, TCW, and the Z500 PCs) all describe the state of the free atmosphere over the region without regard to the detailed geography. This suggests that the models presented here, and the conclusions about their comparison, can be applied to other regions (with other fitted parameter values). The physical arguments given earlier for choosing these as potential explanatory variables also support this view. Given the limitations in the variable selection, it is possible to argue that expanding the set of variables used in the ML models could yield different results in terms of variable importance. Including more variables may uncover additional interactions that were not captured by the original subset of variables.

The model with PC Z500, CAPE, SATmax, and TCW as explanatory variables had high scores in all performance metrics except PR AUC. The reason for this is the high FPR values. Although we have a dense network of rain gauges in this case study, it is still possible to miss events that actually occurred. For the Copenhagen case, we would, e.g., miss any cloudburst that hits just off the coast over the sea. This fact could contribute to the high FPR values. To validate this hypothesis, we conducted the analysis using data from just one gauge (the gauge with the least missing data). The results showed (see supplementary material) that the PR AUC is significantly lower even though the ROC AUC is comparable with the full-network performance. These findings support our claim that the high FPR values are partly attributable to missed events. Future work should therefore focus on augmenting the database by testing the sensitivity towards defining a region, e.g., by using fewer gauges and/or using bias-corrected weather radar data.

5.2 LR versus ML

We hypothesized that ML would be superior to traditional LR in predicting extreme events. Our hypothesis was partially supported in that LR performed worse than RF and NNET but better than SVM. Differences in AUC between LR and the best-performing algorithms were not statistically significant, indicating only minor differences in overall explanatory performance across the models. Moreover, LR and NNET did not exhibit overfitting, in contrast to RF and SVM. LR models might be more beneficial in this study than ML methods due to their transparency and interpretability. LR models also have a solid theoretical background, which allows for the use of well-defined statistical tests to assess the statistical significance of the explanatory variables.

Our findings may have implications for the design of future ML applications. Despite the simple model, the data-driven ML methods yielded only a small performance benefit compared to the LR model. Finally, the difficulty of interpreting ML models is a barrier to their use in extreme event prediction. Future research could extract information from the black box to create interpretable but accurate statistical models. Each dataset is unique, and there is no ‘free lunch' (Wolpert and Macready 1997), i.e., no single algorithm performs best across all classes of problems. Evaluating multiple algorithms against LR is critical to see if one outperforms the others; if performance is comparable, the simplest and most interpretable model should be employed.

5.3 Strengths and limitations

A systematic comparison between LR and several machine learning algorithms was conducted with a focus on their suitability in our setting of studying extreme precipitation events. Performance metrics specifically designed for unbalanced data were employed, and the ML models were optimized through a grid search approach. An independent dataset was utilized as the test dataset to enhance the validity of the findings, and variable importance metrics for all models were also employed to complement one another.

Extreme precipitation events are rare occurrences by nature, resulting in a limited amount of data for training and evaluating models. The presented case area has an excellent climatological data set of 15 rain gauges within a relatively small geographical area and 42 years of recordings, yet the total number of observed extreme events that we train the models on is just 557. The lack of comprehensive, long-term observations is a general problem when studying the climatology of extreme rainfall, and relevant methods have to deal with this. The scarcity of extreme events restricts the number of variables that can reasonably be included in the analysis and the ability to employ very complex models, due to identifiability issues and an increased risk of overfitting (Kuhn and Johnson 2013). In this study, we have therefore considered it important to leverage domain expertise for selecting and engineering features that capture various aspects of extreme event behavior.

Regarding variable selection, this study's approach may favor the LR model to some extent, as the process of selecting variables for the ML models aligns with that of the LR model. This makes the model performance comparison straightforward and allows us to investigate how well LR and ML models are able to utilize information based on a mix of data and domain expertise. However, this may also be considered a limitation of our study, since one of the most significant advantages of ML over LR is its ability to model complex, non-linear relationships between several explanatory variables and outcomes. The complexity of the relationship between the explanatory variables and the outcome should not be too low for ML to provide a meaningful advantage over LR, which is likely one of the main reasons why the ML models are not able to outperform LR here. We hypothesize that this is likely to hold for other similar studies due to the lack of comprehensive observational records. For ML methods, it has been stated as a rule of thumb that almost ten times as many events per variable are required to obtain consistent results compared to traditional statistical modeling (van der Ploeg et al. 2014). Future research could delve more into how far we can push the number of features in different types of ML models without running into identifiability and over-fitting issues in studies of extreme precipitation events. It would also be interesting to see if ML models with more variables allow for physically interpretable insights that enhance our understanding of the climate system, such as drawing clear conclusions on variable importance. This introduces the challenge of deciding which variable selection method to employ for the ML models, potentially leading to variations in the selected variables and confounding a fair comparison across models.

Additionally, it is important to note that the neural network model used in this study is not particularly complex, as the cross-validation process limited the size of the hidden layer to only two nodes. This could reduce the ability of the neural network to capture more complex relations in the data, but increasing the depth of the network also leads to a lack of identifiability and over-fitting on small data sets. Furthermore, LR depends on the fulfillment of specific assumptions (i.e., observation independence, no multicollinearity), and violating these may impact the quality of the analysis. The results of this study imply that any assumption violations with respect to LR do not significantly influence the performance quality, because LR performed quite similarly to ML. Since non-linear effects were tested for LR via cubic splines without increasing the performance, any remaining difference in performance is likely due to the lack of inclusion of variable interactions, which have to be specified manually by the user in LR, while the ML models capture them automatically.

Lastly, CAPE, Z500, and TCW are simulated reanalysis outputs, which come with their own uncertainties and errors. One source of uncertainty is the data assimilation method used, since different methods can yield different results; the choice of method can therefore impact the accuracy of the reanalysis data. Another source of uncertainty is the quality and consistency of the assimilated observations. Older observations might not be as accurate as current ones because observational methods have changed, which can affect the quality of the reanalysis data, particularly for earlier periods. Finally, the numerical models used to produce the reanalysis data, including their spatial and temporal resolution, also have their own uncertainties.

6 Conclusion

In conclusion, the logistic regression framework was found to be an effective tool for modeling the occurrence of extreme precipitation events using meteorological drivers as explanatory variables. Considerable effort was put into improving model performance, generalization, and interpretation. The results showed that CAPE and TCW were the most important explanatory variables, while some of the PCs of Z500 and the interaction term of CAPE and SATmax were also significant. Our analysis confirmed the relationship between larger values of CAPE and TCW and a higher probability of extreme precipitation. The results also indicated that certain flow directions are favorable for extreme events. SST had no significant impact, while SATmax improved the PR AUC slightly. The odds ratio of the interaction term of CAPE and SATmax was found to be less than unity, which was surprising and requires further investigation. The model with PC Z500, CAPE, SATmax, and TCW as explanatory variables had high scores in most performance metrics, but high FPR values indicate a need to augment the precipitation data in future studies.

A classic LR performs similarly to more complex ML algorithms in a classification setting with four explanatory variables. This study demonstrates the value of comparing standard regression modeling to ML, particularly when a small number of well-understood, strong explanatory variables are used. Given the increasing availability of data technologies, ML may play a more significant role in prediction in the future. However, we still recommend caution regarding the optimism around ML, since its benefits depend on various criteria such as sample size, the number of explanatory variables, and the complexity of the interactions between the explanatory variables, all of which are significant constraints when working with rare extreme events and limited observational records.

In this study, a simple dataset gives reliable information on what circumstances lead to hourly precipitation extremes. Application to other locations, to test the transferability of both the model structure and the fitted models, remains to be done, but early results are promising. This approach could be useful in data-sparse regions or for predicting the impacts of climate change where the physical understanding of convective rainfall is limited.