1 Introduction

Global climate change is increasingly affecting rainfall intensity, making extreme rainfall events more frequent. In addition, urbanization has a significant impact on the rainfall-runoff relationship (Ferguson and Suckling 1990; Aronica et al. 2012). Floods are the most common natural hazard-related disaster in the world (Berz 2000) with severe social and economic damages. Floods are more severe in coastal cities with developed economies that are affected by rainfall and tide level. With climate change and sea level rise, the risk of coastal floods continues to increase. Because many coastal cities are low-lying and have poor drainage systems for rainwater, they are vulnerable to flooding. For example, roughly 4354 people were missing or died and about USD 70.6 billion in economic losses were incurred during coastal flood disasters in China between 1989 and 2014 (Fang et al. 2017). Rapid flood prediction gives urban decision makers sufficient time to take flood prevention measures to mitigate social and economic losses (Zevenbergen et al. 2008).

In the last few decades, the study of prediction models for urban floods has made significant progress, and the models can be divided into physical models and data-driven models. The physical models are mainly hydrological–hydraulic models, with relatively mature technical development and high computational accuracy. In recent years, with the development of numerical simulation technology, hydrological–hydraulic models have been widely used in urban flood simulation (Bates et al. 2010; Frank et al. 2011; Yamazaki et al. 2011). Sidek et al. (2021) constructed an urban flood simulation model based on InfoWorks ICM, an advanced integrated catchment modeling software, which could effectively predict the depth and extent of floods. Wu et al. (2017) developed a two-dimensional flood inundation model to simulate dynamic urban flooding by coupling a storm water management model (SWMM) and LISFLOOD-FP.Footnote 1 It reveals the evolution of urban flooding under different storms, sea level rise, and ground subsidence scenarios in the future. Thorndahl et al. (2016) built an integrated flood and drainage system model based on MIKE FLOOD.Footnote 2 This model is calculated using only a two-dimensional surface model, and generates detailed flood maps with high resolution radar rainfall data. However, the complexity of hydrological–hydraulic models and longer computational time limit the development of models (Liu et al. 2015; Huang et al. 2017; Zanchetta and Coulibaly 2020). Particularly for study areas with a large spatial extent or high spatial resolution of data, the computational time of hydrological–hydraulic models will be much longer (Bermúdez et al. 2018). For flood simulation studies in urban areas, a large spatial extent and high spatial resolution are required. Even the fastest physical model simulation methods still consume a large computational resource, and it is difficult to meet the demand for real-time flood prediction (Bhola et al. 2018). Moreover, hydrological–hydraulic models have high requirements for flood observed data and need to calibrate a large number of parameters to describe flooding processes accurately. The significant lack of flood observed data has affected its generalization and application.

Data-driven models are also commonly used in the field of urban flood prediction. These models identify patterns between inputs and outputs by learning based on a large amount of data. Such models can give model outputs directly based on model inputs, thus achieving rapid flood prediction. Ma et al. (2021) generated risk maps of flash floods by proposing a model for flash flood risk assessment based on extreme gradient boosting (XGBoost) at the county level in Yunnan Province, China. Wu et al. (2020) constructed a fast prediction model for urban floods based on the gradient boosted decision trees (GBDT) algorithm to predict the maximum water depth of urban inundation points. Sadler et al. (2018) proposed to train a random forest (RF) algorithm by using large amounts of environmental data to assess the severity of flooding in coastal cities. In these studies, a large amount of flooding data is required for data-driven models to construct effective flood prediction models. But in many regions, data-driven models are limited in application due to the scarcity of measured data.

One feasible solution is to couple a physical model with a data-driven model (Reichstein et al. 2019). The physical model is used to simulate lots of flooding data to train the data-driven model. For example, Kabir et al. (2020) combined LISFLOOD-FP and a convolutional neural network (CNN) for fast prediction of water depth of river floods. Berkhahn et al. (2019) constructed a rapid urban flood prediction model based on integrated neural networks using the HYSTEM-EXTRANFootnote 3 2D model to generate a database of flood maps. Löwe et al. (2021) proposed a method that integrated MIKE FLOOD and CNN to predict water depth of urban river floods. However, the pre-sort-based algorithms, level-wise growth tree strategy, and one-hot encoding used by boosting tools in the above studies resulted in larger memory usage and slower training speed. The light gradient boosting machine (LightGBM) is a machine learning framework based on GBDT originally proposed by Microsoft in 2017. The histogram-based algorithm used by the LightGBM model reduces memory usage and speeds up training. The leaf-wise growth tree strategy used by the LightGBM model enables the model to have higher accuracy. Moreover, the LightGBM model supports input of categorical features directly, which simplifies data preprocessing. Because of these merits, the LightGBM model is extremely suitable as a machine learning algorithm for rapid simulation of urban flooding. It has been widely applied in many fields such as rainfall correction (Lee et al. 2019) and forest fire analysis (Nguyen et al. 2022), but there are few studies related to urban floods (Cui et al. 2021).

The objectives of this study are threefold: (1) An approach that couples the LightGBM model and hydrological–hydraulic model is proposed for rapid prediction of flood inundation, where the hyperparameters of the LightGBM model are optimized by grid search algorithm and K-fold cross-validation. (2) The performance of the LightGBM model is compared with the XGBoost, RF, and k-nearest neighbor (KNN) models. (3) The importance of rainfall, tide level, and other input variables for coastal urban flooding is quantified based on the LightGBM model and the Gini index. Section 2 presents the basic methodological theory used in the study. Section 3 describes the study area and data. Section 4 describes the rapid prediction model for urban floods based on the LightGBM model. A brief conclusion and potential improvements in the future are provided in Sect. 5.

2 Methods

This section summarizes the study framework of the rapid prediction model for urban flooding, introduces the theory of one- and two-dimensional coupled hydrological–hydraulic model, light gradient boosting machine, K-fold cross-validation, and grid search algorithm, and selects four model accuracy evaluation metrics for this study.

2.1 Study Framework

In this study, we adopted the LightGBM model coupled with a physical model to construct a rapid prediction model for urban floods. The flowchart of the proposed prediction model is shown in Fig. 1. First, we designed 49 combined scenarios of rainfall-tide levels. A one- and two-dimensional coupled hydrological–hydraulic model based on the personal computer storm water management model (PCSWMM) was constructed to simulate the inundation distribution of flooding in the study area. Next, we extracted 10 variables from the rainfall series and tide level series and used them together with the location of flood-prone points as input variables for the machine learning model. A LightGBM model optimized by K-fold cross-validation and grid search algorithm was proposed to predict the maximum water depth at flooded points in urban floods. Then the constructed LightGBM model and other benchmark models were evaluated and compared on the test set based on the evaluation metrics. Finally, we determined the relative importance of the variables that cause urban flooding by calculating the Gini index.

Fig. 1
figure 1

Flowchart of the proposed flood prediction model. DEM Digital elevation model, PCSWMM Personal computer storm water management model, LightGBM Light gradient boosting machine, RF Random forest, XGBoost Extreme gradient boosting, KNN K-nearest neighbour, RMSE Root mean square error, MAPE Mean absolute percentage error, SSE Sum of squared errors, NSE Nash-Sutcliffe efficiency coefficient

2.2 One- and Two-Dimensional Coupled Hydrological–Hydraulic Model Based on the Personal Computer Storm Water Management Model (PCSWMM)

The PCSWMM is a hydrological–hydraulic model developed by the Computational Hydraulic Institute (CHI) of Canada based on the storm water management model (SWMM), which is widely used in studies of drainage networks and management of urban storm water (Ahiablame and Shakya 2016; Wu and Huang 2016). We constructed a flood simulation model based on the PCSWMM to provide data for the machine learning model.

The calculations of the PCSWMM include surface runoff, flow concentration, flow routing, and 1D-2D coupled simulation. The surface runoff is calculated by the Manning formula:

$$Q=\frac{W}{n}{\left(d-{d}_{p}\right)}^{5/3}\cdot {S}^{1/2}$$
(1)

Where Q is the outflow rate (m3/s); W is the width of the subcatchment (m); n is the surface Manning factor; d is the water depth of the reservoir (m); dp indicates the maximum depression depth (m); and S is the average slope of the subcatchment.

The nonlinear reservoir method is used for calculating flow concentration. It is calculated by associating the continuity equation and the Manning formula, where the continuity equation is:

$$\frac{{\text{d}}V}{{\text{d}}t}=A\frac{{\text{d}}d}{{\text{d}}t}=A{i}^{*}-Q$$
(2)

Where V is the total volume of water in the catchment (m3); d is the water depth (m); t is the time (s); A is the area of the subcatchment (m2); i* is the net rainfall intensity (m/s); and Q is the surface runoff (m3/s).

The flow routing is calculated using the completely dynamic wave method, and the continuity (Eq. 3) and momentum (Eq. 4) are:

$$\frac{\partial Q}{\partial x}+\frac{\partial A}{\partial t}=0$$
(3)
$$gA\frac{\partial H}{\partial x}+\frac{\partial \left({Q}^{2}/A\right)}{\partial x}+\frac{\partial Q}{\partial t}+gA{S}_{f}=0$$
(4)

Where Q is the flow (m3/s); x is the distance (m); A is the cross-sectional area (m2); t is the time (s); H is the water depth (m); G is the acceleration of gravity (m/s2); and Sf is the friction slope.

2.3 The Light Gradient Boosting Machine (LightGBM) Model for Urban Flooding Prediction

The light gradient boosting machine is a machine learning framework based on GBDT. Compared with other tree-based learning algorithms, the LightGBM model has significant improvements in training speed, computational efficiency, and prediction accuracy. The model is particularly suitable for regression, classification, sorting, and other machine learning tasks (Ke et al. 2017).

The LightGBM model uses a histogram-based algorithm (Fig. 2a) for decision tree learning (Ranka and Singh 1998; Li et al. 2007), which buckets the continuous feature values into k bins, and stores them in a histogram of width k. This has great advantages in terms of training speed and memory usage.

Fig. 2
figure 2

The features of light gradient boosting machine. a Histogram-based decision tree algorithm, which buckets continuous feature values into discrete bins; b Level-wise tree growth chooses all leaves of the level to grow (i), and leaf-wise tree growth chooses the leaf with max delta loss to grow (ii); c 10-fold cross-validation splits the data into k subsets randomly, k-1 subsets are used for training, leaving one subset as the test set; d Grid search algorithm traverses all combinations of hyperparameters within a given search range, and determines the optimal hyperparameters set

On the basis of the histogram algorithm, the LightGBM model is further optimized by changing the strategy of tree growth. Unlike most tree-based learning algorithms that use a level-wise tree growth strategy, the LightGBM model grows trees leaf-wise with a depth limit. Figure 2b (i) shows the schematic diagram of level-wise tree growth. The level-wise tree growth strategy can split the leaves of the same layer at the same time as the decision tree grows, but the gain of information obtained from splitting the leaves at a specific layer is different from the gain of information obtained from splitting the other leaves at the same layer (Liang et al. 2020). The processing of leaves with low gain of information provides little improvement in prediction accuracy, and increases memory consumption. Figure 2b (ii) illustrates the leaf-wise growth strategy. The leaf-wise growth tree splits only the leaf with the greatest gain of information, and leaf-wise growth strategy tend to achieve lower loss than level-wise growth strategy. But it may lead to overfitting, so we need to limit tree depth (Cui et al. 2021).

In addition, the LightGBM model supports the input of category features directly. Most other machine learning algorithms require one-hot encoding to represent categorical features. This is not optimal for tree-based learning algorithms. It causes the data to become sparse and reduces the computational efficiency.

2.4 K-Fold Cross-Validation and Grid Search Algorithms

The goodness of fit of the training set and the prediction accuracy of the test set for a given machine learning algorithm are mainly determined by its hyperparameters. For example, the learning rate, the number of estimators, and the number of leaves in the LightGBM model all affect the reliability of the model directly. To maximize the performance of the LightGBM model, the hyperparameters of the LightGBM model are optimized using K-fold cross-validation and grid search algorithms based on the open source machine learning framework named scikit-learn (Varoquaux et al. 2015).

K-fold cross-validation is a type of non-exhaustive cross-validation (Stone 1974). It can be used effectively to evaluate the generalization ability of a machine learning model. K-fold cross-validation divides the data set into k subsets randomly and evenly as shown in Fig. 2c. For each training and testing of the model, k-1 subsets are used for training the model, leaving one subset as the test set and the training and testing are repeated k times to let each subset be used as a test set for one time. Finally, the results of k tests are averaged as the model evaluation value. In general, k is assigned to 10 (McLachlan et al. 2004), which means 10-fold cross-validation is used as the evaluation method of performance during model training.

Grid search is an exhaustive algorithm widely used for hyperparameter optimization of machine learning models (Wang et al. 2013; Pontes et al. 2016; Ogunleye and Wang 2019). Grid search can traverse different combinations of hyperparameters within a given search range, as shown in Fig. 2d. The basic idea of the grid search strategy is to find the optimal set from the first dimension to the n-th dimension, respectively. When searching for the optimal set from the i-th dimension, the values of other dimensions are kept constant. Then the performance of models under different sets of hyperparameters is compared by the K-fold cross-validation approach, and thus determines the most suitable model hyperparameters for the current training data (Liu et al. 2014).

2.5 Model Accuracy Evaluation Index

A set of evaluation metrics of model accuracy is required to assess the performance of a model at each phase of model construction. In this study, we used root mean square error (RMSE), mean absolute percentage error (MAPE), sum of squared errors (SSE), and Nash-Sutcliffe efficiency coefficient (NSE) to evaluate the performance of the model.

The RMSE is the standard deviation of the residuals, as shown in Eq. 5. A smaller RMSE value indicates better model quality. The MAPE expresses the prediction accuracy as a ratio defined by Eq. 6. The ratio is the average rate of deviation of predicted values from observed values. A lower MAPE value represents higher predictive accuracy of the model. The SSE is the sum of squared errors between the predicted and observed values, as shown in Eq. 7. The closer SSE is to 0, the better the model prediction is. The NSE (Nash and Sutcliffe 1970) is commonly used to evaluate the simulation results of hydrological models, as shown in Eq. 8. The model with NSE value closer to 1 has better predictive ability than the model with NSE value closer to 0.

$$RMSE=\sqrt{\frac{1}{n}{\sum }_{i=1}^{n}{\left({H}_{a}^{i}-{H}_{f}^{i}\right)}^{2}}$$
(5)
$$MAPE=\frac{100\mathrm{\%}}{n}{\sum }_{i=1}^{n}\left|\frac{{H}_{a}^{i}-{H}_{f}^{i}}{{H}_{a}^{i}}\right|$$
(6)
$$SSE={\sum }_{i=1}^{n}{\left({H}_{a}^{i}-{H}_{\rho }^{i}\right)}^{2}$$
(7)
$$NSE=1-\frac{{\sum }_{i=1}^{n}{\left({H}_{a}^{i}-{H}_{f}^{i}\right)}^{2}}{{\sum }_{i=1}^{n}{\left({H}_{a}^{i}-\overline{{H}_{a}}\right)}^{2}}$$
(8)

Where \({H}_{a}^{i}\) and \({H}_{f}^{i}\) are the real and predicted values of the i-th sample, respectively; \(\overline{{H}_{a}}\) is the average value of the observed data; and n is the number of examples.

3 Study Area and Data

This section provides a detailed description of the study area and the required data.

3.1 Study Area

Haikou, the capital city of Hainan Province, is located at the northernmost corner of Hainan Island (Fig. 3). It is separated from Guangdong Province by the Qiongzhou Strait, and is located between \({19}^{\circ }{31}^{\mathrm{^{\prime}}}{32}^{\mathrm{^{\prime}}\mathrm{^{\prime}}}-{20}^{\circ }{04}^{\mathrm{^{\prime}}}{52}^{\mathrm{^{\prime}}\mathrm{^{\prime}}}\) North latitude and \({110}^{\circ }{07}^{\mathrm{^{\prime}}}{22}^{\mathrm{^{\prime}}\mathrm{^{\prime}}}-{110}^{\circ }{42}^{\mathrm{^{\prime}}}{32}^{\mathrm{^{\prime}}\mathrm{^{\prime}}}\) East longitude. The city is the economic and cultural center of Hainan Province.

Fig. 3
figure 3

The Haidian Island study area on Hainan Island, China, and the location of 10 flood-prone points

Haikou is in the tropical climate zone. It has a tropical seasonal climate with high average temperatures throughout the year. Rainfall is plentiful due to frequent typhoons. Precipitation mainly occurs in May to October, and the average annual precipitation is 1827 mm. The dense river and water systems in Haikou have a key function for the flood control of the city.

Haidian Island is separated from the main city by two rivers (see Fig. 3) and is the largest island in Haikou with an area of about 14 km2. The urban drainage system and river channels together form the flood control and drainage system of Haidian Island. The flat topography and the jacking effect of seawater on the drainage outlet of the island result in poor drainage capacity, thus the island is susceptible to seawater backflow.

3.2 Data

We selected Haidian Island as the case study area and constructed a 1D-2D coupled flood simulation model based on the PCSWMM. The required data include the digital elevation model (DEM), drainage network, 1D inspection wells, subcatchment area, and water blocking obstructions. The DEM (Fig. 4a) data were obtained from the elevation data of Haidian Island provided by the Haikou Municipal Water Bureau. Drainage network and 1D inspection wells information were also provided by the Haikou Municipal Water Bureau. The data were preprocessed using ArcGIS software to get the 1D inspection wells and pipe network in the study area (Fig. 4b). According to the river system in the study area, each river section is generalized into different channels. The study area was divided into 13 subcatchment areas, as shown in Fig. 4c, by combining the topographic and geomorphic features, drainage pipes, and roads layout. The water blocking obstacles are mainly buildings in the urban area. Obstacles in the study area were identified by identifying buildings on satellite images of urban areas (Fig. 4d). We used the observed water depth data to calibrate the 1D-2D coupled flood simulation model. The observed water depth data were collected partly using a water level ruler through field investigation and partly provided by the Haikou Municipal Water Bureau during Typhoon Rammasun in July 2014.

Fig. 4
figure 4

Key data for the construction of the 1D-2D coupled flood simulation model based on the personal computer storm water management model (PCSWMM) on Hainan Island, China. a Elevation; b Distribution of 1D inspection wells and pipelines; c Schematic diagram of subcatchment division; d Distribution of water blocking obstacles (mainly buildings)

After constructing the flood simulation model using the above data, we needed to input the boundary conditions into the model to simulate the data of flood distribution. The generalized extreme value (GEV) function was adopted to fit the distributions of rainfall and tide level (Xu et al. 2019; Xu et al. 2022). The parameters of the GEV function were estimated by the maximum likelihood method. The rainfall distribution on Haidian Island was described by the GEV function with location parameter ξ = 122.51, scale parameter α = 46.78, and shape parameter k = 0.115. The tide level distribution was also described by the GEV function with location parameter α = 2.349, scale parameter α = 0.380, and shape parameter k = − 0.048. Then, the design rainfall and tide peaks for seven return periods from 5 years to 100 years were calculated by the GEV function (Table 1). The rainfall event on 18 July 2014 was taken as a typical rainfall and together with the tide level process on the same day, it created the most severe flooding scenario (Fig. 5). Finally, the typical rainfall and tide level processes are scaled up by the same magnitude factors using the same ratio method since it is easy to calculate and can maintain the shape of the typical rainfall and tide level processes, and the magnitude factors were determined by the rainfall and tide peak. The magnification ratios obtained by this method are shown in Table 1. Thus, we obtained the temporal distribution of the designed rainfall (Fig. 6a) and tide level (Fig. 6b) for different return periods. The duration of both rainfall and tide levels were selected to be 24 hours, with a temporal resolution of 1 hour. We combined rainfall and tide level data to obtain 49 scenarios as boundary conditions for the flood simulation model.

Table 1 Magnification ratios obtained by the same ratio method
Fig. 5
figure 5

Rainfall and tide level on Haidian Island at the time of typhoon Rammasun on 18 July 2014

Fig. 6
figure 6

Sequences of rainfall and tide levels for seven return periods on Haidian Island

4 Results

This section demonstrates the results of this study, which include the urban flooding model by PCSWMM, the rapid prediction model of urban floods based on LightGBM, comparison of results between LightGBM and other machine learning models, and the variable importance.

4.1 Establishment of an Urban Flooding Model by Using the Personal Computer Storm Water Management Model (PCSWMM)

To build a 1D-2D coupled flood simulation model, the 1D drainage model with 2,667 inspection wells and 2,042 conduits was first constructed. Then we constructed the 2D model of surface inundation based on the 2D nodes, 2D grids, and DEM. Finally, we performed the coupling of the 1D-2D models by connecting the 1D drainage model and the 2D surface inundation model using the bottom orifice connection. A grid type with a hexagonal shape was used for the model. Its spatial resolution is 25 m. The dynamic wave method was adopted to simulate the flow routing, and the Horton model was selected for the infiltration calculation.

To ensure the accuracy of the constructed model, we selected the measured rainfall and tide level series during Typhoon Rammasun in July 2014 (see Fig. 5) to calibrate the model. The location distribution of observed flooding points is shown in Fig. 7. The comparison between the simulated depth and the observed depth is shown in Table 2, which indicates that the simulated water depth of the model is close to the observed water depth in general. The percentage of data with relative error less than 10% between observed and simulated inundation depths at inundation points is 75%, and the ratio of data with relative error less than or equal to 15% is 87.5%. The value of NSE is 0.725. It indicates that the model can simulate the inundation of the study area well.

Fig. 7
figure 7

Distribution of observed flooded points in the Haidian Island study area on Hainan Island, China

Table 2 Comparison of the observed and simulated water depths in the Haidian Island study area on Hainan Island, China

4.2 Rapid Prediction Model of Urban Floods Based on the Light Gradient Boosting Machine (LightGBM) Model

In this part, we first prepare the data required to construct the LightGBM model, optimize the hyperparameters of the LightGBM model using K-fold cross validation and grid search algorithms, and then derive the prediction results on the test set and during Typhoon Rammasun in July 2014 based on the LightGBM model. Finally, the accuracy of the model is analyzed according to the model accuracy evaluation index.

4.2.1 Data Preparation

Instead of considering the complicated physical processes that cause floods, this study rapidly predicts the maximum water depth based on the patterns learned by the LightGBM model. We selected 10 flooded points in the study area. These points are flood-prone areas identified by the Haikou Municipal Water Bureau and are distributed evenly on Haidian Island. We extracted 10 input variables from the rainfall and tide level series for the LightGBM model. The six variables related to rainfall are cumulative rainfall, rainfall return period, rainfall peak, maximum 2h rainfall, maximum 3h rainfall, and cumulative rainfall before the peak. The four variables related to tide level are maximum tide level, return period of tide level, average tide level, and maximum 5h average tide level. The location of the inundation point was also used as an input variable to represent different points. The maximum water depth at each flooded point was simulated under different scenarios of rainfall-tide level combinations as the output variable of the LightGBM model. To augment the diversity of the data and prevent overfitting of the model, we added noise obeying Gaussian distribution to the data with a ratio of the noise of 0.01 (Kim and Han 2020). After adding Gaussian noise to the data, the dataset was normalized using Z-score to eliminate the magnitude effect of different data (Jain et al. 2005). Overall, a dataset of 490 examples was obtained by simulating 49 scenarios of rainfall-tide level combinations using the PCSWMM. There are 11 input variables and 1 output variable in each sample. Table 3 details the inputs and output of the LightGBM model and PCSWMM.

Table 3 The inputs and output of the light gradient boosting machine (LightGBM) model and the personal computer storm water management model (PCSWMM)

The dataset needed to be divided into a training set and a test set before constructing the LightGBM model. For the test set, we selected five sets of sample data with representative combined scenarios of rainfall-tide level. The return periods of rainfall and tide levels for the five data sets are the same, that is, 5 years, 20 years, 35 years, 75 years, and 100 years, respectively. The rest of the data are the training set. There are 440 data in the training set, and 50 data in the test set.

4.2.2 Hyperparameter Optimization Results

The hyperparameters of the machine learning model affect the prediction of the model, and the optimization of the hyperparameters contributes to precise prediction. This study performed an optimization search for the hyperparameters of the LightGBM model within a given range using 10-fold cross-validation and grid search algorithms.

The three hyperparameters of the LightGBM model were used as the optimization objectives of the grid search algorithm. The default value of the learning rate is 0.1, which controls the gradient descent speed of the model. If the learning rate is too small, the training time will be too long. Conversely, if the learning rate is too large, the model accuracy will decrease. The default value for the number of estimators is 100. An excessive number of estimators can lead to overfitting. A too small number of estimators will make the model fit insufficient. The default value for the number of leaves is 31. The possibility of overfitting will be significantly increased when the number of leaves is large, while a small number of leaves will cause a large error in model training. To optimize the hyperparameters of the LightGBM model, we first defined the hyperparameter range of the model. The range and interval of these hyperparameters form a three-dimensional hyperparameter grid. The grid search algorithm is used to traverse each node in the hyperparametric grid. We evaluated the generalization ability of each hyperparameter set according to the RMSE by using a 10-fold cross-validation method. Then, we chose the optimal values of the hyperparameters that have the best generalization performance, as shown in Table 4.

Table 4 The optimization range and optimum values of the light gradient boosting machine (LightGBM) model hyperparameters based on grid search (GS) algorithm

4.2.3 Maximum Water Depth Prediction Based on the Light Gradient Boosting Machine (LightGBM) Model

Python 3.7 was used to develop the machine learning model. The platform for writing the program is the web-based Google Colab notebook.Footnote 4

Based on the optimal hyperparameter set, we constructed the LightGBM model to make predictions on the test set, and we obtained the maximum water depth at each inundation point. Figure 8 shows the predicted maximum water depth in the test phase. The predictions follow the one-to-one line and show a good predictive performance. Figure 9 shows the predictions made by the LightGBM model for the 10 inundation points in the test set. The predicted values show a strong consistency with the simulated values.

Fig. 8
figure 8

Predicted depth of the light gradient boosting machine (LightGBM) model under different return periods: a 5, b 20, c 35, d 75, and e 100 years

Fig. 9
figure 9

The light gradient boosting machine (LightGBM) model prediction results for 10 flood-prone points under different return periods

In addition, we used rainfall and tide levels during Typhoon Rammasun as inputs for the LightGBM model and obtained the predicted water depths at the 10 flooded points. The simulated water depth is obtained by inputting the rainfall and tide level into the calibrated PCSWMM model. The predicted water depth of the LightGBM model is compared with the simulated water depth of the PCSWMM in the form of a bar diagram, as shown in Fig. 10. The result shows that the LightGBM model makes a good prediction of the inundation depth during Typhoon Rammasun.

Fig. 10
figure 10

Comparison of simulated water depth from PCSWMM with predicted water depth from the light gradient boosting machine (LightGBM) model at flooded points in the Haidian Island study area on Hainan Island, China during Typhoon Rammasun on 18 July 2014

4.2.4 Model Accuracy Analysis

To assess the performance of the model at each phase, we evaluated the constructed LightGBM using RMSE, MAPE, SSE, and NSE. The evaluation results are shown in Table 5.

Table 5 Predictive capability of the light gradient boosting machine (LightGBM) model

The aim of the LightGBM model is to make accurate predictions on unknown data, and the evaluation results in the test phase are crucial to measuring the performance of the model. If the evaluation results of the training phase are significantly better than those of the testing phase, the model has been overfitted. In the results of this study, the RMSE, MAPE, SSE, and NSE of the LightGBM model during the test phase are 0.0292, 0.0380, 0.6827, and 0.9896, respectively. It demonstrates that the LightGBM model is feasible in water depth prediction and the degree of overfitting is acceptable.

The distribution of the simulated water depth of PCSWMM and the predicted water depth of the LightGBM model in the study area on the test set is shown in Fig. 11. The simulated water depth is very close to the predicted water depth. To verify whether the LightGBM model can accomplish real-time flood prediction, we provide the runtimes required by LightGBM and PCSWMM to predict water depth under the same scenario in Table 6. For the LightGBM model, it takes only 35 seconds (including the training time of the model) to generate the water depth at the flooded points. Drawing a completely reliable comparison between the computational speed of the two models is difficult since they use different computing devices. However, the values in Table 6 can still demonstrate that the LightGBM model is computationally efficient and can substitute the physical model.

Fig. 11
figure 11

Spatial distribution of simulated and predicted water depths in the Haidian Island study area on Hainan Island, China when the return periods of both rainfall and tide level are a 5 years, b 20 years, c 35 years, d 75 years, and e 100 years

Table 6 The runtimes of the light gradient boosting machine (LightGBM) and personal computer storm water management model (PCSWMM) models for predicting water depth

4.3 Variable Importance

The importance of the 11 input variables of the LightGBM model for coastal urban flooding was analyzed by the Gini index (Sandri and Zuccolotto 2008; Nembrini et al. 2018). The results (Fig. 12) show that the importance of the maximum tide level is 14.62%, which is the most important flood-causing factor, followed by the average tide level (12.48%), the return period of the tide level (11.61%), and the maximum 5h average tide level (11.36%), all with a relative importance greater than 10%. The relative importance of the remaining seven flood-causing factors is greater than 5%: rainfall peak (8.38%), cumulative rainfall (8.09%), location of flooded point (7.93%), maximum 3h rainfall (7.23%), cumulative rainfall before peak (7.02%), rainfall return period (6.20%), and maximum 2h rainfall (5.07%). The result reveals that the relative importance of flood-causing factors related to tide levels is higher than those related to rainfall. The reason for this is that Haidian Island is surrounded by the sea and is affected by the tide jacking (Olbert et al. 2017; Dai and Cai 2021). This increases the risk of urban flooding.

Fig. 12
figure 12

Relative importance of urban flood conditioning factors based on the light gradient boosting machine (LightGBM) model

4.4 Comparison between the Results of the Light Gradient Boosting Machine (LightGBM) Model and Other Machine Learning Models

To verify the superiority of the LightGBM model, we compared the prediction performance of LightGBM with RF, XGBoost, and KNN. We made separate predictions for each model on the test set and plotted the predicted water depth at each flooded point. The prediction results of the LightGBM, RF, XGBoost, and KNN models are shown in Fig. 13. The prediction results of the LightGBM model in Fig. 10 show that the model can predict the maximum water depth more accurately for flooded point 1, which is difficult to predict accurately by other models. The LightGBM model made a good prediction for other flooded points.

Fig. 13
figure 13

The prediction results of each model for the 10 flood-prone points in the Haidian Island study area on Hainan Island, China under different return periods

The scatter distribution of predicted versus simulated water depths for LightGBM, RF, XGBoost, and KNN over the entire test set is shown in Fig. 14. Visually, the concentration of the scatter distribution of LightGBM is close to RF and XGBoost, while the data points of the KNN model are the most scattered. In addition, we trained the LightGBM, RF, XGBoost, and KNN models on the same training set and tested the performance of each model at each phase separately. The values for each evaluation index are shown in Table 7. The result shows that LightGBM outperformed the other models in terms of all metric values in the training and testing phases.

Fig. 14
figure 14

Scatter distribution of predicted and simulated values for each machine learning model

Table 7 Comparison of the evaluation index values of each model in the prediction stage

To compare the prediction effectiveness of each machine learning model during Typhoon Rammasun, we use rainfall and tide level during the typhoon as input conditions. The predicted water depths of each machine learning model were obtained and plotted in Fig. 15 along with the simulated water depth from the PCSWMM. The result suggests that the LightGBM model made the best prediction under this scenario.

Fig. 15
figure 15

Comparison of simulated water depth from the personal computer storm water management model (PCSWMM) with predicted water depth from light gradient boosting machine (LightGBM), random forest (RF), extreme gradient boosting (XGBoost), and k-nearest neighbor (KNN) at the 10 flooded points in the Haidian Island study area on Hainan Island, China during Typhoon Rammasun on 18 July 2014

5 Conclusion

Simulation and prediction of urban flooding using the machine learning approach will remain one of the hot topics and challenges in future research on urban flood control. This study constructed a rapid prediction model for urban flooding by integrating a LightGBM model and 1D-2D coupled hydrological–hydraulic model. A total of 49 combined rainfall-tide scenarios were simulated based on the PCSWMM to provide data for the construction of the LightGBM model. The 11 feature variables were extracted from rainfall, tide level, and location of the flooded points as the input of the LightGBM model, and the maximum inundation depth of the flood-prone points was the output of the model. K-fold cross-validation and grid search algorithms were used to optimize the hyperparameter set of the LightGBM model. The optimum values of learning rate, number of estimators, and number of leaves of the LightGBM model are 0.11, 450, and 12, respectively. The NSE of the LightGBM model on the test is 0.9896, which outperforms RF, XGBoost, and KNN. The results show that the LightGBM model can effectively predict the water depths of the 1D-2D coupled hydrological–hydraulic model. From the LightGBM model, the variables related to tide level were analyzed as the dominant variables for predicting the inundation depth based on the Gini index. Due to the superior performance and computational efficiency, the proposed LightGBM model provides time for decision makers to take action to mitigate flooding.

However, further improvements are needed due to the limitation of the data. In this study, we considered the worst case scenario and used the observed rainfall and tide level data of Typhoon Rammasun to obtain the design rainfall and tide levels under different return periods. There is no consideration of the effect of rainfall patterns, which has an impact on the generalization of the LightGBM model. More factors (such as rainfall patterns, distance to rivers, and so on) can be considered to further improve the model’s predictive ability in our future work. The physical model and data-driven model are the two main flood forecasting approaches. Physical models are in principle directly interpretable and offer the potential of extrapolation beyond observed conditions, whereas data-driven approaches are highly flexible in adapting to data and are amenable to finding unexpected patterns. In the future, the hybrid modeling approach coupling physical process models with data-driven machine learning should be further investigated to improve flood forecasting capabilities.