Rapid Prediction Model for Urban Floods Based on a Light Gradient Boosting Machine Approach and Hydrological–Hydraulic Model

Xu, Kui; Han, Zhentao; Xu, Hongshi; Bin, Lingling

doi:10.1007/s13753-023-00465-2

Rapid Prediction Model for Urban Floods Based on a Light Gradient Boosting Machine Approach and Hydrological–Hydraulic Model

Article
Open access
Published: 09 February 2023

Volume 14, pages 79–97, (2023)
Cite this article

Download PDF

You have full access to this open access article

International Journal of Disaster Risk Science Aims and scope Submit manuscript

Rapid Prediction Model for Urban Floods Based on a Light Gradient Boosting Machine Approach and Hydrological–Hydraulic Model

Download PDF

Kui Xu¹,
Zhentao Han¹,
Hongshi Xu² &
…
Lingling Bin³

3578 Accesses
21 Citations
Explore all metrics

Abstract

Global climate change and sea level rise have led to increased losses from flooding. Accurate prediction of floods is essential to mitigating flood losses in coastal cities. Physically based models cannot satisfy the demand for real-time prediction for urban flooding due to their computational complexity. In this study, we proposed a hybrid modeling approach for rapid prediction of urban floods, coupling the physically based model with the light gradient boosting machine (LightGBM) model. A hydrological–hydraulic model was used to provide sufficient data for the LightGBM model based on the personal computer storm water management model (PCSWMM). The variables related to rainfall, tide level, and the location of flood points were used as the input for the LightGBM model. To improve the prediction accuracy, the hyperparameters of the LightGBM model are optimized by grid search algorithm and K-fold cross-validation. Taking Haidian Island, Hainan Province, China as a case study, the optimum values of the learning rate, number of estimators, and number of leaves of the LightGBM model are 0.11, 450, and 12, respectively. The Nash-Sutcliffe efficiency coefficient (NSE) of the LightGBM model on the test set is 0.9896, indicating that the LightGBM model has reliable predictions and outperforms random forest (RF), extreme gradient boosting (XGBoost), and k-nearest neighbor (KNN). From the LightGBM model, the variables related to tide level were analyzed as the dominant variables for predicting the inundation depth based on the Gini index in the study area. The proposed LightGBM model provides a scientific reference for flood control in coastal cities considering its superior performance and computational efficiency.

Machine learning algorithm for flash flood prediction mapping in Wadi El-Laqeita and surroundings, Central Eastern Desert, Egypt

Article 15 February 2021

Classification machine learning models for urban flood hazard mapping: case study of Zaio, NE Morocco

Article 16 April 2024

Integration of extreme gradient boosting feature selection approach with machine learning models: application of weather relative humidity prediction

Article 13 August 2021

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Global climate change is increasingly affecting rainfall intensity, making extreme rainfall events more frequent. In addition, urbanization has a significant impact on the rainfall-runoff relationship (Ferguson and Suckling 1990; Aronica et al. 2012). Floods are the most common natural hazard-related disaster in the world (Berz 2000) with severe social and economic damages. Floods are more severe in coastal cities with developed economies that are affected by rainfall and tide level. With climate change and sea level rise, the risk of coastal floods continues to increase. Because many coastal cities are low-lying and have poor drainage systems for rainwater, they are vulnerable to flooding. For example, roughly 4354 people were missing or died and about USD 70.6 billion in economic losses were incurred during coastal flood disasters in China between 1989 and 2014 (Fang et al. 2017). Rapid flood prediction gives urban decision makers sufficient time to take flood prevention measures to mitigate social and economic losses (Zevenbergen et al. 2008).

In the last few decades, the study of prediction models for urban floods has made significant progress, and the models can be divided into physical models and data-driven models. The physical models are mainly hydrological–hydraulic models, with relatively mature technical development and high computational accuracy. In recent years, with the development of numerical simulation technology, hydrological–hydraulic models have been widely used in urban flood simulation (Bates et al. 2010; Frank et al. 2011; Yamazaki et al. 2011). Sidek et al. (2021) constructed an urban flood simulation model based on InfoWorks ICM, an advanced integrated catchment modeling software, which could effectively predict the depth and extent of floods. Wu et al. (2017) developed a two-dimensional flood inundation model to simulate dynamic urban flooding by coupling a storm water management model (SWMM) and LISFLOOD-FP.^{Footnote 1} It reveals the evolution of urban flooding under different storms, sea level rise, and ground subsidence scenarios in the future. Thorndahl et al. (2016) built an integrated flood and drainage system model based on MIKE FLOOD.^{Footnote 2} This model is calculated using only a two-dimensional surface model, and generates detailed flood maps with high resolution radar rainfall data. However, the complexity of hydrological–hydraulic models and longer computational time limit the development of models (Liu et al. 2015; Huang et al. 2017; Zanchetta and Coulibaly 2020). Particularly for study areas with a large spatial extent or high spatial resolution of data, the computational time of hydrological–hydraulic models will be much longer (Bermúdez et al. 2018). For flood simulation studies in urban areas, a large spatial extent and high spatial resolution are required. Even the fastest physical model simulation methods still consume a large computational resource, and it is difficult to meet the demand for real-time flood prediction (Bhola et al. 2018). Moreover, hydrological–hydraulic models have high requirements for flood observed data and need to calibrate a large number of parameters to describe flooding processes accurately. The significant lack of flood observed data has affected its generalization and application.

Data-driven models are also commonly used in the field of urban flood prediction. These models identify patterns between inputs and outputs by learning based on a large amount of data. Such models can give model outputs directly based on model inputs, thus achieving rapid flood prediction. Ma et al. (2021) generated risk maps of flash floods by proposing a model for flash flood risk assessment based on extreme gradient boosting (XGBoost) at the county level in Yunnan Province, China. Wu et al. (2020) constructed a fast prediction model for urban floods based on the gradient boosted decision trees (GBDT) algorithm to predict the maximum water depth of urban inundation points. Sadler et al. (2018) proposed to train a random forest (RF) algorithm by using large amounts of environmental data to assess the severity of flooding in coastal cities. In these studies, a large amount of flooding data is required for data-driven models to construct effective flood prediction models. But in many regions, data-driven models are limited in application due to the scarcity of measured data.

One feasible solution is to couple a physical model with a data-driven model (Reichstein et al. 2019). The physical model is used to simulate lots of flooding data to train the data-driven model. For example, Kabir et al. (2020) combined LISFLOOD-FP and a convolutional neural network (CNN) for fast prediction of water depth of river floods. Berkhahn et al. (2019) constructed a rapid urban flood prediction model based on integrated neural networks using the HYSTEM-EXTRAN^{Footnote 3} 2D model to generate a database of flood maps. Löwe et al. (2021) proposed a method that integrated MIKE FLOOD and CNN to predict water depth of urban river floods. However, the pre-sort-based algorithms, level-wise growth tree strategy, and one-hot encoding used by boosting tools in the above studies resulted in larger memory usage and slower training speed. The light gradient boosting machine (LightGBM) is a machine learning framework based on GBDT originally proposed by Microsoft in 2017. The histogram-based algorithm used by the LightGBM model reduces memory usage and speeds up training. The leaf-wise growth tree strategy used by the LightGBM model enables the model to have higher accuracy. Moreover, the LightGBM model supports input of categorical features directly, which simplifies data preprocessing. Because of these merits, the LightGBM model is extremely suitable as a machine learning algorithm for rapid simulation of urban flooding. It has been widely applied in many fields such as rainfall correction (Lee et al. 2019) and forest fire analysis (Nguyen et al. 2022), but there are few studies related to urban floods (Cui et al. 2021).

The objectives of this study are threefold: (1) An approach that couples the LightGBM model and hydrological–hydraulic model is proposed for rapid prediction of flood inundation, where the hyperparameters of the LightGBM model are optimized by grid search algorithm and K-fold cross-validation. (2) The performance of the LightGBM model is compared with the XGBoost, RF, and k-nearest neighbor (KNN) models. (3) The importance of rainfall, tide level, and other input variables for coastal urban flooding is quantified based on the LightGBM model and the Gini index. Section 2 presents the basic methodological theory used in the study. Section 3 describes the study area and data. Section 4 describes the rapid prediction model for urban floods based on the LightGBM model. A brief conclusion and potential improvements in the future are provided in Sect. 5.

2 Methods

This section summarizes the study framework of the rapid prediction model for urban flooding, introduces the theory of one- and two-dimensional coupled hydrological–hydraulic model, light gradient boosting machine, K-fold cross-validation, and grid search algorithm, and selects four model accuracy evaluation metrics for this study.

2.1 Study Framework

In this study, we adopted the LightGBM model coupled with a physical model to construct a rapid prediction model for urban floods. The flowchart of the proposed prediction model is shown in Fig. 1. First, we designed 49 combined scenarios of rainfall-tide levels. A one- and two-dimensional coupled hydrological–hydraulic model based on the personal computer storm water management model (PCSWMM) was constructed to simulate the inundation distribution of flooding in the study area. Next, we extracted 10 variables from the rainfall series and tide level series and used them together with the location of flood-prone points as input variables for the machine learning model. A LightGBM model optimized by K-fold cross-validation and grid search algorithm was proposed to predict the maximum water depth at flooded points in urban floods. Then the constructed LightGBM model and other benchmark models were evaluated and compared on the test set based on the evaluation metrics. Finally, we determined the relative importance of the variables that cause urban flooding by calculating the Gini index.

2.2 One- and Two-Dimensional Coupled Hydrological–Hydraulic Model Based on the Personal Computer Storm Water Management Model (PCSWMM)

The PCSWMM is a hydrological–hydraulic model developed by the Computational Hydraulic Institute (CHI) of Canada based on the storm water management model (SWMM), which is widely used in studies of drainage networks and management of urban storm water (Ahiablame and Shakya 2016; Wu and Huang 2016). We constructed a flood simulation model based on the PCSWMM to provide data for the machine learning model.

The calculations of the PCSWMM include surface runoff, flow concentration, flow routing, and 1D-2D coupled simulation. The surface runoff is calculated by the Manning formula:

$$Q=\frac{W}{n}{\left(d-{d}_{p}\right)}^{5/3}\cdot {S}^{1/2}$$

(1)

Where Q is the outflow rate (m³/s); W is the width of the subcatchment (m); n is the surface Manning factor; d is the water depth of the reservoir (m); d_p indicates the maximum depression depth (m); and S is the average slope of the subcatchment.

The nonlinear reservoir method is used for calculating flow concentration. It is calculated by associating the continuity equation and the Manning formula, where the continuity equation is:

$$\frac{{\text{d}}V}{{\text{d}}t}=A\frac{{\text{d}}d}{{\text{d}}t}=A{i}^{*}-Q$$

(2)

Where V is the total volume of water in the catchment (m³); d is the water depth (m); t is the time (s); A is the area of the subcatchment (m²); i^* is the net rainfall intensity (m/s); and Q is the surface runoff (m³/s).

The flow routing is calculated using the completely dynamic wave method, and the continuity (Eq. 3) and momentum (Eq. 4) are:

$$\frac{\partial Q}{\partial x}+\frac{\partial A}{\partial t}=0$$

(3)

$$gA\frac{\partial H}{\partial x}+\frac{\partial \left({Q}^{2}/A\right)}{\partial x}+\frac{\partial Q}{\partial t}+gA{S}_{f}=0$$

(4)

Where Q is the flow (m³/s); x is the distance (m); A is the cross-sectional area (m²); t is the time (s); H is the water depth (m); G is the acceleration of gravity (m/s²); and S_f is the friction slope.

2.3 The Light Gradient Boosting Machine (LightGBM) Model for Urban Flooding Prediction

The light gradient boosting machine is a machine learning framework based on GBDT. Compared with other tree-based learning algorithms, the LightGBM model has significant improvements in training speed, computational efficiency, and prediction accuracy. The model is particularly suitable for regression, classification, sorting, and other machine learning tasks (Ke et al. 2017).

The LightGBM model uses a histogram-based algorithm (Fig. 2a) for decision tree learning (Ranka and Singh 1998; Li et al. 2007), which buckets the continuous feature values into k bins, and stores them in a histogram of width k. This has great advantages in terms of training speed and memory usage.

On the basis of the histogram algorithm, the LightGBM model is further optimized by changing the strategy of tree growth. Unlike most tree-based learning algorithms that use a level-wise tree growth strategy, the LightGBM model grows trees leaf-wise with a depth limit. Figure 2b (i) shows the schematic diagram of level-wise tree growth. The level-wise tree growth strategy can split the leaves of the same layer at the same time as the decision tree grows, but the gain of information obtained from splitting the leaves at a specific layer is different from the gain of information obtained from splitting the other leaves at the same layer (Liang et al. 2020). The processing of leaves with low gain of information provides little improvement in prediction accuracy, and increases memory consumption. Figure 2b (ii) illustrates the leaf-wise growth strategy. The leaf-wise growth tree splits only the leaf with the greatest gain of information, and leaf-wise growth strategy tend to achieve lower loss than level-wise growth strategy. But it may lead to overfitting, so we need to limit tree depth (Cui et al. 2021).

In addition, the LightGBM model supports the input of category features directly. Most other machine learning algorithms require one-hot encoding to represent categorical features. This is not optimal for tree-based learning algorithms. It causes the data to become sparse and reduces the computational efficiency.

2.4 K-Fold Cross-Validation and Grid Search Algorithms

The goodness of fit of the training set and the prediction accuracy of the test set for a given machine learning algorithm are mainly determined by its hyperparameters. For example, the learning rate, the number of estimators, and the number of leaves in the LightGBM model all affect the reliability of the model directly. To maximize the performance of the LightGBM model, the hyperparameters of the LightGBM model are optimized using K-fold cross-validation and grid search algorithms based on the open source machine learning framework named scikit-learn (Varoquaux et al. 2015).

K-fold cross-validation is a type of non-exhaustive cross-validation (Stone 1974). It can be used effectively to evaluate the generalization ability of a machine learning model. K-fold cross-validation divides the data set into k subsets randomly and evenly as shown in Fig. 2c. For each training and testing of the model, k-1 subsets are used for training the model, leaving one subset as the test set and the training and testing are repeated k times to let each subset be used as a test set for one time. Finally, the results of k tests are averaged as the model evaluation value. In general, k is assigned to 10 (McLachlan et al. 2004), which means 10-fold cross-validation is used as the evaluation method of performance during model training.

Grid search is an exhaustive algorithm widely used for hyperparameter optimization of machine learning models (Wang et al. 2013; Pontes et al. 2016; Ogunleye and Wang 2019). Grid search can traverse different combinations of hyperparameters within a given search range, as shown in Fig. 2d. The basic idea of the grid search strategy is to find the optimal set from the first dimension to the n-th dimension, respectively. When searching for the optimal set from the i-th dimension, the values of other dimensions are kept constant. Then the performance of models under different sets of hyperparameters is compared by the K-fold cross-validation approach, and thus determines the most suitable model hyperparameters for the current training data (Liu et al. 2014).

2.5 Model Accuracy Evaluation Index

A set of evaluation metrics of model accuracy is required to assess the performance of a model at each phase of model construction. In this study, we used root mean square error (RMSE), mean absolute percentage error (MAPE), sum of squared errors (SSE), and Nash-Sutcliffe efficiency coefficient (NSE) to evaluate the performance of the model.

The RMSE is the standard deviation of the residuals, as shown in Eq. 5. A smaller RMSE value indicates better model quality. The MAPE expresses the prediction accuracy as a ratio defined by Eq. 6. The ratio is the average rate of deviation of predicted values from observed values. A lower MAPE value represents higher predictive accuracy of the model. The SSE is the sum of squared errors between the predicted and observed values, as shown in Eq. 7. The closer SSE is to 0, the better the model prediction is. The NSE (Nash and Sutcliffe 1970) is commonly used to evaluate the simulation results of hydrological models, as shown in Eq. 8. The model with NSE value closer to 1 has better predictive ability than the model with NSE value closer to 0.

$$RMSE=\sqrt{\frac{1}{n}{\sum }_{i=1}^{n}{\left({H}_{a}^{i}-{H}_{f}^{i}\right)}^{2}}$$

(5)

$$MAPE=\frac{100\mathrm{\%}}{n}{\sum }_{i=1}^{n}\left|\frac{{H}_{a}^{i}-{H}_{f}^{i}}{{H}_{a}^{i}}\right|$$

(6)

$$SSE={\sum }_{i=1}^{n}{\left({H}_{a}^{i}-{H}_{\rho }^{i}\right)}^{2}$$

(7)

$$NSE=1-\frac{{\sum }_{i=1}^{n}{\left({H}_{a}^{i}-{H}_{f}^{i}\right)}^{2}}{{\sum }_{i=1}^{n}{\left({H}_{a}^{i}-\overline{{H}_{a}}\right)}^{2}}$$

(8)

Where ${H}_{a}^{i}$ and ${H}_{f}^{i}$ are the real and predicted values of the i-th sample, respectively; $\overline{{H}_{a}}$ is the average value of the observed data; and n is the number of examples.

3 Study Area and Data

This section provides a detailed description of the study area and the required data.

3.1 Study Area

Haikou, the capital city of Hainan Province, is located at the northernmost corner of Hainan Island (Fig. 3). It is separated from Guangdong Province by the Qiongzhou Strait, and is located between ${19}^{\circ }{31}^{\mathrm{^{\prime}}}{32}^{\mathrm{^{\prime}}\mathrm{^{\prime}}}-{20}^{\circ }{04}^{\mathrm{^{\prime}}}{52}^{\mathrm{^{\prime}}\mathrm{^{\prime}}}$ North latitude and ${110}^{\circ }{07}^{\mathrm{^{\prime}}}{22}^{\mathrm{^{\prime}}\mathrm{^{\prime}}}-{110}^{\circ }{42}^{\mathrm{^{\prime}}}{32}^{\mathrm{^{\prime}}\mathrm{^{\prime}}}$ East longitude. The city is the economic and cultural center of Hainan Province.

Haikou is in the tropical climate zone. It has a tropical seasonal climate with high average temperatures throughout the year. Rainfall is plentiful due to frequent typhoons. Precipitation mainly occurs in May to October, and the average annual precipitation is 1827 mm. The dense river and water systems in Haikou have a key function for the flood control of the city.

Haidian Island is separated from the main city by two rivers (see Fig. 3) and is the largest island in Haikou with an area of about 14 km². The urban drainage system and river channels together form the flood control and drainage system of Haidian Island. The flat topography and the jacking effect of seawater on the drainage outlet of the island result in poor drainage capacity, thus the island is susceptible to seawater backflow.

3.2 Data

We selected Haidian Island as the case study area and constructed a 1D-2D coupled flood simulation model based on the PCSWMM. The required data include the digital elevation model (DEM), drainage network, 1D inspection wells, subcatchment area, and water blocking obstructions. The DEM (Fig. 4a) data were obtained from the elevation data of Haidian Island provided by the Haikou Municipal Water Bureau. Drainage network and 1D inspection wells information were also provided by the Haikou Municipal Water Bureau. The data were preprocessed using ArcGIS software to get the 1D inspection wells and pipe network in the study area (Fig. 4b). According to the river system in the study area, each river section is generalized into different channels. The study area was divided into 13 subcatchment areas, as shown in Fig. 4c, by combining the topographic and geomorphic features, drainage pipes, and roads layout. The water blocking obstacles are mainly buildings in the urban area. Obstacles in the study area were identified by identifying buildings on satellite images of urban areas (Fig. 4d). We used the observed water depth data to calibrate the 1D-2D coupled flood simulation model. The observed water depth data were collected partly using a water level ruler through field investigation and partly provided by the Haikou Municipal Water Bureau during Typhoon Rammasun in July 2014.

After constructing the flood simulation model using the above data, we needed to input the boundary conditions into the model to simulate the data of flood distribution. The generalized extreme value (GEV) function was adopted to fit the distributions of rainfall and tide level (Xu et al. 2019; Xu et al. 2022). The parameters of the GEV function were estimated by the maximum likelihood method. The rainfall distribution on Haidian Island was described by the GEV function with location parameter ξ = 122.51, scale parameter α = 46.78, and shape parameter k = 0.115. The tide level distribution was also described by the GEV function with location parameter α = 2.349, scale parameter α = 0.380, and shape parameter k = − 0.048. Then, the design rainfall and tide peaks for seven return periods from 5 years to 100 years were calculated by the GEV function (Table 1). The rainfall event on 18 July 2014 was taken as a typical rainfall and together with the tide level process on the same day, it created the most severe flooding scenario (Fig. 5). Finally, the typical rainfall and tide level processes are scaled up by the same magnitude factors using the same ratio method since it is easy to calculate and can maintain the shape of the typical rainfall and tide level processes, and the magnitude factors were determined by the rainfall and tide peak. The magnification ratios obtained by this method are shown in Table 1. Thus, we obtained the temporal distribution of the designed rainfall (Fig. 6a) and tide level (Fig. 6b) for different return periods. The duration of both rainfall and tide levels were selected to be 24 hours, with a temporal resolution of 1 hour. We combined rainfall and tide level data to obtain 49 scenarios as boundary conditions for the flood simulation model.

Table 1 Magnification ratios obtained by the same ratio method

Full size table

4 Results

This section demonstrates the results of this study, which include the urban flooding model by PCSWMM, the rapid prediction model of urban floods based on LightGBM, comparison of results between LightGBM and other machine learning models, and the variable importance.

4.1 Establishment of an Urban Flooding Model by Using the Personal Computer Storm Water Management Model (PCSWMM)

To build a 1D-2D coupled flood simulation model, the 1D drainage model with 2,667 inspection wells and 2,042 conduits was first constructed. Then we constructed the 2D model of surface inundation based on the 2D nodes, 2D grids, and DEM. Finally, we performed the coupling of the 1D-2D models by connecting the 1D drainage model and the 2D surface inundation model using the bottom orifice connection. A grid type with a hexagonal shape was used for the model. Its spatial resolution is 25 m. The dynamic wave method was adopted to simulate the flow routing, and the Horton model was selected for the infiltration calculation.

To ensure the accuracy of the constructed model, we selected the measured rainfall and tide level series during Typhoon Rammasun in July 2014 (see Fig. 5) to calibrate the model. The location distribution of observed flooding points is shown in Fig. 7. The comparison between the simulated depth and the observed depth is shown in Table 2, which indicates that the simulated water depth of the model is close to the observed water depth in general. The percentage of data with relative error less than 10% between observed and simulated inundation depths at inundation points is 75%, and the ratio of data with relative error less than or equal to 15% is 87.5%. The value of NSE is 0.725. It indicates that the model can simulate the inundation of the study area well.

Table 2 Comparison of the observed and simulated water depths in the Haidian Island study area on Hainan Island, China

Full size table

4.2 Rapid Prediction Model of Urban Floods Based on the Light Gradient Boosting Machine (LightGBM) Model

In this part, we first prepare the data required to construct the LightGBM model, optimize the hyperparameters of the LightGBM model using K-fold cross validation and grid search algorithms, and then derive the prediction results on the test set and during Typhoon Rammasun in July 2014 based on the LightGBM model. Finally, the accuracy of the model is analyzed according to the model accuracy evaluation index.

4.2.1 Data Preparation

Instead of considering the complicated physical processes that cause floods, this study rapidly predicts the maximum water depth based on the patterns learned by the LightGBM model. We selected 10 flooded points in the study area. These points are flood-prone areas identified by the Haikou Municipal Water Bureau and are distributed evenly on Haidian Island. We extracted 10 input variables from the rainfall and tide level series for the LightGBM model. The six variables related to rainfall are cumulative rainfall, rainfall return period, rainfall peak, maximum 2h rainfall, maximum 3h rainfall, and cumulative rainfall before the peak. The four variables related to tide level are maximum tide level, return period of tide level, average tide level, and maximum 5h average tide level. The location of the inundation point was also used as an input variable to represent different points. The maximum water depth at each flooded point was simulated under different scenarios of rainfall-tide level combinations as the output variable of the LightGBM model. To augment the diversity of the data and prevent overfitting of the model, we added noise obeying Gaussian distribution to the data with a ratio of the noise of 0.01 (Kim and Han 2020). After adding Gaussian noise to the data, the dataset was normalized using Z-score to eliminate the magnitude effect of different data (Jain et al. 2005). Overall, a dataset of 490 examples was obtained by simulating 49 scenarios of rainfall-tide level combinations using the PCSWMM. There are 11 input variables and 1 output variable in each sample. Table 3 details the inputs and output of the LightGBM model and PCSWMM.

Table 3 The inputs and output of the light gradient boosting machine (LightGBM) model and the personal computer storm water management model (PCSWMM)

Full size table

The dataset needed to be divided into a training set and a test set before constructing the LightGBM model. For the test set, we selected five sets of sample data with representative combined scenarios of rainfall-tide level. The return periods of rainfall and tide levels for the five data sets are the same, that is, 5 years, 20 years, 35 years, 75 years, and 100 years, respectively. The rest of the data are the training set. There are 440 data in the training set, and 50 data in the test set.

4.2.2 Hyperparameter Optimization Results

The hyperparameters of the machine learning model affect the prediction of the model, and the optimization of the hyperparameters contributes to precise prediction. This study performed an optimization search for the hyperparameters of the LightGBM model within a given range using 10-fold cross-validation and grid search algorithms.

The three hyperparameters of the LightGBM model were used as the optimization objectives of the grid search algorithm. The default value of the learning rate is 0.1, which controls the gradient descent speed of the model. If the learning rate is too small, the training time will be too long. Conversely, if the learning rate is too large, the model accuracy will decrease. The default value for the number of estimators is 100. An excessive number of estimators can lead to overfitting. A too small number of estimators will make the model fit insufficient. The default value for the number of leaves is 31. The possibility of overfitting will be significantly increased when the number of leaves is large, while a small number of leaves will cause a large error in model training. To optimize the hyperparameters of the LightGBM model, we first defined the hyperparameter range of the model. The range and interval of these hyperparameters form a three-dimensional hyperparameter grid. The grid search algorithm is used to traverse each node in the hyperparametric grid. We evaluated the generalization ability of each hyperparameter set according to the RMSE by using a 10-fold cross-validation method. Then, we chose the optimal values of the hyperparameters that have the best generalization performance, as shown in Table 4.

Table 4 The optimization range and optimum values of the light gradient boosting machine (LightGBM) model hyperparameters based on grid search (GS) algorithm

Full size table

4.2.3 Maximum Water Depth Prediction Based on the Light Gradient Boosting Machine (LightGBM) Model

Python 3.7 was used to develop the machine learning model. The platform for writing the program is the web-based Google Colab notebook.^{Footnote 4}

Based on the optimal hyperparameter set, we constructed the LightGBM model to make predictions on the test set, and we obtained the maximum water depth at each inundation point. Figure 8 shows the predicted maximum water depth in the test phase. The predictions follow the one-to-one line and show a good predictive performance. Figure 9 shows the predictions made by the LightGBM model for the 10 inundation points in the test set. The predicted values show a strong consistency with the simulated values.

In addition, we used rainfall and tide levels during Typhoon Rammasun as inputs for the LightGBM model and obtained the predicted water depths at the 10 flooded points. The simulated water depth is obtained by inputting the rainfall and tide level into the calibrated PCSWMM model. The predicted water depth of the LightGBM model is compared with the simulated water depth of the PCSWMM in the form of a bar diagram, as shown in Fig. 10. The result shows that the LightGBM model makes a good prediction of the inundation depth during Typhoon Rammasun.

4.2.4 Model Accuracy Analysis

To assess the performance of the model at each phase, we evaluated the constructed LightGBM using RMSE, MAPE, SSE, and NSE. The evaluation results are shown in Table 5.

Table 5 Predictive capability of the light gradient boosting machine (LightGBM) model

Full size table

The aim of the LightGBM model is to make accurate predictions on unknown data, and the evaluation results in the test phase are crucial to measuring the performance of the model. If the evaluation results of the training phase are significantly better than those of the testing phase, the model has been overfitted. In the results of this study, the RMSE, MAPE, SSE, and NSE of the LightGBM model during the test phase are 0.0292, 0.0380, 0.6827, and 0.9896, respectively. It demonstrates that the LightGBM model is feasible in water depth prediction and the degree of overfitting is acceptable.

The distribution of the simulated water depth of PCSWMM and the predicted water depth of the LightGBM model in the study area on the test set is shown in Fig. 11. The simulated water depth is very close to the predicted water depth. To verify whether the LightGBM model can accomplish real-time flood prediction, we provide the runtimes required by LightGBM and PCSWMM to predict water depth under the same scenario in Table 6. For the LightGBM model, it takes only 35 seconds (including the training time of the model) to generate the water depth at the flooded points. Drawing a completely reliable comparison between the computational speed of the two models is difficult since they use different computing devices. However, the values in Table 6 can still demonstrate that the LightGBM model is computationally efficient and can substitute the physical model.

Table 6 The runtimes of the light gradient boosting machine (LightGBM) and personal computer storm water management model (PCSWMM) models for predicting water depth

Full size table

4.3 Variable Importance

The importance of the 11 input variables of the LightGBM model for coastal urban flooding was analyzed by the Gini index (Sandri and Zuccolotto 2008; Nembrini et al. 2018). The results (Fig. 12) show that the importance of the maximum tide level is 14.62%, which is the most important flood-causing factor, followed by the average tide level (12.48%), the return period of the tide level (11.61%), and the maximum 5h average tide level (11.36%), all with a relative importance greater than 10%. The relative importance of the remaining seven flood-causing factors is greater than 5%: rainfall peak (8.38%), cumulative rainfall (8.09%), location of flooded point (7.93%), maximum 3h rainfall (7.23%), cumulative rainfall before peak (7.02%), rainfall return period (6.20%), and maximum 2h rainfall (5.07%). The result reveals that the relative importance of flood-causing factors related to tide levels is higher than those related to rainfall. The reason for this is that Haidian Island is surrounded by the sea and is affected by the tide jacking (Olbert et al. 2017; Dai and Cai 2021). This increases the risk of urban flooding.

4.4 Comparison between the Results of the Light Gradient Boosting Machine (LightGBM) Model and Other Machine Learning Models

To verify the superiority of the LightGBM model, we compared the prediction performance of LightGBM with RF, XGBoost, and KNN. We made separate predictions for each model on the test set and plotted the predicted water depth at each flooded point. The prediction results of the LightGBM, RF, XGBoost, and KNN models are shown in Fig. 13. The prediction results of the LightGBM model in Fig. 10 show that the model can predict the maximum water depth more accurately for flooded point 1, which is difficult to predict accurately by other models. The LightGBM model made a good prediction for other flooded points.

The scatter distribution of predicted versus simulated water depths for LightGBM, RF, XGBoost, and KNN over the entire test set is shown in Fig. 14. Visually, the concentration of the scatter distribution of LightGBM is close to RF and XGBoost, while the data points of the KNN model are the most scattered. In addition, we trained the LightGBM, RF, XGBoost, and KNN models on the same training set and tested the performance of each model at each phase separately. The values for each evaluation index are shown in Table 7. The result shows that LightGBM outperformed the other models in terms of all metric values in the training and testing phases.

Table 7 Comparison of the evaluation index values of each model in the prediction stage

Full size table

To compare the prediction effectiveness of each machine learning model during Typhoon Rammasun, we use rainfall and tide level during the typhoon as input conditions. The predicted water depths of each machine learning model were obtained and plotted in Fig. 15 along with the simulated water depth from the PCSWMM. The result suggests that the LightGBM model made the best prediction under this scenario.

5 Conclusion

Simulation and prediction of urban flooding using the machine learning approach will remain one of the hot topics and challenges in future research on urban flood control. This study constructed a rapid prediction model for urban flooding by integrating a LightGBM model and 1D-2D coupled hydrological–hydraulic model. A total of 49 combined rainfall-tide scenarios were simulated based on the PCSWMM to provide data for the construction of the LightGBM model. The 11 feature variables were extracted from rainfall, tide level, and location of the flooded points as the input of the LightGBM model, and the maximum inundation depth of the flood-prone points was the output of the model. K-fold cross-validation and grid search algorithms were used to optimize the hyperparameter set of the LightGBM model. The optimum values of learning rate, number of estimators, and number of leaves of the LightGBM model are 0.11, 450, and 12, respectively. The NSE of the LightGBM model on the test is 0.9896, which outperforms RF, XGBoost, and KNN. The results show that the LightGBM model can effectively predict the water depths of the 1D-2D coupled hydrological–hydraulic model. From the LightGBM model, the variables related to tide level were analyzed as the dominant variables for predicting the inundation depth based on the Gini index. Due to the superior performance and computational efficiency, the proposed LightGBM model provides time for decision makers to take action to mitigate flooding.

However, further improvements are needed due to the limitation of the data. In this study, we considered the worst case scenario and used the observed rainfall and tide level data of Typhoon Rammasun to obtain the design rainfall and tide levels under different return periods. There is no consideration of the effect of rainfall patterns, which has an impact on the generalization of the LightGBM model. More factors (such as rainfall patterns, distance to rivers, and so on) can be considered to further improve the model’s predictive ability in our future work. The physical model and data-driven model are the two main flood forecasting approaches. Physical models are in principle directly interpretable and offer the potential of extrapolation beyond observed conditions, whereas data-driven approaches are highly flexible in adapting to data and are amenable to finding unexpected patterns. In the future, the hybrid modeling approach coupling physical process models with data-driven machine learning should be further investigated to improve flood forecasting capabilities.

Notes

LISFLOOD-FP is a two-dimensional hydrodynamic model to simulate floodplain inundation, see http://www.bristol.ac.uk/geography/research/hydrology/models/lisflood/.
MIKE FLOOD is a toolbox that includes 1D and 2D flood simulation engines, see https://www.mikepoweredbydhi.com/products/mike-flood.
HYSTEM-EXTRAN is a model for hydrodynamic simulation of sewer systems, see https://itwh.de/en/software-products/desktop/hystem-extran/.
https://colab.research.google.com/

References

Ahiablame, L., and R. Shakya. 2016. Modeling flood reduction effects of low impact development at a watershed scale. Journal of Environmental Management 171: 81–91.
Article Google Scholar
Aronica, G.T., F. Franza, P.D. Bates, and J.C. Neal. 2012. Probabilistic evaluation of flood hazard in urban areas using Monte Carlo simulation. Hydrological Processes 26(26): 3962–3972.
Article Google Scholar
Bates, P.D., M.S. Horritt, and T.J. Fewtrell. 2010. A simple inertial formulation of the shallow water equations for efficient two-dimensional flood inundation modelling. Journal of Hydrology 387(1–2): 33–45.
Article Google Scholar
Berkhahn, S., L. Fuchs, and I. Neuweiler. 2019. An ensemble neural network model for real-time prediction of urban floods. Journal of Hydrology 575: 743–754.
Article Google Scholar
Bermúdez, M., V. Ntegeka, V. Wolfs, and P. Willems. 2018. Development and comparison of two fast surrogate models for urban pluvial flood simulations. Water Resources Management 32(8): 2801–2815.
Article Google Scholar
Berz, G. 2000. Flood disasters: Lessons from the past—worries for the future. Proceedings of the Institution of Civil Engineers-Water and Maritime Engineering 142(1): 3–8.
Article Google Scholar
Bhola, P.K., J. Leandro, and M. Disse. 2018. Framework for offline flood inundation forecasts for two-dimensional hydrodynamic models. Geosciences 8(9): Article 346.
Cui, Z., X. Qing, H. Chai, S. Yang, Y. Zhu, and F. Wang. 2021. Real-time rainfall-runoff prediction using light gradient boosting machine coupled with singular spectrum analysis. Journal of Hydrology 603: Article 127124.
Dai, W., and Z. Cai. 2021. Predicting coastal urban floods using artificial neural network: The case study of Macau China. Applied Water Science 11(10): 1–11.
Article Google Scholar
Fang, J., W. Liu, S. Yang, S. Brown, R.J. Nicholls, J. Hinkel, X. Shi, and P. Shi. 2017. Spatial-temporal changes of coastal and marine disasters risks and impacts in Mainland China. Ocean & Coastal Management 139: 125–140.
Article Google Scholar
Ferguson, B.K., and P.W. Suckling. 1990. Changing rainfall-runoff relationships in the urbanizing peachtree creek watershed, Atlanta, Georgia. JAWRA Journal of the American Water Resources Association 26(2): 313–322.
Article Google Scholar
Frank, E., G. Sofia, and S. Fattorelli. 2011. Effects of topographic data resolution and spatial model resolution on hydraulic and hydro-morphological models for flood risk assessment. In Flood risk assessment and management, ed. S. Mambretti, and P. di Milano, 23–34. Southampton: WIT Press.
Chapter Google Scholar
Huang, G., X. Wang, and W. Huang. 2017. Simulation of rainstorm water logging in urban area based on InfoWorks ICM model. Water Resources and Power 35(2): 66–70.
Google Scholar
Jain, A., K. Nandakumar, and A. Ross. 2005. Score normalization in multimodal biometric systems. Pattern Recognition 38(12): 2270–2285.
Article Google Scholar
Kabir, S., S. Patidar, X. Xia, Q. Liang, J. Neal, and G. Pender. 2020. A deep convolutional neural network model for rapid prediction of fluvial flood inundation. Journal of Hydrology 590: Article 125481.
Ke, G., Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu. 2017. LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems 30: 3146–3154.
Google Scholar
Kim, H.I., and K.Y. Han. 2020. Urban flood prediction using deep neural network with data augmentation. Water 12(3): Article 899.
Lee, Y.-M., C.-M. Ko, S.-C. Shin, and B.-S. Kim. 2019. The development of a rainfall correction technique based on machine learning for hydrological applications. Journal of Environmental Science International 28(1): 125–135.
Article Google Scholar
Li, P., Q. Wu, and C. Burges. 2007. McRank: Learning to rank using multiple classification and gradient boosting. Advances in Neural Information Processing Systems 20: 897–904.
Google Scholar
Liang, W., S. Luo, G. Zhao, and H. Wu. 2020. Predicting hard rock pillar stability using GBDT, XGBoost, and LightGBM algorithms. Mathematics 8(5): Article 765.
Liu, C., S.Q. Yin, M. Zhang, Y. Zeng, and J.Y. Liu. 2014. An improved grid search algorithm for parameters optimization on SVM. Applied Mechanics and Materials 644: 2216–2219.
Article Google Scholar
Liu, Y., S. Zhang, L. Liu, X. Wang, and H. Huang. 2015. Research on urban flood simulation: A review from the smart city perspective. Progress in Geography 34(4): 494–504.
Google Scholar
Löwe, R., J. Böhm, D.G. Jensen, J. Leandro, and S.H. Rasmussen. 2021. U-FLOOD – Topographic deep learning for predicting urban pluvial flood water depth. Journal of Hydrology 603: Article 126898.
Ma, M., G. Zhao, B. He, Q. Li, H. Dong, S. Wang, and Z. Wang. 2021. XGBoost-based method for flash flood risk assessment. Journal of Hydrology 598: Article 126382.
McLachlan, G.J., K.-A. Do, and C. Ambroise. 2004. Analyzing microarray gene expression data. New York: Wiley.
Book Google Scholar
Nash, J.E., and J.V. Sutcliffe. 1970. River flow forecasting through conceptual models part I—A discussion of principles. Journal of Hydrology 10(3): 282–290.
Article Google Scholar
Nembrini, S., I.R. König, and M.N. Wright. 2018. The revival of the Gini importance?. Bioinformatics 34(21): 3711–3718.
Article Google Scholar
Nguyen, Q.-H., H.-D. Nguyen, D.T. Le, and Q.-T. Bui. 2022. Fine-tuning LightGBM using an artificial ecosystem-based optimizer for forest fire analysis. Forest Science. https://doi.org/10.1093/forsci/fxac039.
Article Google Scholar
Ogunleye, A., and Q.-G. Wang. 2019. XGBoost model for chronic kidney disease diagnosis. IEEE/ACM Transactions on Computational Biology and Bioinformatics 17(6): 2131–2140.
Article Google Scholar
Olbert, A.I., J. Comer, S. Nash, and M. Hartnett. 2017. High-resolution multi-scale modelling of coastal flooding due to tides, storm surges and rivers inflows. A Cork City example. Coastal Engineering 121: 278–296.
Google Scholar
Pontes, F.J., G. Amorim, P.P. Balestrassi, A. Paiva, and J.R. Ferreira. 2016. Design of experiments and focused grid search for neural network parameter optimization. Neurocomputing 186: 22–34.
Article Google Scholar
Ranka, S., and V. Singh. 1998. CLOUDS: A decision tree classifier for large datasets. In Proceedings of the 4th Knowledge Discovery and Data Mining Conference 2(8): 2–8.
Google Scholar
Reichstein, M., G. Camps-Valls, B. Stevens, M. Jung, J. Denzler, N. Carvalhais, and Prabhat. 2019. Deep learning and process understanding for data-driven earth system science. Nature 566(7743): 195–204.
Article Google Scholar
Sadler, J.M., J.L. Goodall, M.M. Morsy, and K. Spencer. 2018. Modeling urban coastal flood severity from crowd-sourced flood reports using poisson regression and random forest. Journal of Hydrology 559: 43–55.
Article Google Scholar
Sandri, M., and P. Zuccolotto. 2008. A bias correction algorithm for the Gini variable importance measure in classification trees. Journal of Computational and Graphical Statistics 17(3): 611–628.
Article Google Scholar
Sidek, L.M., A.S. Jaafar, W.H.A.W.A. Majid, H. Basri, M. Marufuzzaman, M.M. Fared, and W.C. Moon. 2021. High-resolution hydrological-hydraulic modeling of urban floods using InfoWorks ICM. Sustainability 13(18): Article 10259.
Stone, M. 1974. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Methodological) 36(2): 111–147.
Google Scholar
Thorndahl, S., J.E. Nielsen, and D.G. Jensen. 2016. Urban pluvial flood prediction: A case study evaluating radar rainfall nowcasts and numerical weather prediction models as model inputs. Water Science and Technology 74(11): 2599–2610.
Article Google Scholar
Varoquaux, G., L. Buitinck, G. Louppe, O. Grisel, F. Pedregosa, and A. Mueller. 2015. Scikit-learn: Machine learning without learning the machinery. GetMobile: Mobile Computing and Communications 19(1): 29–33.
Article Google Scholar
Wang, Q., P.-H. Wang, and Z.-G. Su. 2013. A hybrid search strategy based particle swarm optimization algorithm. In Proceedings of the 8th IEEE Conference on Industrial Electronics and Applications (ICIEA), 19–21 June 2013, Melbourne, Australia, 301–306.
Wu, H., and G. Huang. 2016. Risk assessment of urban waterlogging based on PCSWMM model. Water Resources Protection 32(5): 11–16.
Google Scholar
Wu, X., Z. Wang, S. Guo, W. Liao, Z. Zeng, and X. Chen. 2017. Scenario-based projections of future urban inundation within a coupled hydrodynamic model framework: A case study in Dongguan City, China. Journal of Hydrology 547: 428–442.
Article Google Scholar
Wu, Z., Y. Zhou, H. Wang, and Z. Jiang. 2020. Depth prediction of urban flood under different rainfall return periods based on deep learning and data warehouse. Science of the Total Environment 716: Article 137077.
Xu, H., K. Xu, J. Lian, and C. Ma. 2019. Compound effects of rainfall and storm tides on coastal flooding risk. Stochastic Environmental Research and Risk Assessment 33(7): 1249–1261.
Article Google Scholar
Xu, H., X. Zhang, X. Guan, T. Wang, C. Ma, and D. Yan. 2022. Amplification of flood risks by the compound effects of precipitation and storm tides under the nonstationary scenario in the coastal city of Haikou, China. International Journal of Disaster Risk Science 13(4): 602–620.
Article Google Scholar
Yamazaki, D., S. Kanae, H. Kim, and T. Oki. 2011. A physically based description of floodplain inundation dynamics in a global river routing model. Water Resources Research 47(4): Article W04501.
Zanchetta, A.D., and P. Coulibaly. 2020. Recent advances in real-time pluvial flash flood forecasting. Water 12(2): Article 570.
Zevenbergen, C., W. Veerbeek, B. Gersonius, and S. Van Herk. 2008. Challenges in urban flood management: travelling across spatial and temporal scales. Journal of Flood Risk Management 1(2): 81–88.
Article Google Scholar

Download references

Acknowledgments

This study was supported by the State Key Laboratory of Hydraulic Engineering Simulation and Safety (Tianjin University) (Grant Number HESS-2106), Scientific and Technological Projects of Henan Province (Grant Number 222102320025), Key Scientific Research Project in Colleges and Universities of Henan Province of China (Grant Number 22B570003), National Natural Science Foundation of China (Grant Number 52109040, 51739009), Excellent Youth Fund of Henan Province of China (212300410088), and Science and Technology Innovation Talents Project of Henan Education Department of China (21HASTIT011). Additionally, our cordial gratitude should be extended to the editor and anonymous reviewers for their professional and pertinent comments and suggestions, which were greatly helpful for further quality improvement of this manuscript.

Author information

Authors and Affiliations

State Key Laboratory of Hydraulic Engineering Simulation and Safety, Tianjin University, Tianjin, 300350, China
Kui Xu & Zhentao Han
Yellow River Laboratory, Zhengzhou University, Zhengzhou, 450001, China
Hongshi Xu
School of Geographic and Environmental Sciences, Tianjin Normal University, Tianjin, 300387, China
Lingling Bin

Authors

Kui Xu
View author publications
You can also search for this author in PubMed Google Scholar
Zhentao Han
View author publications
You can also search for this author in PubMed Google Scholar
Hongshi Xu
View author publications
You can also search for this author in PubMed Google Scholar
Lingling Bin
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Hongshi Xu.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Xu, K., Han, Z., Xu, H. et al. Rapid Prediction Model for Urban Floods Based on a Light Gradient Boosting Machine Approach and Hydrological–Hydraulic Model. Int J Disaster Risk Sci 14, 79–97 (2023). https://doi.org/10.1007/s13753-023-00465-2

Download citation

Accepted: 21 December 2022
Published: 09 February 2023
Issue Date: February 2023
DOI: https://doi.org/10.1007/s13753-023-00465-2

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Rapid Prediction Model for Urban Floods Based on a Light Gradient Boosting Machine Approach and Hydrological–Hydraulic Model

Abstract

Similar content being viewed by others

Machine learning algorithm for flash flood prediction mapping in Wadi El-Laqeita and surroundings, Central Eastern Desert, Egypt

Classification machine learning models for urban flood hazard mapping: case study of Zaio, NE Morocco

Integration of extreme gradient boosting feature selection approach with machine learning models: application of weather relative humidity prediction

1 Introduction