1 Introduction

Floods are natural hazards that can occur anywhere in the world (Akash et al. 2023; Balica et al. 2023; Eilander et al. 2023; Morote et al. 2023). However, certain regions are more vulnerable to flooding than others (Balica et al. 2009; Sheehan 2022). Low-lying areas and areas near rivers and coastlines are especially vulnerable (Nicholls et al. 2007; Grases et al. 2020; Hsiao et al. 2021). According to research conducted by the United Nations University, 80% of flood-related losses worldwide are reported in nations where substantial populations reside in low-lying areas or adjacent to coastlines (Dilley 2005). The risk of flooding in these regions can be aggravated by climate change, with rising sea levels and more frequent extreme weather events increasing the probability of flooding (Sharma and Ravindranath 2019). Floods can cause significant damage to infrastructure, homes and crops, as well as loss of life (Mishra and Shah 2018). Additionally, floods can have long-term economic consequences, such as trade and transportation disruptions, and their environmental implications include soil erosion and water pollution. Therefore, it is imperative that governments and communities in vulnerable regions take proactive steps to minimise the risk of flooding, including improving drainage systems, building flood barriers, and relocating communities from flood-prone areas (Ward et al. 2020).

The Sultanate of Oman (Oman) experiences occasional flooding because of its geography and climate. Because of its mountainous terrain and location on the Arabian Peninsula, the country is vulnerable to heavy rain and flash flooding. In recent years, Oman has experienced several significant floods that have damaged infrastructure, homes, and businesses. In 2007, Cyclone Gonu caused widespread flooding and landslides, leading to the evacuation of thousands of people and significant damage to roads, bridges, and buildings (Fritz et al. 2010). In 2018, Cyclone Mekunu caused severe flooding in southern Oman, resulting in several fatalities and extensive damage to homes, infrastructure, and agricultural land (Müller et al. 2020). In addition to cyclones, Oman also experiences seasonal monsoon rains that can cause flooding in low-lying areas. In 2021, heavy rains led to flooding in several parts of the country, resulting in road closures, disrupted transportation, and damaged infrastructure (Sun et al. 2022).

The impact of Cyclone Shaheen on the Arabian Peninsula was immense, causing widespread destruction. This event was a sobering reminder of the critical need for robust preparedness measures and effective disaster management systems to mitigate the devastating consequences of extreme weather events (Terry et al. 2022). Specifically, the cyclone’s path across the northern part of Oman, including Al-Batinah in the south, led to heavy rainfall and flooding (Al Maghawry 2022; Almufarji and Husin 2022; Al Nadhairi et al. 2023). Tropical cyclones pose a sporadic yet significant threat to Oman, because such events can cause extensive damage and flooding throughout the country. In response, Oman has implemented a comprehensive national disaster management system that encompasses early warning systems, evacuation plans, and emergency response teams. Nonetheless, the vulnerability of the country to flooding and its associated impacts remains a persistent concern, highlighting the continual need for monitoring and preparedness measures to mitigate the effects of natural disasters.

Despite this urgent requirement, few studies in the literature have focused on the variables and factors that contribute to flooding and its associated risks in Oman (Al-Naamani 2016). This knowledge gap underscores the need for scientific enquiry to further our understanding of the fundamental mechanisms underlying flooding events in the country. Understanding the variables that exacerbate flooding in arid and semiarid countries is a pressing concern for several reasons. First, the countries in these regions often lack the infrastructure and resources necessary to cope with the disastrous effects of floods. As a result, flooding causes substantial loss of life, damages infrastructure and property, and displaces individuals. Second, climate change continues to increase the frequency and intensity of extreme weather events, including floods, in arid and semiarid regions. Third, an in-depth understanding of the variables that exacerbate flooding can inform policy decisions and guide governments in prioritising investments in disaster preparedness and response. The development of effective early warning systems, evacuation plans, and emergency response teams can significantly reduce the damage caused by floods. However, such measures must be specific to the conditions in each region to account for the factors that cause flooding. Therefore, it is essential to conduct studies that focus on the variables and factors that lead to flooding in an effort to better comprehend the mechanisms underlying these events in arid and semiarid countries.

Geographic Information Systems (GIS) have found extensive applications in hydrology and related fields, including water quality assessment (Habeeb and Weli 2021), urban heat island analysis (Liu et al. 2017), climate change impact studies (Henry 2023), drought monitoring (Adedeji et al. 2020), hydrological modelling (Thakur et al. 2017), and flood analysis (Kourgialas and Karatzas 2011), among others. This technology offers a powerful toolkit for the spatial analysis and visualisation of hydrological data, enabling researchers to gain insights into complex hydrological processes (Wang et al. 2015a). GIS, remote sensing, and machine learning methods are essential tools for modelling floods because of their ability to analyse and process large amounts of spatial data (Das 2018; Shahabi et al. 2020; Guptha et al. 2021, 2022; Swain et al. 2022; Talukdar et al. 2022a; Patel et al. 2022). Coulibaly et al. (2000) employed a hierarchical modelling approach that used neural networks and GIS to analyse the spatial variability and hydrological responses in a watershed, demonstrating its effectiveness in predicting streamflow and flood events. Domeneghetti et al. (2018) combined hydraulic and machine learning models to map the extent of flooding in the Secchia River basin, demonstrating the potential of machine learning algorithms to improve the accuracy and efficiency of flood-extent mapping. Chen et al. (2020a, b) utilised machine learning algorithms to map floods based on synthetic aperture radar (SAR) data in areas where traditional methods are not feasible. These studies showed that machine learning techniques were effective in mapping floods in complex environments.

The utilisation of GIS and remote-sensing methodologies to model flooding in Oman remains limited (Hoque et al. 2017). However, the emergence of machine learning techniques can play a pivotal role in the management of future cyclone-related flood risks in the region. A comprehensive and accurate understanding of flood risks and their potential effects can be realised through the integration of GIS, remote sensing, and machine learning. GIS can store, manipulate, and analyse geospatial data, remote sensing can provide up-to-date and high-resolution data, and machine learning can be used to analyse substantial amounts of data to identify patterns and develop accurate flood models. By synergising these modalities, valuable information can be gleaned for use in effective disaster management in regions frequently afflicted by floods, such as Oman.

The present study assesses flood susceptibility in Wilayat As-Suwayq, Al-Batinah North Governorate, Oman, by bringing together a comprehensive consideration of various environmental factors. Although extensive regional and global research has been conducted to evaluate cyclone risks along the Oman Sea coasts (Al-Hinai and Abdalla 2021; Mansour et al. 2021), a significant knowledge gap exists. To the best of our knowledge, there is a limited body of research that has leveraged machine learning techniques to identify variables for mapping flood-prone areas, both in the broader Al-Batinah North Governorate (Moradi et al. 2021) and the specific As-Suwayq region (Elmahdy et al. 2020). In the current study, we employ advanced machine learning methods, including extreme gradient boosting (XGB), random forest (RF), and CatBoost (CB), to model and predict flood condition factors and assess their efficacy in mapping regions prone to flooding. The research harnesses two distinct datasets to investigate these flood-conditioning factors comprehensively. The first dataset encompasses variables, such as altitude, slope, aspect, curvature, stream power index (SPI), topographic wetness index (TWI), and topographic roughness index (TRI). The second dataset integrates LiDAR-derived factors with geological, soil, land use/land cover (LULC), distance from roads and distance from rivers. This multidimensional approach aims to provide a holistic understanding of flood susceptibility within the study area.

2 Materials and Methods

2.1 Study Area

Wilayat As-Suwayq is positioned in the Al-Batinah North Governorate (23.8262° N, 57.4288° E), situated in the northern part of Oman as illustrated in Fig. 1. According to the Directorate General of Meteorology in Oman (2022), As-Suwayq receives an average annual rainfall of 109 mm. The wettest months occur from December to March, while the driest period extends from May to September (https://met.gov.om/). The average annual maximum temperature in As-Suwayq is 34 °C, reaching a daily maximum exceeding 49 °C during the summer months, while the mean minimum annual temperature stands at 22 °C. As-Suwayq spans an area of approximately 1000 km², with a total cultivated land of 80 km². A study conducted by Choudri et al. (2013) revealed that As-Suwayq exhibits the highest population density and intensive agriculture in Al-Batinah North, utilising the largest amount of groundwater, primarily for irrigation purposes. The predominant crops in As-Suwayq include squash, Rhodes grass, banana, date palm, alfalfa, and tomato (Al-Aufi et al. 2020). The agricultural season in this region spans from October to April, aligning with temperature and rainfall patterns (Alshemmari et al. 2023). The primary rainfall season in northern Oman occurs between December and April. Table 1 illustrates the primary data used to study the role of the key conditioning factors in flood susceptibility mapping through machine learning approaches in the study area.

Fig. 1 a Flood area location (As-Suwayq) and b elevation insights

Table 1 Primary research data

2.2 Dataset Preparation for Spatial Modelling

2.2.1 Elevation

Elevation is used to model flood risk because it affects the flow of water during flood events. When water flows downhill, it follows the path of least resistance, which is usually the path of steepest descent (Scotti et al. 2020). Hence, areas with higher elevations are generally less prone to flooding than areas with lower elevations. However, even areas at high elevations can be at risk of flooding if they are located in basins or local depressions. In addition, areas at lower elevations are more likely to experience severe flooding, particularly where water cannot easily drain away. Digital elevation models (DEMs) are commonly used to represent the elevation of the earth’s surface, and they can be combined with other spatial information, such as hydrological and land-cover data, to model the flow of water during flood events (Muthusamy et al. 2021). Based on elevation data, GIS and modelling tools can identify areas that are likely to be affected by flooding, estimate the extent and severity of flooding, and assess potential effects on infrastructure, communities, and the environment. Accurate elevation data are therefore essential both for modelling flood risk and for developing effective flood risk management strategies. In the present study, LiDAR data acquired from the National Survey Authority in Oman (http://nasom.org.om, accessed 22 October 2022) were used to produce a DEM with a 5 m spatial resolution. The resulting DEM was used to derive thematic maps of the flood conditioning variables, demonstrating the usefulness of high-resolution elevation data in flood risk modelling (Fig. 2a).

Fig. 2 Conditioning variables: a elevation, b NDVI, c distance from drainage, d drainage density, e distance from roads, f geological map, g soil map, h rainfall map, i TRI, j TWI, k curvature, l slope, and m SPI

2.2.2 Normalised Difference Vegetation Index (NDVI) Using Sentinel-2 Data

The normalised difference vegetation index (NDVI) is a valuable input for assessing flood risk, because it provides information about the condition and amount of vegetation cover in a particular area. The NDVI is a remote-sensing index based on the difference between the reflectance of near-infrared and red light from vegetation, normalised by their sum. The index ranges from −1 to 1, and higher values indicate healthier and more abundant vegetation cover. In flood risk management, vegetation is an important factor because of its ability to absorb and slow water movement, prevent soil erosion, and maintain the stability of riverbanks and floodplains (Wang et al. 2015b; Soltani et al. 2021). Areas with higher vegetation cover are generally less prone to flooding, while areas with lower vegetation cover are more likely to experience flooding and soil erosion. Therefore, the NDVI can be used to identify areas that are at a high risk of flooding and to assess the potential effects of floods on vegetation cover (Farhadi and Najafzadeh 2021).
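The index described above follows the standard normalised-difference form, NDVI = (NIR − Red)/(NIR + Red). The sketch below is a minimal illustration only: the function name, the zero-signal convention, and the reflectance values are assumptions and are not taken from the study. For Sentinel-2, the red and near-infrared inputs would typically be bands 4 and 8.

```python
def ndvi(nir: float, red: float) -> float:
    """Return NDVI = (NIR - Red) / (NIR + Red) for a single pixel."""
    total = nir + red
    if total == 0:
        # Assumption: pixels with no signal in either band are assigned 0.
        return 0.0
    return (nir - red) / total


# Hypothetical surface reflectances (illustrative, not data from the study).
dense_vegetation = ndvi(nir=0.50, red=0.10)  # strongly positive
bare_soil = ndvi(nir=0.30, red=0.25)         # close to zero
open_water = ndvi(nir=0.02, red=0.06)        # negative
```

Vegetated pixels reflect strongly in the near-infrared and absorb red light, so they score high, whereas water scores below zero.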

In flood modelling, combining NDVI-derived data with other spatial data, such as elevation and hydrological measurements, can assist in modelling the flow of water during flood events and in identifying flood-prone areas. Moreover, remote sensing and GIS tools can utilise NDVI data to identify regions with high vegetation cover, which are vital in planning flood mitigation and adaptation measures, such as green infrastructure and wetland restoration. In the present study, Sentinel-2 satellite data were used to calculate the NDVI in the study area (Fig. 2b). Sentinel is an earth observation satellite programme designed, managed and launched by the European Space Agency. Sentinel-2A and B are multispectral, high-resolution land observation satellites that capture images in 13 bands and at multiple spatial resolutions. The present study used a cloud-free image and classified it into four major land-use classes: vegetation, bare land, developed land, and water bodies (https://www.sentinel-hub.com/, accessed on 10 October 2022). The accuracy of the land-use classification was calculated as 96% using the Kappa index.
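A Kappa index of the kind mentioned above is usually computed from a confusion matrix of reference versus classified samples. The following sketch shows Cohen's kappa under that assumption; the function name and the example matrix are hypothetical and do not reproduce the study's accuracy assessment.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows: reference, cols: predicted)."""
    n = sum(sum(row) for row in matrix)
    # Observed agreement: share of samples on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    # Agreement expected by chance, from the row and column totals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical two-class example: 90 of 100 samples agree.
example = [[45, 5],
           [5, 45]]
kappa = cohen_kappa(example)  # 0.8 for this matrix
```

Kappa discounts the agreement that would be expected by chance, which is why it is lower than the raw 90% agreement in this example.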

2.2.3 Distance from Drainage

The distance from drainage systems, such as rivers, streams, and canals, is crucial in flood risk modelling. GIS and modelling tools can use data on drainage networks and elevations to identify high-risk areas, because areas close to drainage channels are more likely to experience flooding if the drainage system is overwhelmed by water (Kazakis et al. 2015). Integrating drainage distance data into these tools makes it possible to identify flood-prone areas, estimate flooding severity, and evaluate potential impacts on infrastructure, communities, and the environment. This information supports flood risk mapping, land-use planning, and infrastructure design, and it helps in developing effective flood risk management strategies. To model the distance from drainage systems in the study area, the Euclidean distance method in ArcGIS Pro was utilised (Fig. 2c).
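Conceptually, a Euclidean distance surface assigns each raster cell the straight-line distance to the nearest source cell. The brute-force sketch below illustrates the idea on a small grid; it is not the ArcGIS Pro implementation, and the function name, mask layout, and 5 m cell size are assumptions made for illustration.

```python
import math


def euclidean_distance_raster(stream_mask, cell_size):
    """Distance (in map units) from each cell centre to the nearest stream cell."""
    sources = [(r, c)
               for r, row in enumerate(stream_mask)
               for c, is_stream in enumerate(row) if is_stream]
    if not sources:
        raise ValueError("stream_mask contains no stream cells")
    return [[min(math.hypot(r - sr, c - sc) for sr, sc in sources) * cell_size
             for c in range(len(row))]
            for r, row in enumerate(stream_mask)]


# Hypothetical 2 x 3 grid with one stream cell in the top-left corner.
mask = [[1, 0, 0],
        [0, 0, 0]]
distance = euclidean_distance_raster(mask, cell_size=5.0)
```

The same function applies to a road mask, which is how a distance-from-roads layer could be sketched.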

2.2.4 Drainage Density

A drainage density map indicates the number of stream channels per unit area in a watershed affecting water movement during flood events (Ogden et al. 2011). GIS and modelling tools use drainage network data to simulate flood flow, identify high-risk areas, estimate flood severity, and assess potential impacts on infrastructure, communities, and the environment (Levavasseur et al. 2015). The relationship between drainage density and flooding is crucial for flood risk maps, land-use planning, and infrastructure design. It helps identify high-risk areas and estimate flood effects, providing valuable insights into effective flood risk management strategies. Incorporating drainage density data into flood models is essential (Ogden et al. 2011; Mahmoud and Gan 2018). In the present study, drainage density was calculated using a DEM as the input data in ArcGIS Pro 3.0 (Fig. 2d).

2.2.5 Distance from Roads

The distance of an area from roads is a commonly used factor in flood modelling (Tehrany et al. 2019; Khosravi et al. 2020; Al-Hinai and Abdalla 2021), because it can influence the distribution of runoff and flooding. Roads can significantly affect water flow during floods, because they can act as barriers or can direct water into concentrated channels, thus increasing the risk of flooding in specific areas. In flood modelling, in addition to data on slope, elevation, and land cover, the distance of an area from roads is a conditioning variable utilised to generate accurate flood susceptibility maps (Gudiyangada Nachappa et al. 2020). These maps are useful in identifying areas that are the most vulnerable to flooding, thus facilitating the development of flood mitigation strategies and emergency preparedness plans. The distance of an area from roads can also provide insights into the effects of urbanisation and changes in land use on flood risk, because the proximity of roads to water sources can increase with development and infrastructure expansion, potentially exacerbating flood risk in surrounding areas. Therefore, the distance from roads is a critical factor to consider when evaluating flood risk and designing flood management strategies. To model the distance from roads in the study area, the Euclidean distance method was employed in ArcGIS Pro 3.0 (Fig. 2e).

2.2.6 Geological Map

Geological maps offer crucial insights for flood modelling and risk assessment. They reveal the distribution and characteristics of rocks and soils, which can significantly affect water flow during floods, and they therefore aid in identifying flood-prone areas and developing effective risk management strategies (Kourgialas and Karatzas 2011). For example, understanding soil properties can help determine the mitigation measures that are the most suitable in a particular area (Aslan et al. 2005; Donnelly et al. 2006). By showing how geological formations influence water movement, storage, and infiltration, including the locations of different soil types, rock units, and sediment buildup, these maps can help mitigate flood damage. They can also be combined with flood modelling to develop effective flood management strategies.

In the present study, a geological map was used as an input to the flood model. The study area encompassed four distinct geological types: Quaternary surficial deposits, Sumeini and Hawasina, Samail ophiolite, and Tertiary sedimentary cover (Fig. 2f). Each type represented a unique geological formation found in the study area, providing valuable information for understanding the area’s geology and potential natural resources. The geological map was obtained from the Geological Survey of Oman (www.pdo.co.om/).

2.2.7 Soil Map

Soil type significantly impacts flood modelling by affecting groundwater absorption and runoff potential. Soil maps help identify areas that have high runoff potential or low infiltration rates, aiding flood risk assessments and floodplain management strategies (Haghizadeh et al. 2017). Combining soil maps with other factors, such as slope, land use and topography, helps develop accurate flood susceptibility maps (Al-Juaidi et al. 2018). Soil maps can also be used to identify areas where infiltration-enhancing techniques, such as rainwater harvesting, green infrastructure, and soil conservation practices, may be effective in reducing flood risk. Gravel, loam, rock, and sand are among the soil types examined in the present study (Fig. 2g). The soil type map was obtained from the MAFWR (https://www.maf.gov.om/, accessed 10 September 2021).

2.2.8 Rainfall

Precipitation is a key factor in flood risk modelling, because it causes rivers and water bodies to overflow, affecting the severity and extent of flooding (Jenkins et al. 2018; Pham et al. 2022). Rainfall can be incorporated into flood risk models using historical records, real-time monitoring, and climate projections. Instruments such as rain gauges, radar, and satellites collect rainfall data, which are then analysed using GIS to identify flood-prone areas (Anusha and Bharathi 2020). This information can be combined with other data, such as topography, land cover, and soil type, to help model potential flood events and to support early warning systems and emergency response planning (Liu et al. 2022), which in turn helps mitigate flooding effects and save lives. In the present study, rainfall data from 1975 to 2021 for two stations in the city were obtained from the website of the Civil Aviation Authority of Oman (http://met.gov.om, accessed 10 May 2022). Using ArcGIS Pro 3.0, inverse distance weighted (IDW) interpolation was employed to generate a thematic map of the data, as shown in Fig. 2h.
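Inverse distance weighted interpolation estimates the value at an unsampled location as a weighted mean of the station values, with weights that decay with distance. The sketch below is a minimal illustration of that idea; the power of 2, the station coordinates, and the rainfall values are assumptions and are not the settings or data used in the study.

```python
import math


def idw(x, y, stations, power=2.0):
    """Inverse distance weighted estimate at (x, y).

    stations: iterable of (station_x, station_y, value) tuples.
    power: distance exponent (assumed to be 2 here, a common default).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, value in stations:
        d = math.hypot(x - sx, y - sy)
        if d == 0:
            return value  # exactly at a station: return its observation
        w = 1.0 / d ** power
        weighted_sum += w * value
        weight_total += w
    return weighted_sum / weight_total


# Two hypothetical stations 10 km apart with mean annual rainfall in mm.
stations = [(0.0, 0.0, 100.0), (10.0, 0.0, 120.0)]
midpoint_estimate = idw(5.0, 0.0, stations)  # 110.0, equal weights
```

With only two stations, every interpolated value lies between the two observations, so the resulting surface is necessarily smooth and coarse.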

2.2.9 Topographic Roughness Index (TRI)

The TRI measures variations in elevation within a given area and is calculated using the following equation:

$$\text{TRI} = \frac{\sum_{i=1}^{n} \left| h_{i} - h_{\text{avg}} \right|}{n},$$
(1)

where h_i is the elevation of point i, Σ|h_i − h_avg| is the sum of the absolute differences between the elevation of each point in the area and the average elevation of the entire area, h_avg is the average elevation of the entire area, and n is the total number of points in the area (Hojati and Mokarram 2016; Mojaddadi et al. 2017). ArcGIS Pro 3.0 was used to calculate the TRI from DEMs using terrain analysis tools. Areas with higher TRI values had a more complex topography that trapped and slowed water flow, increasing the risk of flooding (Fig. 2i). Therefore, the TRI is an important measure in modelling flood risk areas. It can be used in conjunction with other data, such as land use, drainage, and rainfall data, to develop comprehensive flood risk models (Pourali et al. 2016).
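Equation 1 is a mean absolute deviation of elevation and can be evaluated directly. The sketch below applies it to a list of elevations; the function name and the sample values are illustrative, and applying the formula to a small moving window of DEM cells is an assumption about how such a raster would be produced.

```python
def tri(elevations):
    """Topographic roughness index per Eq. 1: mean absolute deviation from the mean."""
    n = len(elevations)
    h_avg = sum(elevations) / n
    return sum(abs(h - h_avg) for h in elevations) / n


# Hypothetical elevations (m) for a flat and a rugged neighbourhood.
flat = tri([10.0, 10.0, 10.0])             # 0.0
rugged = tri([1.0, 2.0, 3.0, 4.0, 5.0])    # 1.2
```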

2.2.10 Topographic Wetness Index (TWI)

The TWI is commonly used in flood modelling. It is a topographical index that characterises the relative wetness of a landscape based on its topography. The TWI is calculated as the natural logarithm of the ratio of the specific catchment area (the upslope contributing area per unit contour length) to the tangent of the local slope angle. Higher values indicate wetter areas (Aksoy et al. 2016).

The TWI parameter, which is closely related to soil moisture content, is crucial for mapping flood-prone areas. Combining it with other factors, such as slope, curvature, aspect and elevation, yields flood susceptibility maps that aid in mitigation strategies. The following two equations are used to calculate the TWI: TWI = ln(a/tan β) and TWI = ln(a/(tan β + 0.05)). The specific catchment area (a) represents the area contributing to a particular point on the landscape per unit of contour length, and β represents the slope angle at that point. The first equation produces a TWI value that is proportional to the potential for soil moisture retention and hydrologic processes. The second equation resembles the first but adds a small constant (0.05) to the denominator to prevent division by zero in flat regions. The second equation was applied to compute the TWI in the study region, and the resulting distribution is shown in Fig. 2j.
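The two TWI expressions differ only in the constant added to tan β, so they can be sketched as one function with an optional offset. The function and parameter names are hypothetical, and the input values are illustrative rather than taken from the study area.

```python
import math


def twi(specific_catchment_area, slope_deg, offset=0.0):
    """TWI = ln(a / (tan(beta) + offset)).

    offset=0.0 gives the first expression in the text;
    offset=0.05 gives the second, which stays finite on flat cells.
    """
    tan_beta = math.tan(math.radians(slope_deg))
    return math.log(specific_catchment_area / (tan_beta + offset))


steep_cell = twi(100.0, slope_deg=45.0)              # ln(100 / 1), about 4.61
flat_cell = twi(100.0, slope_deg=0.0, offset=0.05)   # ln(2000), about 7.60
```

Calling `twi(100.0, 0.0)` without the offset raises `ZeroDivisionError`, which is the failure the 0.05 term is meant to avoid.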

2.2.11 Curvature

Curvature plays a vital role in modelling flood risk because it affects the flow of water across the landscape. Curvature refers to the degree of convexity or concavity of the land surface, which can be determined through an analysis of DEM data (Chapi et al. 2017). Curvature affects the flow of water and the amount of erosion during flood events. In areas with a high positive curvature, the land surface is convex, that is, it bulges upward and sheds water. This leads to faster water flow and an increased risk of erosion. Conversely, areas with high negative curvature are concave, so the land surface dips inward and collects water, leading to slower water flow and an increased risk of ponding.

By integrating curvature data into flood risk models, areas that are prone to flooding based on the shape of the terrain can be identified. This information can guide land-use planning, infrastructure design, and emergency response planning. Moreover, curvature data can be utilised to simulate water flow during flood events in hydraulic models that simulate various scenarios, such as different levels of rainfall or varying degrees of urbanisation. The curvature data in the present study area were derived from a DEM with a 5 m resolution, as shown in Fig. 2k.

2.2.12 Slope

Slope is an important factor in modelling floods, because it affects the speed and direction of water flow. Slope refers to the degree of steepness of the land surface. It is calculated using DEM data (Ajmal et al. 2020). In general, areas with higher slopes tend to be associated with faster water flow and an increased risk of erosion, while flatter areas may be more prone to ponding and slower water flow. However, the effects of slope on flooding can vary depending on other factors, such as land use and soil type (Brandt 2016).

Various techniques can be used in GIS to incorporate slope data into flood risk models. A common method is to calculate slopes using the DEM and then to combine this information with other relevant data, such as land use and soil type. This can help identify areas that are at higher risk of flooding because of the slope of the land, as well as other factors that may affect water flow. In addition, slope data can be used to model the flow of water during flood events. For example, hydraulic models can be developed to simulate water flow in different scenarios, such as varying levels of rainfall and diverse amounts of urbanisation. This information can help identify areas that are at risk of flooding and inform emergency response planning. In the present study, a DEM with a 5 m resolution was used to calculate the slope in the study area (Fig. 2l).
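Slope can be derived from a DEM by estimating the elevation gradient in the x and y directions and converting its magnitude to an angle. The sketch below uses simple central differences on an interior cell, which is an assumption made for brevity and is not necessarily the scheme used by ArcGIS Pro; the function name and the DEM values are illustrative.

```python
import math


def slope_degrees(dem, row, col, cell_size):
    """Slope angle (degrees) at an interior DEM cell using central differences."""
    dz_dx = (dem[row][col + 1] - dem[row][col - 1]) / (2 * cell_size)
    dz_dy = (dem[row + 1][col] - dem[row - 1][col]) / (2 * cell_size)
    return math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))


# Hypothetical 3 x 3 DEM (m) on a 5 m grid: rises 5 m per cell towards the east.
dem = [[0.0, 5.0, 10.0],
       [0.0, 5.0, 10.0],
       [0.0, 5.0, 10.0]]
centre_slope = slope_degrees(dem, 1, 1, cell_size=5.0)  # rise equals run: 45 degrees
```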

2.2.13 Stream Power Index (SPI)

The SPI is commonly used in flood modelling, particularly to predict flash floods in mountainous areas (Turoğlu and Dölek 2011; Ngo et al. 2018). The SPI is a measure of the potential amount of energy available to move water and sediment through a channel, which is calculated based on the slope, drainage area, and hydraulic radius of a river or stream. In flood modelling, the SPI can be combined with other conditioning factors, such as slope, elevation, and land cover, to develop accurate flood susceptibility maps. SPI is particularly useful in mountainous regions, where rapid runoff and steep slopes can lead to flash flooding. By incorporating SPI into flood models, researchers can better understand the effects of water flow and sediment transport on the potential for flooding in a particular area.

In flood modelling, the SPI has been demonstrated to be effective, because it is a useful metric for characterising the hydraulic and sediment transport properties of a river or stream. By integrating the SPI into flood models, researchers and planners can gain a better understanding of the potential for flooding in a given area and, hence, develop effective strategies to mitigate flood risk. In the present study, a DEM was utilised to calculate the SPI in ArcGIS Pro 3.3 (Fig. 2m).
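The text does not state the expression used for the SPI. One formulation that is widely used in raster terrain analysis is SPI = a · tan β, where a is the specific catchment area and β is the local slope angle. The sketch below assumes that formulation purely for illustration; it should not be read as the calculation performed in the study, and the function name and inputs are hypothetical.

```python
import math


def stream_power_index(specific_catchment_area, slope_deg):
    """SPI = a * tan(beta), an assumed formulation (not confirmed by the text)."""
    return specific_catchment_area * math.tan(math.radians(slope_deg))


# Hypothetical cells with the same contributing area but different slopes.
steep_spi = stream_power_index(100.0, 45.0)  # about 100
flat_spi = stream_power_index(100.0, 0.0)    # 0.0
```

Under this formulation, the SPI grows with both contributing area and slope, whereas the TWI grows with contributing area but falls with slope.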

2.2.14 Flood Data

To represent the key factors that contribute to flooding in the area under investigation, we utilised the random point feature in ArcGIS Pro 3.0 to generate both flood and nonflood points. Flood points represent locations that are prone to flooding, whereas nonflood points represent locations that are normally considered flood-free. However, these designations can vary over time due to variables such as urbanisation, climate change, and infrequent catastrophic events, making flood risk a dynamic concept. The total number of points generated was 446, of which 311 represented floods and 135 represented nonfloods. We randomly divided the dataset into two subsets: 70% for training and 30% for validation. Figure 3 summarises the methodology and data employed to evaluate the vulnerability of the study area to floods.
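A random 70/30 partition of the 446 points can be sketched as follows. The point counts come from the text; the seed, the rounding rule, and the function name are assumptions, and the sketch does not stratify by class, which the text does not specify either way.

```python
import random


def train_validation_split(points, train_fraction=0.7, seed=42):
    """Shuffle the points and split them into training and validation subsets."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)  # seed is an arbitrary assumption
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# 311 flood points (label 1) and 135 nonflood points (label 0), as reported.
labels = [1] * 311 + [0] * 135
points = list(enumerate(labels))  # (point_id, label) pairs
train, validation = train_validation_split(points)
```

With these counts, the split yields 312 training points and 134 validation points.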

Fig. 3 Flowchart illustrating the sequential steps involved in the data analysis process within the study area

2.3 Multicollinearity Analysis

To detect multicollinearity, we first examined the correlation matrix of the predictor variables to identify high correlation coefficients. This initial step allowed us to identify pairs of variables that were highly correlated and could potentially lead to multicollinearity issues. We further quantified the degree of multicollinearity using variance inflation factors (VIF) for each predictor variable (Daoud 2017). The VIF values provided insights into the extent to which the estimated coefficients were affected by multicollinearity. Variables with high VIF values were considered to be candidates for removal or consolidation to improve the stability of our regression model.
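For standardised predictors, the VIF of each variable equals the corresponding diagonal element of the inverse of the predictor correlation matrix, which ties the two steps above together. The sketch below uses that identity with plain Python; the function names and the toy predictor values are hypothetical, and no VIF threshold is implied because the text does not state one.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """Map each predictor name to its VIF (diagonal of the inverse correlation matrix)."""
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}


# Hypothetical predictors with a correlation of 0.8.
toy = {"slope": [1.0, 2.0, 3.0, 4.0, 5.0], "tri": [2.0, 1.0, 4.0, 3.0, 5.0]}
toy_vif = vif(toy)  # both values equal 1 / (1 - 0.8**2), about 2.78
```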

2.4 Machine Learning Methods

Extreme gradient boosting (XGB), random forest (RF), and CatBoost (CB) were selected to model and predict flood conditioning factors and to assess their efficacy in mapping regions prone to flooding. XGBoost was chosen for its proven robustness and efficiency in handling structured data, which is highly relevant to the structured nature of our hydrological dataset (Ma et al. 2021; Jarajapu et al. 2022). RF, as an ensemble method, was selected for its versatility in handling both classification and regression tasks effectively. Importantly, RF is known for its ability to mitigate overfitting, which is a critical concern in hydrological and flood modelling (Chen et al. 2020b). CatBoost, a newer addition to the gradient boosting family, was incorporated for its advantages in managing categorical features and delivering robust performance without requiring extensive hyperparameter tuning. To address potential issues related to model generalisation and overfitting, we implemented a set of measures. First, we adopted k-fold cross-validation as part of our methodology, allowing us to assess the models’ generalisation performance and estimate how well they would perform on unseen data. Furthermore, we conducted thorough hyperparameter tuning, which is a crucial step in striking the right balance between under- and overfitting. This process involved optimising parameters, such as learning rates, tree depths, and regularisation terms. We also emphasised feature engineering to ensure that our models were not overly complex and did not suffer from high dimensionality. Moreover, to comprehensively evaluate model performance and make informed choices, we conducted a rigorous model comparison. This comparison involved assessing the performance of XGB, RF, and CB on various metrics, including accuracy, precision, recall, and F1-score. By undertaking this analysis, we were able to identify the most suitable model for our specific hydrological prediction task.
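The evaluation procedure described above combines k-fold partitioning with four classification metrics. The sketch below illustrates both pieces in plain Python; the value of k, the function names, and the toy labels are assumptions, because the text does not report the number of folds.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1-score for binary labels (1 = flood)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}


def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds.

    Assumes the samples were shuffled beforehand; k=5 is an arbitrary choice.
    """
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Hypothetical predictions for six validation points.
scores = classification_metrics([1, 1, 1, 0, 0, 1], [1, 0, 1, 0, 1, 1])
folds = list(k_fold_indices(10, k=3))
```

Each sample appears in exactly one test fold, so averaging the metrics over the folds uses every point once for validation.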

2.4.1 Extreme Gradient Boosting (XGB)

XGBoost is a machine learning technique widely used to address problems in regression, classification, and ranking. It is based on ensemble learning, which combines the predictions of multiple weak models to form a more accurate and robust model (Mirzaei et al. 2021; Mia et al. 2022). A key advantage of XGBoost is its ability to prevent overfitting and improve generalisation by enabling a regularised model representation. The XGBoost algorithm was applied to construct a sequence of decision trees, each of which was designed to correct the mistakes of the previous trees. In each iteration, the algorithm computes the gradient and Hessian of the loss function with respect to the current model’s predictions. These values are then used to build a new tree that predicts the residuals of the prior model’s forecast. The new tree is added to the ensemble, and the process is repeated until a specified number of trees has been reached.
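The gradient and Hessian step described above can be sketched for the binary logistic loss. This is a minimal illustration rather than the code used in this study: the function names and the regularisation values `lam` and `gamma` are assumptions, and the leaf-weight and gain expressions follow the standard second-order formulation of gradient tree boosting.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_hess(y_true, raw_score):
    """Gradient and Hessian of the log loss w.r.t. the raw (log-odds) score."""
    p = sigmoid(raw_score)
    return p - y_true, p * (1.0 - p)


def leaf_weight(grads, hessians, lam=1.0):
    """Optimal leaf value w* = -G / (H + lambda) for the samples in one leaf."""
    return -sum(grads) / (sum(hessians) + lam)


def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Loss reduction obtained by splitting one leaf into two children."""
    def score(g, h):
        return g * g / (h + lam)

    G_l, H_l = sum(g_left), sum(h_left)
    G_r, H_r = sum(g_right), sum(h_right)
    parent = score(G_l + G_r, H_l + H_r)
    return 0.5 * (score(G_l, H_l) + score(G_r, H_r) - parent) - gamma
```

For example, with four samples labelled 1, 1, 0, 0 and an initial raw score of 0, a split that separates the two classes produces a positive gain, and the leaf containing the flooded samples receives a positive weight that pushes its predictions towards the flood class.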

XGBoost also handles missing values natively: at each split, it learns a default direction for samples with missing values, choosing the branch that yields the larger gain. Additionally, the algorithm has several hyperparameters, such as the number of trees, learning rate, maximum depth of each tree, and regularisation parameters, which can be fine-tuned to optimise performance. Cross-validation techniques were used to identify the best set of hyperparameters for the algorithm (Saravanan and Abijith 2022). XGBoost has been used in a wide range of applications, including customer churn prediction, fraud detection, and stock price forecasting. Because of its high accuracy, speed, and scalability, this algorithm is preferred by many machine learning practitioners (Alqahtani et al. 2019).
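A cross-validated hyperparameter search of the kind described above can be sketched as an exhaustive grid search. The helper names, the grid values, and the `cv_score` callable (which is assumed to return a mean cross-validation score for a given parameter set) are hypothetical and serve only to illustrate the selection logic.

```python
from itertools import product


def iter_param_grid(grid):
    """Yield every combination of the candidate values in `grid`."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def select_best(grid, cv_score):
    """Return (best_params, best_score); cv_score maps params to a mean CV score."""
    best_params, best_score = None, float("-inf")
    for params in iter_param_grid(grid):
        score = cv_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

In practice, `cv_score` would train the model on each training fold and average a metric such as AUC over the held-out folds; randomised or Bayesian search are common alternatives when the grid is large.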

2.4.2 Random Forest (RF)

RF is an ensemble learning method that can be used to predict floods by combining the predictions of many decision trees to create a strong predictive model (Sharma et al. 2019). The algorithm was used to construct multiple decision trees, each of which was built using a random subset of the training data and a random subset of the predictor variables. Each tree in the forest is trained to predict the outcome variable (e.g., flood occurrence) based on the predictor variables.

To make a prediction using the RF algorithm, the predictor variables are fed into each tree in the forest, and the outcome variable of each tree is predicted (Saravanan and Abijith 2022). The final prediction is then made by aggregating the predictions from all the trees in the forest using a simple average for regression problems or a majority vote for classification problems. An advantage of the RF algorithm is that it can manage high-dimensional datasets with many predictor variables. It is also less prone to overfitting than some other machine learning algorithms, because each tree is constructed using a random subset of the data and variables. Additionally, RF can help identify the variables that are the most important in predicting the outcome variable, which can be useful for understanding the factors that contribute to flood occurrence (Eini et al. 2020; Mahato et al. 2021). In applying RF to predict flooding, historical data on floods and related variables (e.g., rainfall, river levels, topography, and land use) can be used to train the algorithm. The trained RF model can then be used to predict the likelihood of floods in new locations based on the predictor variables available. The accuracy of the RF model can be evaluated using performance metrics, such as the area under the receiver-operating characteristic (ROC) curve or mean squared error (MSE).
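The sampling and aggregation steps described above can be sketched as follows. The helper names are hypothetical, and the per-tree predictions are passed in as plain lists; fitting the individual trees is outside the scope of this sketch.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the training set of one tree)."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(feature_names, n_features, rng):
    """Pick a random subset of predictors without replacement."""
    return rng.sample(feature_names, n_features)


def majority_vote(tree_predictions):
    """Classification: return the most frequent class label among the trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def average(tree_predictions):
    """Regression: return the mean of the per-tree predictions."""
    return sum(tree_predictions) / len(tree_predictions)
```

For instance, if five trees predict the classes 1, 0, 1, 1, 0 for a location, the majority vote assigns class 1 (flood); averaging the votes instead gives 0.6, which can be read as a flood likelihood.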

2.4.3 CatBoost (CB)

CB is a machine learning algorithm that can be used to predict floods. CB is a type of gradient boosting algorithm designed to automatically manage categorical variables and missing data. Similar to other gradient boosting algorithms, CB builds an ensemble of decision trees in which each tree is built to correct mistakes in the previous tree.

A key advantage of CB is that it can manage categorical variables without the need for one-hot encoding or other preprocessing steps. CB is useful for predicting floods in which categorical variables, such as land use, soil type, and land cover, are important predictors. Additionally, CB uses ordered boosting, a permutation-driven scheme in which the statistics used for each training sample are computed only from samples that precede it in a random ordering. This limits target leakage, thus improving the accuracy of the model and reducing the need for feature engineering.
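CatBoost's treatment of categorical predictors is commonly described in terms of ordered target statistics, in which each sample's category is replaced by a smoothed mean of the target computed only from samples that precede it in a random permutation. The sketch below is a simplified, single-permutation illustration; the function name and the `prior` and `weight` parameters are assumptions and do not reproduce the library's internal code.

```python
import random


def ordered_target_statistics(categories, targets, prior=0.5, weight=1.0, seed=0):
    """Encode a categorical column using only 'earlier' samples in a permutation."""
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)  # random "arrival time" of each sample
    sums, counts = {}, {}
    encoded = [0.0] * n
    for idx in order:
        cat = categories[idx]
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        # Smoothed mean of the targets seen so far for this category.
        encoded[idx] = (s + weight * prior) / (c + weight)
        # Only now reveal this sample's own target to later samples.
        sums[cat] = s + targets[idx]
        counts[cat] = c + 1
    return encoded
```

Because a sample's own label never enters its encoded value, the encoding does not leak the target, which is the motivation for the ordered scheme.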

Historical data on floods and related variables can be used to train the CB algorithm to predict flooding in an area. The algorithm can handle missing data, which can be useful in dealing with incomplete or partially missing datasets (Eini et al. 2020). The trained CB model can then be used to predict the likelihood of floods occurring in new locations based on the available predictor variables. The accuracy of the model can be evaluated using performance metrics such as the area under the ROC curve and MSE. The ability of the CB algorithm to handle categorical and missing data can simplify preprocessing and improve the accuracy of the model (Saravanan et al. 2023).

2.5 Model Performance Assessment

In evaluating flood hazard models during the testing stage, it is important to use multiple metrics to ensure that they are accurate and reliable. In addition to the area under the curve (AUC), two metrics commonly used to model flood hazards are positive predictive value (PPV) and negative predictive value (NPV) (Milanesi et al. 2015; Wing et al. 2020; Ha et al. 2023). The AUC is a measure of how well a model can distinguish between positive and negative cases. It provides a single overall measure of model performance. The higher the AUC, the better the model performs.

PPV was used to measure the proportion of true-positive cases among all positive cases predicted by the model. In the context of flood hazard modelling, true-positive cases represent areas that are at risk of flooding and are correctly identified by the model. The higher the PPV, the more effective the model is in identifying areas that are at risk of flooding. In contrast, NPV measures the proportion of true-negative cases among all negative cases predicted by the model. In the context of flood hazard modelling, true-negative cases represent those areas that are not at risk of flooding and are correctly identified by the model. The higher the NPV, the more effective the model is in identifying areas that are not at risk of flooding. A model with a high AUC, high PPV, and high NPV is considered the most accurate and reliable for use in developing the final flood hazard map. In the present study, a two-stage training and testing approach was used to evaluate three flood hazard models. Five evaluation criteria were used: sensitivity, specificity, PPV, NPV, and AUC. Using these metrics, the performance of the models was thoroughly assessed, and the most accurate and reliable model was selected to create the final flood hazard map.
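The five criteria can be computed from binary labels, binary predictions, and predicted scores as sketched below. The function names are illustrative, labels are assumed to be coded as 1 (flood) and 0 (non-flood), and AUC is computed through its rank interpretation (the probability that a randomly chosen flood sample receives a higher score than a randomly chosen non-flood sample).

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for labels coded as 1 (flood) and 0 (non-flood)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV and NPV (assumes non-zero denominators)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "sensitivity": tp / (tp + fn),  # flooded sites correctly detected
        "specificity": tn / (tn + fp),  # non-flooded sites correctly detected
        "ppv": tp / (tp + fp),          # predicted floods that are real
        "npv": tn / (tn + fn),          # predicted non-floods that are real
    }


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC computation is quadratic in the number of samples, which is acceptable for a sketch; library implementations typically use a sort-based approach.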

3 Results

3.1 Model Input Variable

To identify independent variables that are appropriate for mapping flood-prone areas in As-Suwayq, a multicollinearity study was conducted using two criteria: VIF and tolerance (Talukdar et al. 2022b). As shown in Table 2, the results revealed that, of the 13 independent variables analysed, elevation had the highest collinearity, with a VIF of 4.32 and a tolerance of 0.23. However, all variables had VIF values < 5, indicating the absence of high collinearity. Thus, these 13 independent variables were deemed appropriate for modelling and mapping flood-prone areas in As-Suwayq.

Table 2 The results of multicollinearity analysis for the independent variables
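The two collinearity criteria are directly related: tolerance is the reciprocal of VIF, and both derive from the coefficient of determination obtained when one predictor is regressed on all the others. The short sketch below illustrates this relationship; the function names are hypothetical, and the threshold of 5 is the one stated in the text.

```python
def tolerance_from_vif(vif):
    """Tolerance is the reciprocal of the variance inflation factor."""
    return 1.0 / vif


def r_squared_from_vif(vif):
    """R^2 of regressing one predictor on all the others: VIF = 1 / (1 - R^2)."""
    return 1.0 - 1.0 / vif


def vif_from_r_squared(r2):
    return 1.0 / (1.0 - r2)


def has_high_collinearity(vif, threshold=5.0):
    """Flag a predictor whose VIF reaches the chosen threshold."""
    return vif >= threshold
```

For elevation, `tolerance_from_vif(4.32)` is approximately 0.23, which agrees with the reported tolerance, and the corresponding R² is approximately 0.77.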

3.2 Flood Hazard Mapping

Flood hazard maps were predicted based on three machine learning algorithms: XGB, RF, and CB (Fig. 4). The flood risk assessment conducted in As-Suwayq yielded several key findings regarding the vulnerability of different areas to flooding. One of the most significant findings was that many areas in the south, central, eastern, and western parts of the city were located in high- and very high-risk zones (Fig. 4). These areas are susceptible to flooding because of the presence of water bodies, such as rivers and canals, that overflow their banks during heavy rainfall events. The development of the eastern and western parts of As-Suwayq further exacerbates the risk of flooding. The construction of impermeable surfaces, such as roads and buildings, has prevented water from being absorbed into the ground, leading to faster and more concentrated water flows during heavy rainfall events. Poorly designed drainage systems and the absence of flood protection infrastructure have also contributed to the vulnerability of these areas to flooding.

Fig. 4 Flood hazard maps predicted by three machine learning algorithms: a XGB, b RF, and c CB

In addition to the eastern and western parts of As-Suwayq, the city centre is also at a high risk of flooding. Because it is located between the mountains and sea, it is particularly vulnerable to flooding during heavy rainfall events. The rapid development of the city centre has exacerbated the risk of flooding, because impermeable surfaces prevent water from being absorbed into the ground. A significant portion of the areas south of As-Suwayq are also at an extremely high risk of flooding (Fig. 4). This is primarily because of their proximity to the mountains, which contribute to flooding through several mechanisms. Intense rainfall events generate large volumes of runoff, and landslides can block streams or create new ones. Steep mountain slopes can increase the speed and force of water as it flows downhill, potentially causing significant damage to property and infrastructure. Effective strategies for managing flood risk in each area of As-Suwayq could include the construction of protective infrastructure, such as floodwalls, levees, and barriers, as well as improved drainage systems that can better channel heavy runoff. In addition, land-use management and green infrastructure solutions, such as permeable pavements and green roofs, can help reduce the number of impermeable surfaces and promote natural water retention. Early warning systems and public education campaigns can also aid the residents of As-Suwayq in preparing for potential flooding and in taking appropriate actions to protect themselves and their property.

3.3 Significance of Values

The predictive abilities of the three machine learning models (XGB, RF, and CB) were assessed for the target variable using the input variables. The CB model achieved higher evaluation metrics than the XGB and RF models for most of the input variables (Table 3). The feature importance score of each input variable was also evaluated in each model to determine its relative contribution to predicting the target variable. The results showed that terrain features, particularly curvature, were consistently the most significant predictors of the target variable. Curvature had the highest importance score (i.e., 100) across all three models, suggesting that it played a crucial role in determining the value of the target variable (Table 3). Elevation, which is another terrain feature, was also shown to be a critical predictor, with the second-highest importance score in all three models. SPI, TWI, and NDVI were identified as additional significant features, because they had moderate importance scores across all three models (Table 3). These results suggest that terrain features and vegetation indices play significant roles in predicting the target variable.

Table 3 Importance scores of variables for flood prediction based on XGB, RF, and CB algorithms

Furthermore, the relative importance of each feature varied among the three models. It is therefore crucial that the input features and models be selected carefully to achieve the best predictive performance in addressing a given problem. Nevertheless, these results provide insights into the importance of terrain features, particularly curvature and elevation, in predicting the target variable, regardless of the specific machine learning model used.
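Importance scores in which the top-ranked variable equals 100 can be obtained by rescaling raw model importances relative to their maximum. The sketch below assumes this convention for illustration only; the helper names are hypothetical, and any raw values passed to them are placeholders rather than results from this study.

```python
def scale_importance(raw_importance):
    """Rescale raw importances so that the most important variable scores 100."""
    top = max(raw_importance.values())
    if top <= 0:
        raise ValueError("at least one importance must be positive")
    return {name: 100.0 * value / top for name, value in raw_importance.items()}


def rank_features(scaled_importance):
    """Return variable names ordered from most to least important."""
    return sorted(scaled_importance, key=scaled_importance.get, reverse=True)
```

Rescaling in this way makes scores comparable within a model, but the scaled values of different models are not directly comparable, because each model computes raw importance differently.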

3.4 Model Validation

We evaluated three models using two stages: training and testing. Five criteria were used in the evaluation: sensitivity, specificity, PPV, NPV, and AUC. During the training stage of the XGB, the results were as follows: sensitivity = 0.96, specificity = 0.97, PPV = 0.96, NPV = 0.97, and AUC = 0.98. The results of the XGB model testing stage were as follows: sensitivity = 0.81, specificity = 0.85, PPV = 0.84, NPV = 0.82, and AUC = 0.89. In the RF model, the results of the training stage were as follows: sensitivity = 0.96, specificity = 0.98, PPV = 0.97, NPV = 0.98, and AUC = 0.99. The results of the RF model testing stage were as follows: sensitivity = 0.82, specificity = 0.87, PPV = 0.86, NPV = 0.83, and AUC = 0.90. In the CB model, the results of the training stage were as follows: sensitivity = 0.97, specificity = 0.99, PPV = 0.99, NPV = 0.98, and AUC = 0.99. The results of the CB testing stage were as follows: sensitivity = 0.83, specificity = 0.85, PPV = 0.86, NPV = 0.83, and AUC = 0.91 (Table 4).

Table 4 Evaluation results of models in the training and testing stages

Overall, all three models performed strongly during the training stage (AUC = 0.98 to 0.99), while the CB model achieved the highest AUC during the testing stage (0.91). However, all three models had similar AUC scores in the testing dataset (0.89 to 0.91), which was the most important metric for evaluating their performance (Fig. 5). Based on these results, we recommend using either CB or XGB to predict floods. Both models had high AUC scores in the training and testing datasets. RF is also a viable option, with a testing AUC (0.90) that was comparable to those of the other two models.
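The difference between training and testing AUC gives a simple indication of how well each model generalises. The sketch below applies this check to the AUC values reported in Table 4; the function names are hypothetical.

```python
# AUC values as reported for the training and testing stages (Table 4).
REPORTED_AUC = {
    "XGB": {"train": 0.98, "test": 0.89},
    "RF": {"train": 0.99, "test": 0.90},
    "CB": {"train": 0.99, "test": 0.91},
}


def generalisation_gap(scores):
    """Training AUC minus testing AUC per model, rounded to two decimals."""
    return {m: round(s["train"] - s["test"], 2) for m, s in scores.items()}


def best_on_test(scores):
    """Name of the model with the highest testing AUC."""
    return max(scores, key=lambda m: scores[m]["test"])
```

With the reported values, the gaps are 0.09 for XGB, 0.09 for RF, and 0.08 for CB, and CB has the highest testing AUC.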

Fig. 5 Testing stage of flood hazard models based on AUC: a XGB, b RF, and c CB

4 Discussion

Flooding in dry riverbeds and wadis can cause catastrophic natural disasters and pose significant challenges to prevention and control measures (El-Haddad et al. 2021; Ali et al. 2022). This study aimed to investigate the effects of individual conditioning variables on flood susceptibility mapping and their relevance in accurately mapping potential flood-prone areas in Wilayat As-Suwayq, Al-Batinah North Governorate. Three machine learning models (XGB, RF, and CB) were trained and evaluated to study the prediction of floods based on various input variables. The input variables encompassed a comprehensive set of topographical features, including curvature, elevation, slope, SPI, TWI, TRI, and NDVI, alongside essential environmental factors such as annual rainfall, drainage density, proximity to roads and wadis (dry rivers), soil composition, and geological characteristics (Fig. 3). These variables are widely used in understanding and modelling flood-prone areas (Kirkby et al. 2002; Komolafe et al. 2020; Roy et al. 2020).

Although we are confident in the suitability of the chosen models for our research, we acknowledge that alternative modelling options exist. Some of these, such as convolutional neural networks (CNNs) or recurrent neural networks (RNNs), are particularly well suited for handling time series data (Gebrehiwot et al. 2019). However, these deep learning models often require more extensive datasets and substantial computational resources, which may not have been feasible within the scope and constraints of our study.

The results showed that the CB model outperformed the other two models (XGB and RF) in predicting floods based on the given input variables (see Table 4 and Fig. 5). Identifying the variables that were the most significant in causing floods in the study area was crucial for accurately mapping and predicting flood-prone areas. The three most significant variables were curvature, elevation, and slope, all of which were related to the topography of the area, indicating that terrain plays a critical role in flood susceptibility (see Table 3). This finding is consistent with previous research in which topography was shown to be among the most important factors in flood occurrence (Savage et al. 2016; Choubin et al. 2019).

The finding that curvature was the most critical variable across all three models, with an importance score of 100, demonstrates its significance in accurately mapping flood-prone areas (see Table 3). Curvature describes how strongly the land surface bends, that is, whether it is concave, convex, or flat. In river channels, curvature can indicate the potential for flow convergence or divergence, as well as the degree of channel confinement (Zinger et al. 2013). Thus, curvature can influence the likelihood of flooding. Areas with high curvature may be more prone to channel blockages or water overflows during high-water events. Moreover, curvature can affect the speed and direction of water flow, which can contribute to the formation of eddies, turbulence, and other hydraulic phenomena that can exacerbate flood hazards (Parvin et al. 2022). Therefore, it is important to consider curvature in flood risk assessments and management plans, especially in areas with high levels of curvature, such as As-Suwayq City. Several previous studies have found that curvature was the most important feature in flood susceptibility mapping, which is consistent with our findings (Yariyan et al. 2020; Arabameri et al. 2022).

Although curvature was identified as the most important variable in flood susceptibility mapping in the study area, elevation was also identified as a highly significant factor. However, the ranking of these variables may have been influenced by various factors, including the specific terrain and characteristics of the study area and the machine learning algorithms used in the analysis. Elevation is a key variable in flood susceptibility mapping, because it influences the direction and speed of water flow (Tehrany et al. 2014; Ntajal et al. 2017; Arabameri et al. 2022).

In the study area, which is characterised by dry riverbeds and wadis, elevation plays a critical role in the occurrence and extent of flooding. Specifically, higher elevations tend to have steeper slopes, which can result in faster water flow during floods. This increases the risk of flash floods, which cause significant damage to infrastructure and property. Additionally, higher elevations may be more prone to rainfall-induced landslides, which can exacerbate flooding and lead to additional hazards. In contrast, lower elevations may be more susceptible to water accumulation, which can also increase the risk of flooding (Collins and Znidarcic 2004). Low-lying areas can serve as natural basins that collect and hold water during floods, resulting in prolonged flooding and greater potential damage. In the case of As-Suwayq City, the significance of slope in flood susceptibility mapping may have been related to the specific geomorphological and climatic conditions of the area (Terry et al. 2022). The region is characterised by rugged terrain with steep slopes and a complex network of wadis, which fill rapidly with water during heavy rainfall events. The steep slopes in As-Suwayq increase the speed and erosive power of water, leading to channel instability, sediment transport, and potential flooding downstream. The wadis in the area can also create narrow and confined channels that exacerbate the effects of high-water events, and the presence of boulders and debris further obstructs flow and increases flood risk. Moreover, As-Suwayq is located in a semiarid region that is susceptible to flash flooding, in which intense but short-lived rainfall events cause rapid runoff and the inundation of downstream areas (Choudri et al. 2013). Because of its steep slopes and semiarid climate, As-Suwayq is particularly vulnerable to flooding, which underscores the importance of accurate flood susceptibility mapping and mitigation measures in this area.

In the RF model evaluated in the present study, annual rainfall was identified as the third most important feature in predicting the target variable, following curvature and elevation, with an importance score of 8.201 (see Table 3). This finding indicates that rainfall patterns are essential in predicting the target variable in As-Suwayq. Because the city is located in a region with an arid climate, annual rainfall is a critical determinant of various geospatial phenomena, such as vegetation growth, soil moisture, and groundwater recharge. Similarly, in the CB model, distance from the river received an importance score of 15.59, ranking it above terrain features such as slope, TWI, and SPI. This finding indicates that proximity to wadis plays a significant role in predicting the target variable in As-Suwayq. Wadis, particularly Wadi Al-Jahawar, were found to significantly influence the city’s geospatial characteristics, impacting water availability (Yateem 2009), erosion patterns, and sedimentation processes. However, the city’s reliance on these wadis leaves it vulnerable, as Cyclone Shaheen caused the collapse of 15 falajs, which are crucial for irrigation (Ibrahim et al. 2022). The interconnectedness of these elements highlights the complex relationship between natural phenomena and human activities.

In the present study, the identification of slope as a significant factor in the flood susceptibility mapping of As-Suwayq indicates the need for measures that mitigate flood risk in the region, such as flood control structures, channel stabilisation, and land-use planning that takes into account the specific terrain and climatic conditions of the area (Joseph et al. 2023). The region experiences irregular and infrequent rainfall because of its arid climate. However, when heavy rain does occur, it can overwhelm local drainage systems and cause flooding, especially in areas with poor drainage infrastructure (Etri et al. 2023). Oman’s proximity to the Arabian Sea makes it vulnerable to tropical cyclones, which can bring intense rainfall and flooding (Al-Awadhi et al. 2019). The interaction between these cyclones and local climatic conditions can significantly increase flood risks (Terry et al. 2022). The risk of flooding in Oman can escalate because of rapid urban development, unregulated urban expansion, and alterations in land use, impacting not just Wilayat As-Suwayq but also the entire country (Al-Kindi et al. 2023). Paved surfaces and inadequate drainage systems can increase surface runoff during heavy rains, exacerbating flooding. The quality and capacity of the infrastructure, including stormwater drainage systems, play a crucial role in managing flood risks. Inadequate infrastructure can leave the region vulnerable to flooding, especially in urban areas. The socio-economic status of the population can also affect its resilience to flooding. Lower-income communities in the city may have limited resources to prepare for and recover from flood events, making them more vulnerable (Mansour et al. 2021). Geological and soil factors also play a crucial role in influencing flood risks in this region. The geological composition of the land can affect the speed and direction of water flow. Geological features, such as fault lines, karst landscapes, or underground caves, can affect the movement of water and contribute to flood hazards. In addition, impermeable rock or clay soils can lead to increased runoff, thus exacerbating flood risks (Collentine and Futter 2018).

The present study has revealed the critical factors influencing flood events in As-Suwayq, such as curvature, elevation, rainfall patterns, slope, NDVI, TWI, TRI, SPI, soil types, drainage density, distance from drainage, and distance from roads. This information can aid in developing accurate flood forecasting models, implementing advanced flood warning systems, and guiding urban planning and infrastructure development. Recommendations for flood-resilient construction, drainage systems, and land-use zoning can be proposed based on these factors. Understanding soil and geological factors can guide land and water resource management, such as afforestation, soil conservation, and river channel maintenance. Preserving natural floodplains and wetlands can provide a natural buffer against flooding. Climate adaptation strategies, such as building flood defences, elevating infrastructure, and enhancing coastal protection, can be recommended. The findings can also be used to educate the local population about flood risks and preparedness, and policymakers can revise disaster risk reduction and flood management policies. However, international collaboration may be necessary for effective transboundary flood management. Finally, regular monitoring of geological, hydrological, and meteorological conditions is necessary to respond to changing risk factors over time.

5 Conclusions

In the present study, As-Suwayq City’s vulnerability to flooding was thoroughly investigated using a wide range of scientific factors. These factors included elevation, proximity to rivers, drainage density, geological composition, soil type, annual precipitation, and topography. The study found that elevation was a critical factor, emphasising the need for floodplain management and zoning regulations in low-lying areas. Proximity to rivers and drainage density also significantly contributed to flood vulnerability, highlighting the importance of floodplain management and green infrastructure. Understanding local geology and soil conditions is essential for informed flood risk assessments and land-use planning. Annual rainfall levels played a pivotal role, necessitating the adoption of stormwater management systems in high-rainfall areas. Furthermore, the present study identified variables, such as SPI, TRI, and TWI, as helpful in identifying flash flood–prone regions, calling for improved drainage infrastructure and reduced impervious surfaces. To safeguard Wilayat As-Suwayq against floods, a comprehensive and proactive approach that integrates all these factors into flood risk management strategies is vital for enhancing community safety and resilience. Future research should improve prediction models, investigate cutting-edge technology for early warning systems, and assess the long-term efficacy of existing flood control techniques. This commitment to continual study and adaptable techniques will be critical to building long-term solutions and ensuring the community’s safety in the face of shifting flood hazards.