Introduction

Landslides are one of the most common natural disasters in areas with rugged topography. Landslides triggered by seismic activities, heavy rainfall, or human activities cause significant economic damage, environmental destruction, and loss of life, especially in areas with steep slopes. Therefore, reducing and preventing landslide-related damages continue to be important research topics in disaster management (Zhang et al. 2022a). To reduce landslide-related damages and losses, landslide susceptibility (LS) assessment should be conducted in areas prone to landslides, and landslide susceptibility maps (LSMs) for these areas should be generated (Das et al. 2023). It is possible to identify areas with a high probability of future landslides through LS mapping (Ye et al. 2022).

Various statistical methods have been used to generate LSMs. Logistic regression (LR) (Yilmaz 2009; Kavzoglu et al. 2014; Dağ et al. 2020), frequency ratio (FR) method (Akgun et al. 2008; Demir 2019; Akinci and Yavuz Ozalp 2021), and weights of evidence (Lee and Choi 2004; Sifa et al. 2020; Wang et al. 2020a) are the most commonly used statistical methods. However, Wei et al. (2022) mentioned that while statistical methods are relatively easy to use, they fall short in explaining the complex, inconsistent, and nonlinear relationships between landslide events and the conditioning factors that influence them. Therefore, in recent years, machine learning (ML) and deep learning-based models have gained significant attention in this field.

In the past decade, individual or stand-alone ML algorithms such as artificial neural networks (Pradhan and Lee 2010; Aditian et al. 2018; Akinci 2022), decision trees (Tien Bui et al. 2012; Pradhan 2013), naive Bayes (Chen et al. 2018; Pourghasemi et al. 2018), and support vector machines (SVM) (Goetz et al. 2015; Colkesen et al. 2016; Akinci and Zeybek 2021) have been widely used in LS mapping. On the other hand, several studies in the literature have reported that ensemble learning algorithms, such as adaptive boosting (AdaBoost), random forest (RF), gradient boosting machine (GBM), light gradient boosting machine (LightGBM), categorical boosting (CatBoost), and extreme gradient boosting (XGBoost), outperform individual ML algorithms in terms of prediction accuracy, stability, and robustness (Kavzoglu and Teke 2022; Liu et al. 2022; Wei et al. 2022). Ensemble learning algorithms generate multiple models using different samples of the same dataset and combine the predictions of these models to produce more accurate solutions and improve prediction performance. Sahin (2022) and Ye et al. (2022) emphasized that CatBoost, LightGBM, and XGBoost are rarely used and not extensively evaluated in LS mapping studies, mainly because they are relatively new compared to other ML models. Although tree-based ensemble learning algorithms demonstrate better performance than single models, the most suitable algorithm for LS mapping has not yet been determined (Wei et al. 2022; Ye et al. 2022). Different studies find different algorithms to be superior, depending on the attributes of the study area, the conditioning factors used, and the values of the algorithms’ hyperparameters. Therefore, more studies comparing the performance of different algorithms are needed for a comprehensive evaluation.

In Turkey, due to its geological, geomorphological, topographic, and climatic characteristics, regions such as the Black Sea Region, Eastern Anatolia, and Central Anatolia are frequently affected by landslides. Approximately 20% of landslides in Turkey occur in the Eastern Black Sea Region. The provinces with the highest incidence of landslides in this region are Trabzon, Rize, Giresun, and Artvin, respectively. According to the statistical data from the Disaster and Emergency Management Presidency (AFAD) covering the years 1950–2019, Trabzon experienced 1517 landslides, Rize had 1319, Giresun had 913, and Artvin had 765 landslide events (AFAD 2020). In the study by Dalkes and Korkmaz (2023), LSMs of the Akçaabat and Düzköy districts of Trabzon province were produced using the Analytic Hierarchy Process (AHP) and FR methods. Upon comparison of the LS maps produced by both methods, it was determined that the FR method yielded more accurate results than the AHP method in identifying the locations of the observed landslides in the study area. Yavuz Ozalp et al. (2023) produced LSMs of the Ardeşen and Fındıklı districts of Rize province using tree-based ensemble learning algorithms such as RF, GBM, CatBoost, and XGBoost. Using the ROC curve and AUC metric, the researchers found that CatBoost performed slightly better than the other models. In the study by Kaya Topaçli et al. (2024), LR and RF models were used to produce LSMs of the Bolaman river basin of Ordu province, located in the Eastern Black Sea Region in Turkey. The study area is one of the basins where landslides occur most frequently in Turkey, as in many places in the Eastern Black Sea Region. The validation results of the study showed that RF is superior to LR in terms of performance. Landslides that damage buildings and infrastructure and cause economic losses and loss of life occur in all districts of Artvin.
However, a review of the literature reveals that LS assessments have been conducted in the districts of Arhavi, Hopa, Kemalpaşa, Merkez, Ardanuç, and Şavşat in Artvin (Akinci et al. 2020, 2021; Akinci and Yavuz Ozalp 2021; Akinci and Zeybek 2021), while no studies have been conducted in the districts of Borçka, Murgul, and Yusufeli.

The aim of this study was to produce susceptibility maps showing areas prone to landslides in the Murgul district of Artvin province and to identify the main factors contributing to landslide susceptibility. Prediction models based on the RF, LightGBM, CatBoost, and XGBoost ML algorithms were used for susceptibility mapping. The prediction capabilities or performances of the susceptibility models were evaluated using metrics such as overall accuracy, precision, recall, F1-score, root mean square error (RMSE), and AUC-ROC. Furthermore, the SHapley Additive exPlanations (SHAP) approach, which aims to explain the local behavior of black box ML models, was used to improve the interpretability of model predictions. When the current literature is reviewed, it is seen that only a limited number of studies have used the SHAP approach in LS modeling (Pradhan et al. 2023; Sun et al. 2023a; Teke and Kavzoglu 2023; Vega et al. 2023; Youssef et al. 2023; Zhang et al. 2023). Therefore, the use of SHAP in LS modeling constitutes the innovative aspect of this study and its contribution to the literature.

Material and methods

The aim of this study is to produce LSMs of the study area using tree-based ensemble learning algorithms and to compare the performance of the ML models used. The study can be broadly divided into 6 steps: (1) preparation of spatial data for the landslide conditioning factors; (2) testing the independence of the conditioning factors using multicollinearity analysis; (3) collection of landslide inventory data and creation of the training and validation datasets; (4) production of LSMs using the RF, LightGBM, CatBoost, and XGBoost models; (5) evaluation of model performance using validation metrics; (6) interpretation of the models using SHAP values to explain the formation mechanisms of landslides in the study area. This study was carried out using ArcGIS 10.5, SAGA GIS 7.9, Python, and the Scikit-learn library. A flowchart summarizing the methodology applied in this study, inspired by Yu et al. (2023), is shown in Fig. 1.

Fig. 1
figure 1

The flowchart representing the methodology followed in the study

Study area and geological structure

The study area consists of a bounding box covering the Murgul district. This area is located between 41° 7′ 1.39″ and 41° 22′ 24.86″ north latitude and between 41° 27′ 55.41″ and 41° 41′ 1.53″ east longitude, with a total area of 48,938.23 hectares. The average elevation in the study area is 1347 m, and the elevation ranges from 100 to 3370 m (Fig. 2). The study area has a highly rugged topography, and the slope values range from 0° to 76.33°. The mean slope value in the study area is 29.09°. In the study area, 4.09% of the area has a slope below 10°, 15.78% has a slope between 10° and 20°, and 80.13% has a slope above 20°.

The study area includes 13 villages along with the town center of Murgul. Two of these villages, Civan and Akpınar, are administratively affiliated with the Borçka district (Fig. 2). Based on the data from the Turkish Statistical Institute (TURKSTAT), the total population of Murgul district in 2021 is 6522 (TURKSTAT 2023). Out of this population, 5020 reside in the town center, while 1502 live in the villages. With the inclusion of the 2 villages affiliated with the Borçka district, the total population in the study area reaches 6896 people.

Fig. 2
figure 2

Study area: (a) location of Artvin, (b) location of Murgul, (c) landslides in Murgul

The study area is characterized by a Black Sea climate. Based on the meteorological measurements collected from the General Directorate of Meteorology for the years 2015–2022, the average temperature in Murgul district is 12.76 °C. The yearly mean rainfall in Murgul is 980.63 mm. The highest temperature recorded in the district was 31 °C in August 2022, while the lowest temperature was -3 °C in December 2016.

In this study, a 1/100,000 scale geological map obtained from the General Directorate of Mineral Research and Exploration (GDMRE) was used (Keskin 2013a; 2013b). Based on this map, there are 18 different lithological units in the study area (Fig. 3). The study area comprises lithological units ranging in age from Late Cretaceous (Turonian-Coniacian) to Quaternary (Table 1). Four lithological units account for 78.52% of the study area: Kızılkaya Formation (Kk) covers 24.28%, Kabaköy Formation (Tek) covers 23.66%, Çatak Formation (Kç) covers 20.6%, and Çağlayan Formation (Kça) covers 9.98%. The Santonian-aged Kızılkaya Formation (Kk) consists of rhyodacitic, dacitic lavas, and pyroclastics.

Fig. 3
figure 3

Lithological map of the study area

Table 1 Lithological units of the study area

The Kabaköy Formation (Tek), which is of Middle Eocene age, starts with clastic and carbonates and includes andesitic, basaltic lavas, and pyroclastics, as well as conglomerates, sandy limestone, sandstone, marl, and tuff. The Late Cretaceous (Turonian-Coniacian) Çatak Formation (Kç) comprises basaltic andesitic lavas and pyroclastics, along with argillaceous limestone, marl, siltstone, and shale. The Campanian–Maastrichtian-aged Çağlayan Formation (Kça) is composed of basaltic andesitic lavas and pyroclastics, as well as mudstone and sandstone (Keskin 2013a; 2013b). The rocks contained in the other lithological units in the study area are given in Table 1.

Landslide inventory

Landslide susceptibility (LS) mapping studies require landslide inventory data both for model training and validation stages. For models to generate reliable results, the inventory data needs to be up-to-date, accurate, and complete. The landslide inventory map (LIM) used in the study contains 54 landslide polygons (Fig. 2).

Among the landslide polygons, 11 were obtained from the 1/25,000 scale LIM generated by the GDMRE. In this inventory, landslide polygons are classified as Type 1, Type 2, and Type 4. Type 1 represents inactive landslides, Type 2 represents active landslides, and Type 4 represents active flows (Erener et al. 2016; Duman and Çan 2023). The remaining 43 landslide polygons used in the study were obtained from the landslide inventory map produced by the Artvin Provincial Disaster and Emergency Directorate. When examining landslides and related data in Turkey, it is known that predominantly translational and rotational landslides, as well as complex landslides combining multiple types, occur (AFAD 2020). According to Varnes’ (1978) slope movement classification, of the 43 landslides, 1 is categorized as a rotational slide, 8 as translational slides, 24 as flows, 6 as debris flows, and 4 as complex landslides. The total area covered by the landslide polygons is 9,241,303.99 m2, approximately 2% of the study area. The smallest and the largest landslide polygons in the study area have areas of 38.32 m2 and 2,809,890 m2, respectively.

A total of 54 landslide polygons in the study area were converted to raster format at a spatial resolution of 10 m. As a result of this conversion, a value of “1” was assigned to 92,446 pixels, which are referred to as positive examples. To create negative examples, an equal number of non-landslide pixels was randomly selected during the implementation of the ML models in Python, and these pixels were assigned a value of “0”. Subsequently, the resulting total of 184,892 landslide and non-landslide pixels was divided into two datasets in a 70/30 ratio, following the literature (Youssef and Pourghasemi 2021; Akinci 2022; Liu et al. 2022; Sahin 2022; Wei et al. 2022). These datasets were employed for the training and validation of the models.
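The sampling and splitting scheme described above can be sketched as follows. This is an illustrative sketch only, with hypothetical pixel counts in place of the study's 92,446 positives, showing a balanced random draw of negatives followed by a 70/30 split:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-ins for the rasterized pixel indices:
# 1,000 landslide pixels (label 1) and a larger pool of non-landslide pixels.
n_pos = 1_000
pos_idx = np.arange(n_pos)                 # indices of landslide pixels
pool_idx = np.arange(n_pos, 50_000)        # indices of non-landslide pixels

# Randomly draw as many negatives as there are positives (balanced design).
neg_idx = rng.choice(pool_idx, size=n_pos, replace=False)

indices = np.concatenate([pos_idx, neg_idx])
labels = np.concatenate([np.ones(n_pos, dtype=int), np.zeros(n_pos, dtype=int)])

# Shuffle jointly, then split 70/30 into training and validation sets.
order = rng.permutation(indices.size)
indices, labels = indices[order], labels[order]
split = int(0.7 * indices.size)
train_idx, train_y = indices[:split], labels[:split]
valid_idx, valid_y = indices[split:], labels[split:]

print(train_y.size, valid_y.size)  # 1400 600
```

In the study itself, the same logic would be applied to the 92,446 positive pixels, yielding the 184,892-pixel balanced dataset reported above.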

Landslide conditioning factors

In this study, LSMs were generated using 13 factors, including land cover, aspect, slope, lithology, elevation, plan curvature, profile curvature, distance to drainage networks, distance to faults, distance to roads, slope length, topographic position index (TPI), and topographic wetness index (TWI). These factors were determined based on the availability or producibility of spatial data, the geological and environmental characteristics of the study area, and relevant literature. Raster-based factor maps were produced at a spatial resolution of 10 m using ArcGIS 10.5 and SAGA GIS 7.9 software (Fig. 4 and Fig. 5). Digital topographic maps of the study area were obtained from the General Directorate of Mapping. The digital elevation model (DEM) of the study area with a resolution of 10 m was produced by using the contour lines in these 1/25.000 scale topographic maps. Aspect, slope, elevation, plan and profile curvature were prepared in ArcGIS 10.5 software using this DEM. Additionally, slope length, TPI and TWI were produced using the same DEM in SAGA GIS software. As explained in Section "Study Area and Geological Structure", the lithological units and fault lines were obtained from the 1:100,000 geological map provided by GDRME. The map displaying the distance to faults in the study area was generated using the “Euclidean Distance” function in ArcGIS software. The study area’s drainage network was generated using SAGA GIS software based on the DEM. Additionally, the “Euclidean Distance” function of ArcGIS software was used to produce the distance map to the drainage networks. A comprehensive digital road network dataset including highways, village roads and forest roads in the study area was obtained from Artvin Regional Directorate of Forestry. The map representing the distance to the roads was produced using the “Euclidean Distance” function of ArcGIS software, as in other distance maps. 
On the other hand, the 10 m resolution land cover dataset of the study area was obtained from ESRI (https://livingatlas.arcgis.com/landcover/). The relationship between conditioning factors and landslide occurrences has been well explained in numerous studies (Gomez and Kavzoglu 2005; Kavzoglu et al. 2014; Pourghasemi and Rahmati 2018; Zhao et al. 2019; Dağ et al. 2020; Youssef and Pourghasemi 2021; Ye et al. 2022), hence detailed discussions on this topic are not provided in this article. Instead, basic statistical data related to conditioning factors are presented (Table 2).

Fig. 4
figure 4

Maps that represent conditioning factors: a) altitude, b) aspect, c) distance to drainage, d) distance to faults, e) distance to roads, f) land cover

Fig. 5
figure 5

Maps that represent conditioning factors: a) plan curvature, b) profile curvature, c) slope, d) slope length, e) TPI, f) TWI

Table 2 Statistical description of the conditioning factors and their sub-classes

Machine learning algorithms used in the study

Random forest (RF)

The Random Forest (RF) algorithm, originally proposed by Breiman (2001), is a nonparametric ensemble learning technique frequently utilized in classification and regression tasks, and it has been widely adopted in LS mapping studies (Catani et al. 2013). In contrast to a single decision tree, which is susceptible to overfitting and may exhibit high variance or bias (Taalab et al. 2018; Park and Kim 2019), RF generates multiple decision trees and aggregates their predictions to arrive at a final classification (Youssef et al. 2016). This approach mitigates the weaknesses of individual trees and improves predictive performance, making RF a widely used tool in the field of ML. Each decision tree is created using a randomly selected subset of the training data, and the final prediction produced by RF is obtained by aggregating the predictions of all the decision trees (Akinci et al. 2020).

Two parameters need to be defined when creating an RF: the number of decision trees (ntree) and the number of variables or factors used at each node of the decision tree (mtry). Although there is no definitive rule for selecting the number of trees in RF, increasing the number of trees does not guarantee an improvement in the model’s accuracy (Taalab et al. 2018). As a rule of thumb, the number of variables used at each node of the decision tree is set to the square root of the total number of variables (Chen et al. 2020). The out-of-bag (OOB) error, which is the percentage of misclassifications over all out-of-bag samples, is used to estimate the generalization error and evaluate the importance of variables (Achour and Pourghasemi 2020; Cao et al. 2020). In this study, the RF method was implemented using the scikit-learn random forest regressor, with the ntree and mtry parameters set to 100 and 13, respectively. A tenfold cross-validation approach was used to validate the consistency of the model’s results.
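A minimal sketch of this setup, assuming scikit-learn's `RandomForestClassifier` (the classification counterpart of the estimator mentioned above) and synthetic stand-in data for the 13 conditioning factors:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the 13 conditioning factors; in the study these
# would be factor values sampled at landslide/non-landslide pixels.
X = rng.normal(size=(500, 13))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# ntree = 100 and mtry = 13 (all factors considered at each split), as in the text.
rf = RandomForestClassifier(n_estimators=100, max_features=13, random_state=0)

# Tenfold cross-validation to check the consistency of the results.
scores = cross_val_score(rf, X, y, cv=10, scoring="accuracy")
print(scores.mean())
```

Setting `max_features=13` with 13 factors means every split considers all variables; the square-root heuristic mentioned above would instead correspond to `max_features="sqrt"`.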

Extreme gradient boosting (XGBoost)

XGBoost, introduced by Chen and Guestrin (2016) is a combination of the gradient boosting algorithm and the decision tree models (Wei et al. 2022; Cao et al. 2020). The XGBoost has gained popularity in LS mapping studies (Kavzoglu and Teke 2022). The main advantage of the XGBoost is its performance in the terms of runtime speed and accuracy (Wei et al. 2022). The XGBoost algorithm utilizes a gradient boosting technique, which constructs a tree by splitting features and recursively adding trees (Zhang et al. 2020a, b). For each time a new tree is added, a new function is yielded by fitting the residual value of the previous predictions. The tree is constructed by training the model, hence, the leaf node of the tree stores a score, and the sample’s predicted value is the sum of the scores of all nodes (Ye et al. 2022). The aim of the model is to minimize the difference between predicted value and true value by minimizing the loss function of the training data as shown in Eq. 1 (Wang et al. 2020b).

$${\varphi }^{t}=\sum_{i=1}^{n}l\left({y}_{i},\left({\widehat{y}}_{i}^{\left(t-1\right)}+{f}_{t}\left({x}_{i}\right)\right)\right)+\Omega \left({f}_{t}\right)$$
(1)

where i indexes the predicted values ŷ (i = 1, 2, 3, ⋯, n); n is the total number of samples; t is the iteration number; l(yi, ŷi) is the loss function between the actual value yi and the predicted value ŷi; xi denotes the features of the i-th sample; ft(xi) is the base learner added at the t-th iteration; Ω(ft) is the regularization term; and φt is the objective function at the t-th iteration.

ML algorithms have some parameters that need to be tuned during the training phase. These parameters, called hyperparameters, significantly affect the accuracy, performance and processing speed of the model (Yavuz Ozalp et al. 2023). XGBoost includes several hyperparameters that can be tuned to improve the performance of the model. These include “n_estimators” (the maximum number of iterations or trees), “max_depth” (the maximum depth of the trees), “eta” (the learning rate), “gamma” (the regularization parameter), “colsample_bytree” (the number of features or variables supplied to a tree), “min_child_weight” (the minimum sum of instance weight needed in a child), and “subsample” (the number of samples or observations supplied to a tree).
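The hyperparameters listed above, at the default values reported later in this study, can be collected in a single configuration. This is an illustrative sketch only; in practice such a dictionary would be passed to an XGBoost estimator (e.g. `xgboost.XGBClassifier(**params)`) or fed into a tuning procedure:

```python
# Default XGBoost hyperparameter values, as listed in the text.
params = {
    "n_estimators": 100,    # maximum number of boosting iterations (trees)
    "max_depth": 6,         # maximum depth of each tree
    "eta": 0.3,             # learning rate
    "gamma": 0,             # regularization parameter (minimum loss reduction)
    "colsample_bytree": 1,  # fraction of features supplied to each tree
    "min_child_weight": 1,  # minimum sum of instance weight needed in a child
    "subsample": 1,         # fraction of observations supplied to each tree
}
print(sorted(params))
```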

Light gradient boosting machine (LightGBM)

Light gradient boosting machine (LightGBM), first proposed by Ke et al. (2017), is a gradient boosting approach. LightGBM is designed to overcome performance issues encountered by gradient boosting decision trees in data-intensive applications (Zhang et al. 2022b). Thanks to its support for efficient concurrent training, faster training speed, reduced memory usage, and distributed computation, it can handle big data efficiently and can be considered an improvement over XGBoost (Dai et al. 2021).

The efficiency of LightGBM comes from two novel optimization techniques: Gradient-based One-Side Sampling (GOSS), which reduces the number of data instances, and Exclusive Feature Bundling (EFB), which reduces the number of features (Zhang et al. 2022b). GOSS down-samples the data instances by discarding a portion of the instances with small gradients and retaining the instances with large gradients when evaluating information gain; more detailed information can be found in Ke et al. (2017). EFB reduces the number of features by bundling related features together; a conflict rate is used to determine whether a feature can be bundled without compromising accuracy (Fang et al. 2021).
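The GOSS idea can be illustrated with a short sketch. This is not LightGBM's internal implementation, only a toy version of the sampling rule: keep the fraction a of instances with the largest gradients, sample a fraction b of the rest, and up-weight the sampled small-gradient instances by (1 − a)/b so the information-gain estimate stays approximately unbiased:

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, rng=None):
    """Toy Gradient-based One-Side Sampling (illustrative only)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = gradients.size
    order = np.argsort(-np.abs(gradients))   # descending |gradient|
    n_top = int(a * n)
    top = order[:n_top]                      # large-gradient instances, all kept
    rest = order[n_top:]
    rand = rng.choice(rest, size=int(b * n), replace=False)

    idx = np.concatenate([top, rand])
    weights = np.ones(idx.size)
    weights[n_top:] = (1 - a) / b            # compensate for the down-sampling
    return idx, weights

grads = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(grads)
print(idx.size)  # 300: 200 large-gradient + 100 sampled small-gradient instances
```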

When creating the LightGBM model, some hyperparameters need to be tuned. These basic hyperparameters are “boosting_type”, “num_leaves”, “max_depth”, “num_iterations”, and “learning_rate” (Zhou et al. 2022; Omotehinwa et al. 2023). The “boosting_type” parameter defines the gradient boosting method to be run. Valid values are “gbdt”, “rf”, “dart”, and “goss”, but the default is “gbdt”. The parameter “num_leaves” refers to the maximum number of leaves in a tree. The “max_depth” parameter controls the maximum depth of each tree. The parameter “num_iterations” determines the number of boosting iterations, or trees to build. The “learning_rate” determines the speed at which the model’s weights are updated after processing each batch of training examples.

Categorical boosting (CatBoost)

Categorical boosting (CatBoost), first introduced by Prokhorenkova et al. (2018), is another improved gradient boosting technique. CatBoost can process categorical data along with numerical data and needs less training data compared to other ML methods (Sahin 2022). Instead of repeatedly utilizing the same data when constructing trees, which leads to overfitting in gradient boosting, CatBoost uses an ordered boosting technique to combat this problem. Combining ordered boosting with its treatment of categorical values prevents the prediction shift that stems from a special type of target leakage (Pham et al. 2022). More details about the algorithm can be found in Prokhorenkova et al. (2018).

CatBoost has seven commonly used hyperparameters: iterations, depth, learning_rate, l2_leaf_reg, random_strength, rsm, and border_count (Yavuz Ozalp et al. 2023). The “iterations” parameter specifies the maximum number of trees to be used during training. The “depth” parameter defines the depth of each decision tree. The “learning_rate” determines the step size at each iteration while moving toward a minimum of a loss function. The “l2_leaf_reg” is the coefficient for the L2 regularization term of the cost function. The parameter “random_strength”, which is used to prevent overfitting in the model, expresses the amount of randomness to be used to score splits when the tree structure is selected. The “rsm”, random subspace method, refers to the percentage of features to be used in each split selection when features are randomly re-selected. Lastly, the “border_count” refers to the number of splits for numerical features (AWS 2024).

Multicollinearity analysis

To enhance the accuracy of a landslide susceptibility analysis using an ML model, it is crucial to test the independence of the model’s input variables (Yu et al. 2023). Multicollinearity analysis is conducted to identify any linear correlation between the conditioning factors. The strong linear relationship between the variables can cause prediction results to be inaccurate and decrease the model’s accuracy (Song et al. 2023). In LS mapping studies, the most commonly used indicators for multicollinearity are tolerance (TOL) and variance inflation factor (VIF), which can be calculated using Eqs. 2 and 3 (Kavzoglu et al. 2014; Bai et al. 2015; Arabameri et al. 2020; Wang et al. 2020a, b; Yi et al. 2020; Wei et al. 2022). The VIF is a measure of the increase in the variance of a regression coefficient due to multicollinearity. The TOL is the reciprocal of the VIF value and can also be used to test for multicollinearity between variables (Yu et al. 2023). In Eq. 2, R2 represents the proportion of variance in the target variable (Ye et al. 2022). If the VIF value is greater than 10 or the TOL value is less than 0.1, it indicates a multicollinearity problem, and the multicollinear variables should be removed from the susceptibility models.

$$TOL=1-{R}^{2}$$
(2)
$$VIF= \frac{1}{TOL}= \frac{1}{1-{R}^{2}}$$
(3)
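As an illustrative sketch (not part of the original workflow), Eqs. 2 and 3 can be evaluated by regressing each conditioning factor on the remaining ones with ordinary least squares and taking the resulting R²:

```python
import numpy as np

def vif_tol(X):
    """Compute VIF and TOL for each column of the factor matrix X (Eqs. 2-3)."""
    n, p = X.shape
    vifs, tols = [], []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])    # design matrix with intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid.var() / y.var()               # coefficient of determination
        tol = 1 - r2                                  # Eq. 2
        tols.append(tol)
        vifs.append(1 / tol)                          # Eq. 3
    return np.array(vifs), np.array(tols)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)       # deliberately collinear pair
vif, tol = vif_tol(X)
print(vif.round(2))
```

In this synthetic example the collinear pair (columns 0 and 3) produces VIF values far above the usual threshold of 10, while the independent columns stay near 1.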

Performance assessment metrics

Akinci and Akinci (2023) emphasized that validation is an essential procedure to evaluate the performance of the models. An LSM that has not been validated holds no scientific value. Sahin (2022) stated that different accuracy metrics can be used to evaluate the performance of ML models. In this study, overall accuracy, precision, recall, F1-score, RMSE and area under the receiver operating characteristic curve (AUC-ROC) metrics have been used to validate the results and compare and evaluate the performance of four different ML models. Except for RMSE, the performance assessment metrics given in Table 3 are calculated using the components of the confusion matrix.

Table 3 Performance assessment metrics

In the equations in Table 3, TP (true positive) and TN (true negative) express the number of pixels correctly classified as landslide and non-landslide, respectively; FP (false positive) and FN (false negative) denote the number of pixels misclassified as landslide and non-landslide, respectively (Ye et al. 2022).
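The confusion-matrix-based metrics of Table 3 can be computed directly from the four counts; the following is a minimal sketch with hypothetical counts:

```python
def classification_metrics(tp, tn, fp, fn):
    """Overall accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    oa = (tp + tn) / (tp + tn + fp + fn)            # overall accuracy
    precision = tp / (tp + fp)                      # correctness of landslide predictions
    recall = tp / (tp + fn)                         # completeness of landslide detection
    f1 = 2 * precision * recall / (precision + recall)
    return {"OA": oa, "precision": precision, "recall": recall, "F1": f1}

# Hypothetical counts for illustration only.
m = classification_metrics(tp=80, tn=90, fp=10, fn=20)
print(round(m["F1"], 3))  # 0.842
```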

SHapley additive exPlanation (SHAP)

In applications such as LS mapping, it is crucial to understand why and how a model makes a particular prediction, in addition to achieving high prediction accuracy. However, the increasing computational power and complexity of machine learning models make it difficult to understand their internal mechanisms, local behaviors, and decision-making processes (Zhang et al. 2023). The new generation of AI models, called explainable or interpretable AI (XAI), aims to explain the local behavior of black box models (Youssef et al. 2023). Lundberg and Lee (2017) proposed SHAP (SHapley Additive exPlanation) to provide explanations for the predictions of various machine learning models, particularly opaque black box models. The SHAP method quantifies the impact of each feature on the model’s prediction by decomposing the prediction into a sum of Shapley values, one per input feature. This improves comprehension of how a model generates predictions (Zhang et al. 2023; Teke and Kavzoglu 2023).
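The additivity property can be illustrated with an exact Shapley computation on a toy model. This sketch replaces absent features with fixed baseline values (a crude stand-in for the expectation-based marginalization the SHAP library uses); the feature names are hypothetical:

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values of prediction f at instance x (small feature sets only)."""
    names = list(x)
    n = len(names)

    def value(subset):
        # Features outside the coalition fall back to the baseline value.
        filled = {k: (x[k] if k in subset else baseline[k]) for k in names}
        return f(filled)

    phi = {}
    for i in names:
        others = [j for j in names if j != i]
        total = 0.0
        for r in range(len(others) + 1):
            for s in combinations(others, r):
                weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                total += weight * (value(set(s) | {i}) - value(set(s)))
        phi[i] = total
    return phi

# Toy "model": a linear score over two hypothetical conditioning factors.
model = lambda v: 2.0 * v["slope"] + 1.0 * v["twi"]
x = {"slope": 0.8, "twi": 0.3}
base = {"slope": 0.0, "twi": 0.0}
phi = shapley_values(model, x, base)
print(phi)  # contributions sum to model(x) - model(base)
```

For a linear model each Shapley value reduces to the coefficient times the feature's deviation from the baseline, and the contributions sum exactly to the difference between the prediction and the baseline prediction.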

Results

Multicollinearity analysis

Multicollinearity, in its simplest definition, is the presence of a high correlation or linear relationship between the independent variables of a regression model. Multicollinearity causes the results obtained from the model to be inaccurate. Therefore, multicollinearity analysis is applied in LS studies to test whether there is a high correlation between the conditioning factors. This analysis has two possible outcomes: (i) there is no multicollinearity among the factors, or (ii) some factors are multicollinear and the correlated factors should be removed from the model. Both outcomes are reported in the literature. For example, in the studies conducted by He et al. (2023), Song et al. (2023), Vega et al. (2023) and Yu et al. (2023), no serious multicollinearity problem was found between the conditioning factors.

Table 4 displays the outcomes of the multicollinearity analysis conducted on the conditioning factors employed in this study. The initial findings indicated a significant correlation between slope and TRI. Consequently, TRI was eliminated from the model, and a subsequent multicollinearity analysis was conducted on the remaining 13 factors. The highest VIF value was 3.918890, and the lowest TOL value was 0.255174 (Table 4). Since the highest VIF value was less than 10 and the lowest TOL value was higher than 0.1, there was no multicollinearity problem among the conditioning factors used in this study.

Table 4 Multicollinearity Analysis Results

In the LS study conducted by Yavuz Ozalp et al. (2023) in Ardeşen and Fındıklı districts of Rize (Turkey), multicollinearity analysis was performed for fifteen conditioning factors. As in this study, the researchers found that there was a high linear correlation between slope and TRI, and TRI was excluded from the susceptibility models. Wang et al. (2022) found collinearity between the factors of slope, elevation variation coefficient, surface cutting depth, relief amplitude, and surface roughness. Since slope is very effective in the occurrence of landslides, the other four factors were eliminated.

Landslide susceptibility mapping

In this study, firstly, landslide susceptibility index (LSI) maps of the study area were produced using the RF, CatBoost, XGBoost and LightGBM algorithms. The algorithms were implemented in the Python programming language using the scikit-learn library. The XGBoost algorithm was implemented with default hyperparameter values (n_estimators = 100, max_depth = 6, eta = 0.3, colsample_bytree = 1, min_child_weight = 1, subsample = 1, gamma = 0). In order not to favour one algorithm over the others and to make an objective comparison, the number of trees was set to 100, the maximum tree depth to 6, and the learning rate to 0.3 for the other algorithms. The LSI expresses the degree of susceptibility of each pixel in the study area to landslide formation: the higher the LSI value of a pixel, the higher the probability of landslide occurrence in that pixel, and the lower the LSI, the lower the probability (Wubalem 2021). After the LSIs were created, they were transferred to ArcGIS software, where they were reclassified into five susceptibility classes (very low, low, medium, high, and very high) using the natural breaks algorithm (Jenks 1967). Thus, LSMs of the study area were obtained (Fig. 6).
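The reclassification step can be sketched as follows. The LSI values and class breaks here are hypothetical stand-ins (the study derives its breaks with the Jenks natural breaks algorithm rather than the fixed values used below):

```python
import numpy as np

rng = np.random.default_rng(0)
lsi = rng.random(10_000)  # stand-in LSI value per pixel, in [0, 1)

# Hypothetical class breaks; the study computes them with Jenks natural breaks.
breaks = [0.2, 0.4, 0.6, 0.8]
classes = np.digitize(lsi, breaks)  # 0..4 -> very low .. very high

labels = ["very low", "low", "medium", "high", "very high"]
share = {labels[c]: round(100 * np.mean(classes == c), 1) for c in range(5)}
print(share)
```

The resulting per-class pixel shares correspond to the areal distribution ratios reported for each model in Table 5.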

Fig. 6
figure 6

LSMs: a) RF, b) CatBoost, c) XGBoost, d) LightGBM

Table 5 presents the areal distribution ratios of the susceptibility classes for each ML model. According to the areal distribution of susceptibility classes in Table 5, most of the study area is not susceptible to landslides: 85.89% of the study area in RF, 71.86% in CatBoost, 70.16% in XGBoost, and 77.31% in LightGBM is classified as having very low or low susceptibility. Only 8.74% of the study area in RF, 12.85% in CatBoost, 11.37% in XGBoost, and 10.43% in LightGBM is highly or very highly susceptible to landslides.

Table 5 Comparison of the results of the ML models

Performance assessment and comparison

In this study, various metrics, including overall accuracy (OA), precision, recall, F1-score, RMSE and the area under the receiver operating characteristic (ROC) curve (AUC), were employed to compare and evaluate the performance of the LS models. The assessment metrics were computed for both the training and validation stages. In terms of OA, RF was the model with the lowest accuracy in both stages (Table 6). Although CatBoost, XGBoost and LightGBM achieved similar accuracy values, LightGBM provided slightly better accuracy than the other models in both the training and validation stages. LightGBM also outperformed the other models in terms of precision, recall and F1-score. However, it should be noted that ML models may perform differently depending on both the characteristics of the study area and the conditioning factors used.
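For a binary landslide/non-landslide problem, these four metrics can all be derived from the confusion matrix. A minimal sketch with hypothetical labels and predictions:

```python
import numpy as np

# Hypothetical test labels (1 = landslide, 0 = non-landslide) and predictions.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))    # true positives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))    # true negatives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))    # false alarms
fn = int(np.sum((y_true == 1) & (y_pred == 0)))    # missed landslides

oa = (tp + tn) / len(y_true)                        # overall accuracy
precision = tp / (tp + fp)                          # correctness of landslide alarms
recall = tp / (tp + fn)                             # share of landslides detected
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
```

In practice these values would come from a library such as `sklearn.metrics`; the manual form simply makes the definitions explicit.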

Table 6 Performance of the ML models in training and validation stage

One of the commonly used metrics to measure the accuracy of regression models is RMSE (Trinh et al. 2023). RMSE measures the prediction error of a model; the closer it is to 0, the better the model performs (Nguyen et al. 2019; Ado et al. 2022; Trinh et al. 2023). In contrast to the precision, recall and F1-score results, XGBoost outperformed the other models by achieving the lowest RMSE values.
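Applied to susceptibility models, RMSE is computed between the observed binary labels and the predicted landslide probabilities; a sketch with hypothetical values:

```python
import numpy as np

y_true = np.array([1, 1, 0, 0, 1])            # observed labels
p_pred = np.array([0.9, 0.7, 0.2, 0.1, 0.6])  # predicted landslide probabilities

# Root mean squared error between labels and probabilities.
rmse = float(np.sqrt(np.mean((y_true - p_pred) ** 2)))
```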

In this study, the AUC metric, which is the most widely used performance assessment metric in LS mapping studies, was also used. Figure 7 shows the ROC curves and AUC values of the models. When Fig. 7 is examined, it is seen that the XGBoost model has the highest AUC value (0.9773), followed by the LightGBM (0.9751), CatBoost (0.9708) and RF (0.8976) models, respectively. As with other evaluation metrics, RF lagged behind other models in terms of AUC. In the study by Sahin (2022), RF also lagged behind other ensemble learning algorithms such as CatBoost, XGBoost and LightGBM in terms of AUC value.
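AUC has a useful probabilistic reading: it equals the probability that a randomly chosen landslide pixel receives a higher susceptibility score than a randomly chosen non-landslide pixel. A small sketch of this Mann-Whitney formulation, equivalent to the area under the ROC curve (the scores are hypothetical):

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC as the probability that a positive (landslide) sample is scored
    higher than a negative one; ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = np.sum(pos[:, None] > neg[None, :])   # correctly ranked pairs
    ties = np.sum(pos[:, None] == neg[None, :])
    return float((greater + 0.5 * ties) / (pos.size * neg.size))

print(auc_score([1, 1, 0, 0], [0.9, 0.2, 0.3, 0.1]))  # one misranked pair -> 0.75
```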

Fig. 7

ROC curves and AUC values of the models

Interpretation of models with SHAP

In LS mapping studies, some factors contribute only minimally to model predictions, while others are considerably more influential. Yu et al. (2023) emphasized that determining the relative importance of conditioning factors can help to better understand the causes of landslides in a region. The importance values of the conditioning factors for RF, CatBoost, XGBoost and LightGBM are presented in Fig. 8. These importance values vary between the models. However, lithology, altitude, distance to faults and aspect consistently rank in the top four, while slope length, TWI, plan curvature and profile curvature rank in the bottom four.
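The importance values in Fig. 8 come from the feature importance functions of the tree ensembles. A sketch of how such values are obtained with scikit-learn's RF implementation, on synthetic data in which only the first (hypothetically named) factor drives the label:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.random((300, 3))             # three synthetic conditioning factors
y = (X[:, 0] > 0.5).astype(int)      # only the first factor matters here

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# Mean-decrease-in-impurity importances; they sum to 1 across factors.
for name, imp in zip(["factor_a", "factor_b", "factor_c"], rf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```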

Fig. 8

Importance of the conditioning factors

The effects of the conditioning factors on the model outputs were interpreted not only through Fig. 8 but also by means of SHAP values. Kavzoglu and Teke (2022) emphasized that, unlike the feature importance functions of ML algorithms, SHAP can determine whether a factor contributes positively or negatively to the model output. The SHAP summary plot, one of the graphs offered by the SHAP library, visualizes the effect of each factor on the model prediction. Figures 9 and 10 show the SHAP values of the thirteen conditioning factors for RF, CatBoost, XGBoost and LightGBM. As can be seen in the SHAP summary plots, although the effects of the conditioning factors on the model outputs vary between models, altitude, lithology, aspect, distance to faults and slope were found to be the most effective factors. On the other hand, the SHAP values revealed that slope length, TWI, profile curvature and plan curvature were the least effective landslide conditioning factors. These results appear to be partially consistent with previous studies. For example, in the LS mapping study carried out by Akinci (2022) in the Arhavi, Hopa and Kemalpaşa districts of Artvin, TWI, slope length and curvature were determined to be the least important factors in the occurrence of landslides. Akinci and Zeybek (2021) produced landslide susceptibility maps of the Ardanuç district of Artvin province using LR, SVM and RF models; they determined that lithology, altitude and distance to roads were the most effective factors, while TWI and curvature were the least effective. In the study by Can et al. (2021), the predictions of an XGBoost-based landslide susceptibility model were interpreted using the SHAP summary plot. Similar to this study, the researchers determined that lithology and altitude had a greater impact on the prediction results, while profile curvature had the least impact.
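The intuition behind these SHAP values can be made concrete with a toy computation. The sketch below evaluates exact Shapley values by enumerating feature coalitions, replacing "absent" features with a fixed baseline (a simplification of the background expectation the SHAP library uses); the linear three-factor "model" is purely hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x; absent features take baseline values."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                # Shapley kernel weight for a coalition of size |S|.
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                with_i = [x[j] if (j in S or j == i) else baseline[j] for j in range(n)]
                without_i = [x[j] if j in S else baseline[j] for j in range(n)]
                phi[i] += w * (f(with_i) - f(without_i))
    return phi

def toy_model(v):
    # Hypothetical score: 2 x "lithology" + 1 x "altitude"; third factor inert.
    return 2.0 * v[0] + 1.0 * v[1]

phi = shapley_values(toy_model, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
print(phi)  # the contributions sum to f(x) - f(baseline) = 3.0
```

For a linear model the Shapley value of each factor is exactly its individual contribution, which is why the inert third factor receives a value of zero.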

Fig. 9

SHAP summary plots: a) RF, b) CatBoost

Fig. 10

SHAP summary plots: a) XGBoost, b) LightGBM

On the other hand, we interpreted the local contribution of the conditioning factors to the model outputs using SHAP waterfall plots (Fig. 11 and Fig. 12). In a waterfall plot, a red arrow indicates that the SHAP value of the factor is greater than zero, meaning that the factor contributes positively to landslide occurrence, while a blue arrow indicates a negative contribution (Chang et al. 2023). Figure 11 shows that distance to faults in the RF model and aspect in the CatBoost model had the largest negative impact on the occurrence of landslides in the study area. Lithology was also found to be a significant factor in both models. According to Fig. 12, altitude, aspect, distance to faults, distance to drainage and lithology in the XGBoost model (Fig. 12a), and distance to faults, altitude, land cover, aspect and slope in the LightGBM model (Fig. 12b), were the factors with the largest positive contribution to the occurrence of landslides in the study area.

Fig. 11

SHAP waterfall plots: a) RF, b) CatBoost

Fig. 12

SHAP waterfall plots: a) XGBoost, b) LightGBM

Discussion

Comparison of factors

ML algorithms have different decision-making mechanisms; consequently, even when the same factors are used, their contributions to the model predictions differ between models. However, ML models can only indicate the relative importance of factors through their feature importance functions. This is insufficient for interpreting how the models arrive at their predictions, which is why ML models are often referred to as "black box" models. To overcome this problem and explain the decision-making mechanism of ML-based landslide susceptibility models, methods such as SHAP should be used (Lu et al. 2023; Pradhan et al. 2023). The interpretability of a model is directly proportional to the ease of understanding the reasoning behind a specific decision. For this reason, the SHAP method has been applied in LS mapping (Pradhan et al. 2023; Vega et al. 2023; Zhang et al. 2023) and wildfire susceptibility mapping (Iban and Sekertekin 2022) studies.

When the SHAP summary plots for the XGBoost and LightGBM models are examined, elevation and lithology emerge as the most effective factors in the model outputs (Fig. 10). Since human activity in the region is concentrated at particular elevations, this suggests that areas with intense anthropogenic activity and weak lithological units are prone to landslides. Agricultural areas are often concentrated at specific altitudes and aspects within settlements, leading to increased human activity in these regions, and uncontrolled irrigation and excavation works can trigger landslides in these areas.

Comparison of models

In this study, the prediction performance of the ensemble learning algorithms was evaluated using accuracy metrics (OA, F1-score, precision, recall, RMSE and AUC) commonly used in LS mapping studies. Many recent studies in the LS mapping literature use these metrics to assess the accuracy of susceptibility models (Kavzoglu and Teke 2022; He et al. 2023; Sun et al. 2023a, b; Vega et al. 2023; Yu et al. 2023; Zhang et al. 2023). Kavzoglu and Teke (2022) suggest that OA is a reliable metric for measuring model robustness in LS mapping studies. Ye et al. (2022) explain that the F1-score ranges from 0 to 1, with values close to 1 indicating model reliability. As with OA, the model with the lowest F1-score was again RF (Table 6). The highest F1-score belonged to the LightGBM model, which showed a high classification capacity for both the training and validation datasets. In summary, LightGBM was found to be slightly superior to the other models in all metrics except AUC. In the LS mapping study conducted by Sun et al. (2023b), OA, precision, recall and F1-score were used to evaluate the accuracy of XGBoost and LightGBM models, and, as in this study, LightGBM gave better results than XGBoost in all metrics. Tree-based ensemble learning algorithms are also widely used in flood and forest fire susceptibility mapping. For example, Saber et al. (2022) used RF, LightGBM and CatBoost models for flash flood susceptibility prediction and determined that LightGBM outperformed the other models in terms of both evaluation metrics and processing time.

Another statistical criterion used in the study was AUC, the area under the ROC curve. The AUC value ranges from 0.5 to 1, and a value close to 1 indicates an excellent model (Sahin 2022). According to Chen et al. (2017), Jiao et al. (2019), and Wu et al. (2020), AUC values are generally grouped into five classes: poor (0.5–0.6), average (0.6–0.7), good (0.7–0.8), very good (0.8–0.9), and excellent (0.9–1). According to this classification, RF's performance was very good, while CatBoost, LightGBM and XGBoost were excellent. Evaluated in terms of AUC, this study is consistent with the literature, as many studies have reported that ensemble learning algorithms are more successful than single ML algorithms (Kavzoglu and Teke 2022).
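This five-way interpretation is straightforward to encode; applying it to the AUC values reported above:

```python
def auc_category(auc):
    """Five-way AUC interpretation (Chen et al. 2017; Jiao et al. 2019; Wu et al. 2020)."""
    if auc < 0.6:
        return "poor"
    if auc < 0.7:
        return "average"
    if auc < 0.8:
        return "good"
    if auc < 0.9:
        return "very good"
    return "excellent"

# AUC values of the four models from Fig. 7.
for model, auc in [("RF", 0.8976), ("CatBoost", 0.9708),
                   ("XGBoost", 0.9773), ("LightGBM", 0.9751)]:
    print(f"{model}: AUC = {auc:.4f} -> {auc_category(auc)}")
```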

On the other hand, to assess the practical usability of an LSM, it is important to consider not only the accuracy of the ML model but also the rationality of the generated LSM. To be rational, an LSM needs to classify the existing landslides in the study area correctly. In this study, the generated LSMs were compared with the LIM, and the distribution of landslide pixels across the susceptibility classes was determined (Table 5). An LSM is considered rational if the existing landslides in the LIM fall within areas of high susceptibility as much as possible, while the areas of very high susceptibility in the LSM remain as small as possible (Guo et al. 2021). In the RF, CatBoost, XGBoost and LightGBM models, the percentages of very high susceptibility areas were 4.53%, 5.95%, 4.93% and 5.25%, and the corresponding percentages of landslide pixels were 81.123%, 94.955%, 85.614% and 87.937%, respectively. The frequency ratios were 17.9079, 15.9588, 17.3659, and 16.7499 (Table 5). The LSMs showed a similar general trend: as the susceptibility level increased, the frequency ratio also tended to increase, and in all models the largest share of landslides was observed in the very high susceptibility class. This comparison showed that the landslide occurrence rate gradually increased from very low to very high susceptibility areas and that the LSMs produced were reasonable. However, the LSM produced by the CatBoost model appears more rational than the others, because in this model approximately 99.6% of the existing landslides fall within high and very high susceptibility areas. In the LS mapping study by Yavuz Ozalp et al. (2023), RF, GBM, XGBoost and CatBoost algorithms were used, and the researchers also evaluated the rationality of the resulting LSMs; their evaluation showed that the LSMs produced by the CatBoost and XGBoost models were reasonable and rational for their study area.
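The frequency ratios in Table 5 follow directly from the two percentages: the share of landslide pixels falling in a class divided by the share of the study area the class covers. Reproducing the very high class from the figures above:

```python
# Percent of study area covered by the very high class, and percent of
# landslide pixels falling in it, per model (values from Table 5).
area_pct = {"RF": 4.53, "CatBoost": 5.95, "XGBoost": 4.93, "LightGBM": 5.25}
slide_pct = {"RF": 81.123, "CatBoost": 94.955, "XGBoost": 85.614, "LightGBM": 87.937}

# Frequency ratio: landslide share divided by areal share of the class.
fr = {m: slide_pct[m] / area_pct[m] for m in area_pct}
for m, v in fr.items():
    print(f"{m}: FR = {v:.4f}")
```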

Limitations and future suggestions

The ML models used in this study were able to predict the probability of landslide occurrence in the study area well. However, the study has an important limitation. ML models need accurate, up-to-date and complete data during the training phase; if the data are incomplete or outdated, a model may perform poorly because it is not trained correctly. In this context, landslide inventory data are of vital importance in ML-based landslide susceptibility mapping studies. The most important constraint of this study concerns the currency of the landslide inventory data: the Artvin Provincial Disaster and Emergency Directorate last updated the landslide inventory map of the study area in November 2016. Since landslides that occurred in recent years are not included in the inventory map, the models may have been trained with incomplete data. This issue is also mentioned in the Artvin Provincial Disaster Risk Reduction Plan, and updating the landslide inventory maps has been added to its action plan. Therefore, future studies in the area should primarily focus on updating landslide inventory maps using field surveys and high-resolution satellite images. In addition, deep learning algorithms such as CNNs and RNNs could be used in future studies to produce highly accurate landslide susceptibility maps, and their performance could be compared with that of tree-based ensemble learning algorithms.

Conclusion

Artvin is a province prone to landslides due to its highly rugged topography. Landslides constitute a large part of the natural disasters occurring throughout the province; the provincial disaster risk reduction plan states that landslides are the priority disaster type in the region and proposes various strategic actions for risk reduction. Landslides are observed in all districts of Artvin and cause damage to transport networks, energy transmission and drinking water lines, degradation of agricultural and forest areas, damage to buildings, and loss of life. Therefore, landslide susceptibility mapping is of crucial importance for reducing landslide damage across the province. In this study, the landslide susceptibility of the Murgul district of Artvin province was analyzed. Four ML models, random forest, XGBoost, LightGBM and CatBoost, were employed to generate LSMs. The models considered multiple factors: land cover, aspect, slope, lithology, elevation, plan curvature, profile curvature, distance to drainage networks, distance to faults, distance to roads, slope length, topographic position index (TPI), and topographic wetness index (TWI). The LIM provided by the GDMRE and the Artvin Provincial Directorate of Disaster and Emergency, which documents 54 landslide polygons, was utilized in the analysis. The resulting LSMs classified the study area into five classes, very low, low, medium, high and very high, based on the natural breaks (Jenks) classification. The performance of the models was evaluated using metrics such as overall accuracy, precision, recall, F1-score, RMSE and AUC-ROC. The best-performing models were XGBoost in terms of RMSE and AUC, and LightGBM in terms of accuracy and F1-score. The LSMs produced with XGBoost and LightGBM are therefore highly valuable for landslide risk assessment and risk mitigation studies in the study area. The maps can guide engineers in disaster and emergency planning, decision-makers and urban planners in sustainable land use planning, and insurance companies in natural disaster insurance.