
1 Introduction

Landslides are among the most prevalent natural hazards in the world's sloping terrestrial environments (Froude and Petley 2018) and cause a large number of injuries, deaths and socio-economic losses every year (Corominas et al. 2014). The quantification of landslide susceptibility has also become increasingly important (van Westen et al. 2006; Gariano and Guzzetti 2016). Evolving from the earliest qualitative descriptions to quantitative analysis (Ayalew and Yamagishi 2005), it has long been a research hotspot for geologists and scholars worldwide (Aleotti and Chowdhury 1999). The quality of a susceptibility assessment is inextricably linked to the selected indexes and model. An adequate understanding of the factors that control landslides is needed to evaluate landslide susceptibility (Bozzano et al. 2010). Topography, geological formations (Xing et al. 2014) and climatic conditions (Gariano and Guzzetti 2016) are the main factors that influence landslide susceptibility. These will also serve as reference parameters for the identification of further indicators. Commonly used models for landslide susceptibility assessment include expert system models, such as the Analytic Hierarchy Process (Kayastha et al. 2013) and the Expert Scoring method (Aleotti and Chowdhury 1999); mathematical and statistical models, such as the Information Value Model (Che et al. 2012), the Entropy Weighting scheme (Devkota et al. 2013), Weight of Evidence (Regmi et al. 2014) and Logistic Regression (Kavzoglu et al. 2014); and machine-learning models (Devkota et al. 2013), such as Decision Tree (Guo et al. 2021), Random Forest (Catani et al. 2013), Neural Network (Yang et al. 2019) and support vector machines (SVMs) (Pradhan 2013; Huang and Zhao 2018). Each of these models has its strengths, as well as certain weaknesses (Dou et al. 2019).

Expert system models are affected by human judgment, and the parameters of machine learning models are difficult to tune. By comparison, the Information Value Model (hereinafter IVM) and the Frequency Ratio model (hereinafter FR), both of which are mathematical-statistical models, have the advantages of simple operation, wide application and good objectivity (Sharma et al. 2015; Wenyan and Xile 2020). In addition, they make it possible to grade the intervals of each indicator in a scientific manner. Emerging artificial intelligence algorithms are capable of high-speed processing of massive data and of self-organized learning (Liu et al. 2020). The random forest algorithm (hereinafter RF) is an artificial intelligence algorithm known for its simplicity and efficiency. It is an ensemble learning algorithm that combines decision trees as base classifiers and integrates the Bagging algorithm with random selection of splitting features; it achieves high accuracy, can handle large data sets, and can evaluate the importance of each factor in classification (XinHai 2013). When there are many disasters and indicators to compute, the advantage of advanced computer technology becomes relevant (Genuer et al. 2010). Landslide susceptibility modeling with RF allows feature importance to be measured from the node impurity calculated with the Gini index (Archer and Kirnes 2008). RF can be applied to landslide susceptibility analysis without setting the factor weights in advance, and it can be run in the visual software SPSS Modeler, which is easy to use.

In recent years, the number of landslides has increased exponentially under global warming (Pei et al. 2023). The permafrost regions of northeastern China are also affected by climate impacts such as rising temperatures and frequent extreme precipitation (Haque et al. 2019). Continuing permafrost degradation (Shan et al. 2022) has increased the development of geologic hazards such as ground thawing, ground collapse and landslides, all of which are caused by freeze-thaw processes. Landslides in mountainous regions such as the Greater and Lesser Khingan Mountains and the Taihang Mountains are widely scattered and cause very serious damage. They bring great losses to society and the economy while destroying the ecological environment, which has severely constrained the construction and social development of these regions.

According to the China Environmental Science Data Center, there are more than 13,700 landslides in the permafrost regions of northeastern China (Fig. 1a). However, because of the high latitude and altitude of the Greater and Lesser Khingan Mountains permafrost regions, as well as ecological and climatic constraints, it is difficult to carry out comprehensive disaster statistics in the field, and the number of landslide hazards counted is far smaller than the actual number. It is therefore of great significance to establish a scientific and reasonable landslide susceptibility assessment model and to draw a high-precision landslide hazard map in support of accurate landslide prevention, control, and management. However, most of the landslide susceptibility assessment models commonly used in existing research target fast landslides in low-latitude areas with warm climates and high precipitation. Because field surveys in the permafrost regions are difficult and landslides driven by permafrost thawing develop slowly, the applicability and predictive accuracy of these commonly used models in permafrost areas remain to be verified.

Fig. 1
2 maps and 6 photos. A, zoomed-in map of the northeast China with the location of the study area marked northwards. B, study area with B H road extending from north to the south. Hazard distribution locations are centered. C to h are photos of slope landslides by roads and bridges. The dotted lines mark the sections.

(a) Geographic location of the study area; map of the distribution of different types of permafrost, elevation, and historical hazard sites in northeast China. (b) Landslides related to the BH highway roadside infrastructure, the distribution of major transportation roads, and the distribution and topography of the major bodies of water. (c-h) Landslides not included in the official count, identified in parts of the study area during field reconnaissance. They are slow landslides caused by the thawing of permafrost and have had a significant negative impact on the nation's roadway infrastructure, operational services, the environment, the progress of ongoing works, and the economy

Therefore, this study uses the frequency ratio model, the information value model, and the random forest model, implemented with GIS, to model landslide susceptibility in the study area. The ROC curve is used to verify the simulation results of the models, and the random forest method is used to judge the degree of influence of the assessment factors on the development of landslides. This approach is intended to verify objectively the accuracy and applicability of the commonly used landslide evaluation models for susceptibility assessment and mapping in permafrost regions, and to provide an intuitive and effective reference for the early warning of landslide disasters in permafrost regions.

2 Study Area

The study area is the Beian to Heihe highway area (hereafter BH), located in the northwestern section of the Lesser Khingan Mountains in Northeast China between longitude 127°17′31″~127°21′24″E and latitude 49°30′57″~49°41′50″N. It lies within the island permafrost zone of China and belongs to the low-altitude, high-latitude permafrost area (Fig. 1b). The main mountain ranges within the study area trend north-east to south-west, and the terrain forms a narrow belt. Geologically, the area consists of truncated middle or low mountains with granitic volcanic rocks as the main constituent lithology. The elevation is high in the east and low in the west, high in the south and low in the north. In the study area, terrestrial sedimentation is accompanied by marine sedimentation, and the rock types are mostly a combination of mudstone, granite, volcanic rocks and siltstone, with clay and loess locally. The slope of the area is mostly in the range of 0–50%. Land use and land cover in the area are mostly cropland, wasteland, and mixed or dense forests. Many of the roads are flanked by unmanaged and muddy slopes, which increases the landslide susceptibility of the area. Landslides are frequent in the study area, but many of them are not registered by the Geological Hazard Census. Field investigation revealed dozens of slow-moving landslides caused by permafrost degradation along the BH alone. These landslides buried and destroyed retaining walls, blocked the highway (Fig. 1c, g), damaged transportation infrastructure such as embankments and retaining walls (Fig. 1d, h), and damaged unfinished construction works (Fig. 1e), which delayed the project and resulted in economic losses.
Considering the large amount of permafrost coverage in China, the economic importance of transportation in Northeast China, and the need for better landslide-related safety risk management and monitoring, the study of landslide susceptibility in permafrost regions is of great significance.

3 Materials

The factors affecting the development of landslides are complex and interconnected (Nguyen et al. 2023). For example, elevation data reflect the changes in topographic relief in the study area and, to some extent, the gullies, the vegetation changes, and the state of accumulation. Slope orientation affects landslide development mainly by influencing the degree of weathering of hillside rocks and the growth of vegetation.

Slope gradient and profile curvature represent the degree of concavity and convexity of the terrain surface, which indirectly affects the extent of landslide development (Yalcin 2008). River gullies also provide a source of hydraulic drive for landslides affected by permafrost thaw (van Westen et al. 2008). The selection of landslide assessment factors in this paper is based mainly on the distribution and development status of landslide hazards in the study area. After fully considering the scale of the study area and the difficulty of obtaining information, twelve assessment factors were selected: the distance from the water network (DW), the distance from the road (DR), the altitude, the slope direction, the slope gradient, the profile curvature, the topographic relief, the land use, the Normalized Difference Vegetation Index (NDVI), the precipitation, the stratum lithology, and the distance from the fault (DF). The raster size and data source of each factor are shown in Table 1.

Table 1 Data sources for evaluation factors

In the mathematical and statistical models, the above factors are used as input data, and the detailed grades of each indicator are overlaid and analyzed with the historical landslide hazard location layer. In the IVM, the information value is derived through this overlay analysis. Grades with zero value were then merged preferentially with neighboring grades, after which grades with similar values were merged with their neighbors. Finally, the values of the consolidated grades were recalculated to obtain the optimal grading of each evaluation index, and the landslide susceptibility zoning map was drawn accordingly. Precipitation in the figure is annual precipitation. Stratigraphic lithology is classified according to the degree of susceptibility to landslides: 1 denotes the areas or lithologies least susceptible to landslides, such as water bodies and wind-deposited beach sands, and 7 denotes the lithologies most susceptible to landslides, such as collapsible loess of stratigraphic age Q3p and Qh. Since more than 300 soil and rock classes are involved in the study area, they are not enumerated individually on the map.

4 Methods

Many countries and regions, such as the United States, Australia and Western Europe, have successively carried out research on the risk mapping of geologic hazards with landslide hazards as the main subject, whereas China's research in this field is relatively weak. Both the theoretical study of landslide disaster risk assessment and its practical application began relatively late in China (Huang 2007). With further work on disaster mitigation and prevention in China, the risk assessment of large-scale regional landslide disasters has gradually been carried out. The application of 3S technology, with GIS technology at its core, is currently booming in the geosciences and provides a fruitful technical platform for landslide disaster risk assessment (Guzzetti and Reichenbach 2000; Chae et al. 2017). GIS has powerful spatial analysis functions and spatial database management capabilities. It can analyze the statistical relationship between the occurrence of landslide disasters and environmental influencing factors at different spatial and temporal scales, and it can evaluate the probability of landslide occurrence and the possible consequences (Yalcin 2008). In this study, several commonly used GIS-based disaster susceptibility assessment models were used to evaluate landslide susceptibility in the Lesser Khingan Mountains study area (North-Eastern China), to provide a reference for the risk prediction of landslide susceptibility in permafrost regions.

4.1 Information Value Model

IVM is a statistical analysis method based on information theory (Yin and Yan 1988; Westen 1993). It is an effective method for regional geohazard prediction. From the viewpoint of information prediction, the generation of geohazards is related to the quantity and quality of the information obtained in the prediction process, which can be expressed as an amount of information. Geohazards are affected by many factors whose magnitude and nature differ from one environment to another, so a comprehensive study of regional geohazards should look for the “optimal combination of factors” rather than stop at a single factor. The main approach is to calculate the information value of the individual indicators on the GIS platform and then apply a weighted superposition of the indicators to obtain the comprehensive information value, from which the landslide susceptibility assessment model is established. The information value in the study area can be calculated by Eq. 1.

$$ I=\sum \limits_{j=1}^n\ln \frac{N_j/N}{S_j/S} $$
(1)

where I is the total information value summed over the assessment indicators, which can be used as a landslide susceptibility index; n is the number of assessment indicators; Nj is the number of landslides contained within a specific grading interval of a single evaluation indicator; N is the total number of landslides; Sj is the number of rasters within that grading interval of the indicator; and S is the total number of rasters.
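As a minimal sketch of Eq. 1, the following Python snippet computes the information value of each grading interval of one indicator and then sums the values of the intervals that a raster cell falls into. All counts, the second indicator's value and the function names are hypothetical and serve only as an illustration; intervals without landslides are left undefined here, which is why such intervals are merged with their neighbors before the calculation.

```python
import math

def information_value(n_j, n_total, s_j, s_total):
    """Information value ln((N_j/N)/(S_j/S)) of one grading interval."""
    if n_j == 0:
        return None  # ln(0) is undefined; such intervals need merging first
    return math.log((n_j / n_total) / (s_j / s_total))

# Hypothetical counts for one indicator (e.g. slope) with three intervals:
# (landslides in the interval, raster cells in the interval)
slope_classes = [(5, 4000), (20, 3000), (10, 3000)]
N = sum(n for n, _ in slope_classes)  # total number of landslides
S = sum(s for _, s in slope_classes)  # total number of raster cells

slope_iv = [information_value(n, N, s, S) for n, s in slope_classes]

def total_information(values_per_indicator):
    """Sum the information values of the intervals a cell falls into (Eq. 1)."""
    return sum(values_per_indicator)

# A cell in the second slope interval and in an interval of another
# (hypothetical) indicator whose information value is 0.25
cell_total = total_information([slope_iv[1], 0.25])
```

A positive value indicates that landslides are over-represented in the interval relative to its share of the area, and a negative value indicates the opposite.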

The information value of each graded state of every evaluation indicator was assigned to the raster layer, and the spatial analysis tool was used to superimpose the information value raster layers of all evaluation indicators to obtain the total information value. The total information value raster layer was then divided according to the watershed units. The average total information value within each unit was used as the information value of that sub-watershed and then reclassified (Fig. 2). The entire study area was categorized into four landslide susceptibility zoning classes: low susceptibility, medium susceptibility, high susceptibility, and very high susceptibility.
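The averaging and reclassification step can be sketched as follows. This is an illustration under stated assumptions, not the GIS workflow used in the study: the cell values, the unit labels and the three breakpoints are hypothetical, and the paper does not report the breakpoints that were actually applied.

```python
from bisect import bisect_right
from collections import defaultdict

def zonal_mean(cell_values, cell_units):
    """Average the total information value of the cells in each unit."""
    sums, counts = defaultdict(float), defaultdict(int)
    for value, unit in zip(cell_values, cell_units):
        sums[unit] += value
        counts[unit] += 1
    return {unit: sums[unit] / counts[unit] for unit in sums}

CLASSES = ["low", "medium", "high", "very high"]

def classify(value, breaks):
    """Map a value to one of four classes using three ascending breakpoints."""
    return CLASSES[bisect_right(breaks, value)]

# Hypothetical total information values and the watershed unit of each cell
values = [-1.2, -0.8, 0.1, 0.3, 1.4, 1.6]
units = ["A", "A", "B", "B", "C", "C"]
breaks = [-0.5, 0.5, 1.0]  # assumed breakpoints, not taken from the paper

unit_means = zonal_mean(values, units)
unit_classes = {u: classify(m, breaks) for u, m in unit_means.items()}
```

With these inputs, unit A falls into the low class, unit B into the medium class and unit C into the very high class.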

Fig. 2
12 grading maps of the study area. Road buffer is between 1000 to 1500, fault buffer is less than 3000, river buffer is less than 3000, curvature is negative 3 to 1, precipitation is 590 to 630, N D V I is less than 0.9, elevation is 320 to 400, slope is mostly 0 to 10, relief amplitude is 3 to 6, land use is forest, and aspect is east.

Grading map for each assessment factor. a DR. b DF. c DW. d Profile curvature. e Precipitation. f NDVI. g Elevation. h Lithologic Hazard Level Classification. The danger level increases with increasing values, with 7 being the most hazardous. i Slope. j Topographic relief. k Land use. l Slope direction

4.2 Random Forest Model

The random forest algorithm is an artificial intelligence algorithm known for its simplicity and efficiency. It is an ensemble learning algorithm that uses decision trees as base classifiers and integrates the features of the Bagging algorithm and the random split selection method. As a result, it delivers high accuracy, can assess the importance of each factor in classification, and can handle large amounts of data. The landslide hazard sites in the study area were divided into two groups: 70% (35) of the hazard sites were used as a training set for the random forest, and the remaining 30% (15) were used as a validation dataset to assess the prediction rate.
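The 70%/30% division of the 50 hazard sites can be reproduced in outline as shown below. The sketch assumes a seeded random shuffle, which the paper does not state explicitly; the site identifiers, the seed and the function name are hypothetical.

```python
import random

def split_sites(site_ids, train_fraction=0.7, seed=42):
    """Shuffle site ids reproducibly and split into training and validation."""
    rng = random.Random(seed)
    shuffled = list(site_ids)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

site_ids = range(50)  # 50 hazard sites, as implied by the 35/15 split
train_sites, validation_sites = split_sites(site_ids)
```

Fixing the seed keeps the two sets identical across runs, which matters when the same split is reused for several models.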

Twelve assessment factors were selected according to the characteristics of landslide development in the study area. Decision tree models were trained on multiple samples, and the individual decision trees were combined into a random forest whose prediction is the average of the predictions of the individual trees. Susceptibility grading was performed on the model predictions, and a landslide susceptibility map of the study area was generated using GIS. Random forests also allow the importance of features to be measured through the node impurity calculated with the Gini index. The model is then compared with the FR and the IVM, which are simple, efficient and widely used probabilistic statistical methods. In this way it is possible to verify the effectiveness of the RF model objectively and to provide an intuitive and effective reference for the early warning of landslides and other geologic hazards in northeast China.
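To illustrate how the Gini index measures node impurity, the sketch below computes the impurity of a node and the decrease obtained by one split. It covers a single node only and is not the SPSS implementation; in a random forest, such decreases are accumulated per feature over all nodes and trees to give the feature importance. The labels and the split are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Impurity decrease achieved by splitting parent into left and right."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

# Hypothetical node: 1 = landslide cell, 0 = non-landslide cell
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [1, 0, 0, 0]  # e.g. a split on a slope threshold

decrease = gini_decrease(parent, left, right)
```

Here the parent impurity is 0.5, each child has an impurity of 0.375, and the split therefore reduces the impurity by 0.125.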

4.3 Frequency Ratio Method

Using probabilistic statistics, the FR was calculated for the 12 evaluation factors, and the assessment factors were then superimposed to obtain the landslide susceptibility assessment map. First, the disaster sites were divided into the same 70% training set and 30% validation set used for RF. Second, ArcGIS was used to overlay the training-set disaster sites with each assessment factor map, extract the required data, and calculate the frequency ratio for each category of every evaluation factor (Table 2). Finally, a landslide susceptibility index was created by summing the frequency ratio values (Eq. 2), and susceptibility mapping was performed.

$$ DSI=\sum \limits_{i=1}^n\frac{M_{ij}/N_{ij}}{M_T/N_T} $$
(2)

where n is the number of factors; Mij is the number of landslide hazard sites in the jth subclass of factor i, j being the subclass into which the raster cell falls; Nij is the number of rasters occupied by that subclass; MT is the total number of landslide hazard sites; NT is the total number of rasters occupied by the investigated area; and DSI is the landslide susceptibility index. The larger the susceptibility index, the larger the probability of landslide occurrence.
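A minimal sketch of the frequency ratio calculation in Eq. 2 is given below. The factor names, subclasses and counts are hypothetical and chosen for easy arithmetic; the function names are likewise assumptions made for illustration.

```python
def frequency_ratio(m_ij, n_ij, m_total, n_total):
    """FR of one subclass: (M_ij / N_ij) / (M_T / N_T)."""
    return (m_ij / n_ij) / (m_total / n_total)

# Hypothetical subclasses: name -> (landslide sites, raster cells)
factors = {
    "slope": {"0-10": (7, 5000), "10-20": (21, 3000), ">20": (7, 2000)},
    "land_use": {"forest": (14, 6000), "cropland": (21, 4000)},
}
M_T, N_T = 35, 10000  # total training sites and total raster cells

fr_table = {
    factor: {name: frequency_ratio(m, n, M_T, N_T)
             for name, (m, n) in classes.items()}
    for factor, classes in factors.items()
}

def susceptibility_index(cell_classes):
    """Sum the FR of the subclass a cell belongs to, over all factors."""
    return sum(fr_table[factor][name] for factor, name in cell_classes.items())

dsi = susceptibility_index({"slope": "10-20", "land_use": "cropland"})
```

A frequency ratio above 1 means that the subclass contains more landslides than its share of the area would suggest. In this example the cell receives 2.0 from slope and 1.5 from land use, giving an index of 3.5.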

Table 2 Frequency ratios of assessment factors that have the greatest impact on landslides

4.4 ROC Curve Verification

In this paper, the ROC curve is used to test the simulation results of the models. The receiver operating characteristic (ROC) curve takes each disaster site, together with its evaluation factors, as a test object and treats the occurrence and non-occurrence of landslides as a binary classification (positive and negative classes). The ROC curve of each model and the area under the curve (AUC) were analyzed. The AUC is the criterion for judging the performance of a model:

  • AUC = 0.5 means that the model results have no reference value;

  • AUC < 0.5 means that the model does not conform to the real situation;

  • AUC > 0.5 means that the model is effective, and the closer the value is to 1, the more accurate the model.
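The area under the ROC curve (AUC) can be computed without drawing the curve, because it equals the probability that a randomly chosen landslide cell receives a higher susceptibility score than a randomly chosen non-landslide cell. The sketch below uses this rank-based definition; the scores and labels are hypothetical, and the study itself may have obtained the value with a different tool.

```python
def auc(scores, labels):
    """AUC as the chance that a landslide cell outscores a non-landslide cell.

    Ties count as one half (equivalent to the Mann-Whitney U statistic).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical susceptibility scores; 1 = landslide, 0 = no landslide
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
model_auc = auc(scores, labels)
```

In this example eight of the nine landslide and non-landslide pairs are ranked correctly, so the value is 8/9, well above the 0.5 of an uninformative model.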

5 Result and Discussion

The random forest (RF) model was constructed with SPSS software. The importance of each assessment factor was calculated from the mean decrease in Gini value over the nodes of the decision trees, and the calculated factor weights were normalized. The results are shown in Fig. 3a.
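The normalization of the factor weights can be sketched as below, assuming that the weights are scaled so that they sum to 100%. The raw scores are hypothetical and cover only four of the twelve factors.

```python
def normalize_importance(raw):
    """Scale raw mean-Gini-decrease scores so that they sum to 100 percent."""
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("importance scores must have a positive sum")
    return {name: 100.0 * value / total for name, value in raw.items()}

# Hypothetical raw scores for four factors
raw_scores = {"slope": 0.030, "DW": 0.024, "relief": 0.021, "DR": 0.005}
importance = normalize_importance(raw_scores)
ranked = sorted(importance, key=importance.get, reverse=True)
```

Sorting the normalized values yields the ranking of the factors from most to least influential.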

Fig. 3
A bar graph and 3 maps of the study area. A plots factor importance percentage versus impact factors with the bar for slope degree the highest. b, c, and d have assessment results respectively for I V M, F R, and R F, respectively. The landslide susceptibility is very high for hazard locations at the center.

(a) Importance of factors assessed by the RF method. (b) Assessment results for IVM, (c) Assessment results for FR, (d) Assessment results for RF

Slope, DW, and topographic relief are the three most important factors affecting landslide development in the study area. Each has an importance of more than 10%, and together they account for 34.9% of the total importance. NDVI, stratigraphic lithology, altitude, and land use, with a combined importance of 36.4%, are also important in influencing landslide development. The remaining factors, with a cumulative importance of 28.7%, have a lesser influence on landslide development.

According to the results of the calculations, the five most influential factors are slope, DW, topographic relief, NDVI, and stratigraphic lithology. Based on the regional geological data and the analysis of the tectonic background, this result can plausibly be attributed to two causes:

  • slow-moving landslides on low-altitude mountains give only weak signals;

  • landslides are concentrated on mid-slope areas where the elevation rises and the slope gradient varies over a wide range.

The slope gradient directly determines the forces acting on the loose deposits on a slope and controls the initiation conditions of landslides. Viewed objectively, therefore, slope and topographic relief carry relatively large weights. Factors such as the density of the river network, the volume of runoff and the size of the watershed area reflect, to some extent, the water source conditions and catchment capacity for landslides affected by permafrost freezing and thawing in the region, and they indirectly control the dynamic conditions of landslide initiation.

The permafrost areas of the Lesser and Greater Khingan Mountains have natural forest stands and high vegetation coverage, which help protect permafrost, conserve water, and prevent soil erosion and land desertification. When the surface vegetation is sparse or has been destroyed over a large area, the exposed surface rock weathers easily and forms large accumulations of detritus, which makes landslides more likely. The factor with the least influence is the DR.

Using the evaluation index system established in Fig. 2, the IVM grades the state of each evaluation index on the GIS platform and then overlays the grades with the landslide hazard points to obtain the number of landslides in each graded interval of every evaluation index. The information value of each assessment indicator is calculated with the IVM formula, and the total information value is then obtained through raster overlay analysis. The total information value is divided into intervals, guided by the results of the fieldwork, to rank the landslide susceptibility in the study area. Finally, the landslide susceptibility map is drawn (Fig. 3b).

In the FR susceptibility assessment, the calculation results for the factors that RF identified as having the greatest impact on landslides are displayed in Table 2. The factors were then overlaid in ArcGIS according to Eq. (2), which resulted in the FR landslide susceptibility map shown in Fig. 3c.

Since the sample disaster sites are divided into a training set and a validation set, a model accuracy ROC curve is obtained from the training set data, and the area under this curve is the accuracy of the model. The prediction rate is obtained in the same way from the validation set. The accuracy and prediction rate of each model are organized in Table 3.

Table 3 Statistics of RF, IVM, FR model simulation and validation results

Comparing the accuracy of the models, IVM has the highest accuracy, 75.4%, which indicates that the information value method fits the training data better than the RF and FR models. The prediction rates of RF and FR differ considerably: the RF prediction rate reaches 80.3%, so the model predicts with high accuracy, whereas the FR prediction rate is 73.5%, which lags behind RF and indicates an average prediction effect. The difference between the accuracy and the prediction rate reflects the stability of a model. This difference is 6.7 percentage points for RF but only 3.7 percentage points for IVM, so the simulation results of IVM in permafrost regions are more accurate and stable.

6 Conclusion

The permafrost regions in China are landslide-prone areas. Based on the topographic, geological, vegetation, and anthropogenic factors in the study area, 12 assessment factors and three models commonly used in existing studies were selected for landslide susceptibility evaluation, and the model results were verified with ROC curves. The following conclusions were obtained:

  1. The accuracy and prediction rate of IVM are higher than those of the RF and FR methods, and the difference between its accuracy and prediction rate is smaller. This means that the model is more stable and is more suitable and effective for landslide susceptibility analysis in permafrost areas.

  2. According to the importance analysis of the assessment factors by the RF model, slope, distance from the water system, topographic relief, vegetation coverage (NDVI) and stratigraphic lithology are the factors that have the greatest influence on the development of landslides in the permafrost regions.

  3. Several of the factors with the greatest influence on landslide development directly or indirectly affect the permafrost condition and also have a non-negligible influence on the freezing and thawing process of permafrost, so the influence of the permafrost state on landslides should not be ignored. In addition to the traditional natural environment factors, landslide susceptibility mapping in permafrost areas should also consider the distribution and state of permafrost.

  4. For all three models, the number of landslide occurrences increases with the susceptibility class, and the susceptibility grading is in line with the results of the field investigation, which shows that all three models have a certain applicability in the permafrost region. A comparison of the distribution ratios of landslide verification sites in each susceptibility class shows that the accuracy and zoning of the FR are strongly affected by the number of samples, whereas the IVM is affected more by the spatial distribution of the assessment factors, and its prediction results are relatively stable compared with those of the FR and RF models. However, the accuracy of the models is mostly around or below 75%, and a disaster susceptibility assessment model better suited to the permafrost regions has yet to be developed.

  5. Because the permafrost regions in northeast China are large, their natural environment is relatively complex. The permafrost is now severely degraded under global warming, and the number of landslides in the historical statistics and data is therefore much smaller than the actual number. Sparse statistical data affect the simulation results of models such as RF and FR in the permafrost area, so a census of landslides in the permafrost regions and a study of the spatial distribution of permafrost are also urgently needed.