A step beyond landslide susceptibility maps: a simple method to investigate and explain the different outcomes obtained by different approaches

Landslide susceptibility assessment is vital for landslide risk management and urban planning, and the scientific community is continuously proposing new approaches to map landslide susceptibility, especially by hybridizing state-of-the-art models and by proposing new ones. A common practice in landslide susceptibility studies is to compare (two or more) different models in terms of AUC (area under ROC curve) to assess which one has the best predictive performance. The objective of this paper is to show that the classical scheme of comparison between susceptibility models can be expanded and enriched with substantial geomorphological insights by focusing the comparison on the mapped susceptibility values and investigating the geomorphological reasons of the differences encountered. To this aim, we used four susceptibility maps of the Wanzhou County (China) obtained with four different classification methods (namely, random forest, index of entropy, frequency ratio, and certainty factor). A quantitative comparison of the susceptibility values was carried out on a pixel-by-pixel basis, to reveal systematic spatial patterns in the differences among susceptibility maps; then, those patterns were put in relation with all the explanatory variables used in the susceptibility assessments. The lithological and morphological features of the study area that are typically associated to underestimations and overestimations of susceptibility were identified. The results shed a new light on the susceptibility models, identifying systematic errors that could be probably associated either to shortcomings of the models or to distinctive morphological features of the test site, such as nearly flat low altitude areas near the main rivers, and some lithological units.

A landslide susceptibility assessment is generally based on the analysis of the correlation between the location of landslide areas and the spatial distribution of a wide set of predisposing factors, usually including geological, geomorphological, and hydrogeological features and soil/land cover characteristics. The correlation can be established by means of various techniques: to date, statistical approaches (Reichenbach et al. 2018, and references therein), artificial neural networks (Lee et al. 2004;Ermini et al. 2005), and machine learnings algorithms (Brenning 2005;Yao et al. 2008;Pham et al. 2016) are among the most widespread and established techniques. To improve the effectiveness of landslide susceptibility assessments, brand new methods and hybrid versions of already established methods are continuously proposed (Huang et al. 2017;Lagomarsino et al. 2017;Shirzadi et al. 2017;Tsangaratos et al. 2017;Pham et al. 2018;Yang et al. 2019). However, some studies warned that besides the model used, the results of the susceptibility assessment, and the resulting maps, are sensitive to various factors: the model configuration, the sampling strategy, the selected conditioning factors, and the spatial resolution Sbroglia et al. 2018;Liu et al. 2019). The most widespread technique to validate landslide susceptibility maps is to build ROC (receiver operating characteristic) curves and to calculate the AUC (area under the curve), which is commonly used also as an indicator of the spatial forecasting effectiveness of the susceptibility model (Frattini et al. 2010).
In landslide susceptibility studies, the comparison of different models in the same test site is a common practice, surely more than in other fields of landslide research such as rainfall thresholds (Lagomarsino et al. 2015) or distributed physically based models (Sorbino et al. 2010). Usually, when a new approach is proposed, a comparison with other state-of-the-art models is performed (Shirzadi et al. 2017;Chen et al. 2018;Pham et al. 2018); in addition, the recent scientific literature on landslide susceptibility is rich of application to case studies where different susceptibility maps of the same area are obtained by means of different methods (Yilmaz 2009;Pham et al. 2016;Youssef et al. 2016;Bueechi et al. 2019). However, the different models are compared just in terms of AUC, and the one with the highest score is deemed to outperform the others. A discussion of the reasons why different models produced different results in some given map locations is rarely carried out, and a thorough investigation of the spatial patterns of the differences obtained is usually lacking, thus diminishing the geomorphological meaning of the comparison between susceptibility maps.
The objective of this paper is to show that the classical scheme of comparison between susceptibility models can be expanded and can be enriched with substantial geomorphological insights. We used four susceptibility maps of the Wanzhou County (China) obtained by Xiao et al. (2019) using four different classification methods (namely random forest, index of entropy, frequency ratio, and certainty factor). A quantitative comparison of the susceptibility values estimated on a pixel-by-pixel basis was carried out to reveal systematic spatial patterns in the differences among susceptibility maps; then, the identified patterns were put in relation with all the explanatory variables used in the susceptibility assessments. Using this approach, the lithological and  Table 1 for explanations of lithological units)
The study area extends into the subtropical humid monsoon zone and features a mild climate with abundant sunshine and a mean annual precipitation of 1191.3 mm, mainly concentrated from May to September (about 90% of the yearly rainfall). During summer, the rain is characterized by short intense rainstorms (up to more than 100 mm/day). The Yangtze River runs throughout the study area from southwest to northeast, and 93 large and small streams form a complex surface runoff network. The elevation gradually decreases from east to west forming a hilly landscape, with an overall step-like morphology formed by multi-level fluvial terraces, which resulted from the combination of repeated tectonic uplift stages and the Yangtze River erosion (Liu 2010) (Fig. 1).
The bedrock lithology encompasses sandstones, mudstones, shales, and limestones (Table 1), with nearly horizontal stratifications. Extending from both sides of the Yangtze River, the outcropping bedrock mainly increases in age from Triassic to Jurassic (2.3 to 137 Ma), with sporadic Permian (299 to 252 Ma) and Quaternary (from 2.5 Ma). The middle Jurassic Shaximiao Group, consisting of alternating layers of sandstone and mudstone (J2 s in Table 1), is the most widely distributed geological unit.
The total area of landslides is 102.64 km 2 , accounting for about 3% of the total study area. They are mainly slow-moving slides, with near horizontal sliding surface (Gui 2014). The maximum and minimum sizes of the landslides identified and mapped in the study area are approximately 9.6 × 10 5 m 2 and 30 m 2 , respectively. Since the impoundment of the Three Gorges Reservoir in 2003, many dormant landslides have been reactivated, mainly triggered by water level fluctuation and rainfall (Gui 2014). The well-known Anlesi Landslide, Caojiezi Landslide, and Taibaiyan Landslide are all ancient landslides with a volume of more than ten million cubic meters, and they all developed in sub-horizontally dipping sandstone and mudstone interbedded strata (Gui et al. 2016).

Input data and methodology
Four landslide susceptibility maps of Wanzhou County are used in this research (Xiao et al. 2019), and they are based on random forest (RF), frequency ratio (FR), certainty factor (CF), and index of entropy (IOE) models. Thirteen causal factors were selected for susceptibility assessment: altitude, slope, aspect, plan curvature, profile curvature, stream power index (SPI), topographic wetness index (TWI), bedding structure, lithology, land use, geological structure, distance to rivers, and distance to roads/railways. Table 2 provides further details on the classification scheme used for every causative factor and their importance in the susceptibility assessment: "FR" and "CF" represent the contribution of each class calculated from the FR and CF models,

Original Paper
Landslides 17 & (2020) respectively, and "Wj" is the weighting coefficient of each factor (calculated using the IOE model). Distance to rivers was variably divided into several classes depending on the typology of rivers, and the same as the roads. The AUC of RF, FR, CF, and IOE models were 80.1%, 72.7%, 72.9%, and 73.8%, respectively. Further details on how the susceptibility assessments were carried out can be found in Xiao (Xiao et al. 2019).
Analysis and comparison of these four susceptibility maps is the main research aim. The proposed method of comparison includes the following steps.
-First, in a GIS system, the susceptibility maps were paired, and their values were subtracted to define their differences (Fig. 2). Since the RF model was characterized by a higher AUC value than the other three and by a better spatial agreement between inventoried landslides and highest susceptibility classes (Xiao et al. 2019), it was chosen as a benchmark, and six comparison maps were obtained following two distinct criteria. Group 1 comprises all the straight comparisons with the benchmark and includes "RF-FR," "RF-CF," and "RF-IOE" (Fig. 2a-c), while group 2 includes comparisons among the other models, namely "IOE-FR," "IOE-CF," and "CF-FR" (Fig. 2d-f). Since the raster value of each susceptibility map is between 0 and 1, the values of the comparison maps could potentially range from − 1 to 1. A simple visual inspection of Fig. 2 reveals that the differences in group 1 are wide, while in group 2, they are relatively small. Most important, the differences are not evenly distributed, and some spatial patterns can be clearly recognized in almost all comparisons. This step can be considered a preliminary stage, allowing a visual inspection of the spatial distribution of the differences and providing quantitative data to be analyzed in the further steps. -The value of comparison maps was divided into three levels, namely "underestimation," "approximation," and "overestimation." The value of maps in group 1 is broken at − 0.50 and 0.50, while the value is broken at − 0.25 and 0.25 in group 2. We used a different approach in group 2 because the models involved provided quite similar results with differences contained between + 50% and − 50%. Compared with group 1, the different classification demands a different interpretation of the results: in the first group, the focus is finding and inspecting relevant differences, and in the second group, the focus is inspecting small differences. The percentage of each level in the total area is shown in Table 3.
-The subsequent step of the methodology is aimed at identifying systematic underestimations and overestimations, with a spatial pattern that could be influenced by one or more causative factors used in the susceptibility assessment.
To explore the key factors that led to differences between susceptibility maps, a large number of statistics (Table 4 and  Table 5) were performed on the area interested by underestimations and overestimations for each class of each causative factor. Data presentation and analysis are in the next section. -The critical classes and factors are further investigated by counting the underestimations/overestimations there located and calculating the mean value and standard deviation of the difference with the benchmark. -As a last step, the results obtained are critically analyzed: the factors and classes responsible for systematic overestimations and underestimations of certain models are discussed trying to identify possible reasons of success/failure of the susceptibility models in some geomorphological settings.

Results
For each pair of maps, the overestimation and underestimation pixels were put in relation with all the causative factors used in the susceptibility analysis. Thirteen causal factors were used in the susceptibility maps, and each factor was divided into several classes (totaling 80 classes). For each class, we calculated "A" as the percentage of each class in the total area, "B" as the ratio of the underestimation pixels found in the class to total underestimation, and "A − B" as the difference between the two ratios, which could be used to identify critical classes with anomalous clustering of underestimations. The same process was repeated for the overestimations.
As an example, in Table 4, the comparison between RF and FR maps shows that concerning altitude factor, class 1, covers only 14.61% of the total area (A), but it contains 68.25% of the underestimation pixels (B), providing an imbalance of 53.64% (B − A) that could be symptomatic of a systematic distribution of underestimations.
Overall, thirteen classes were analyzed according to this procedure. For the ease of reading, only two representative factors are listed in Table 4 and Table 5, while a complete summary is provided as supplementary material. The number of underestimation pixels of IOE-FR and CF-FR comparison map and overestimation pixels of IOE-CF comparison map is nearly zero, so they do not appear in Table 5.
According to the "B − A" values defined for each class, the classes with the highest imbalances were selected for further investigation (Table 6). Overestimation pixels in group 1 are clearly related to classes 1, 2, and 6 of lithology, while several factors are involved in systematic underestimations: altitude, slope, and aspect are the main imbalanced classes in "RF-FR" and "RF-CF," while 98.56% underestimation area of "RF-IOE" is located in class  1 of altitude. In group 2, the underestimation pixels of "IOE-CF" are randomly distributed and no regular pattern could be observed. The overestimation pixels of "IOE-FR" are all located in class 1 of altitude, and 98.04% overestimation pixels of "CF-FR" are in class 2 of altitude. The most imbalanced classes underwent also a visual inspection that clearly exemplifies their relationship with underestimation or overestimation areas. Figure 3 reports four figures that clearly visualize this connection: underestimations of "RF-IOE" driven by altitude class 1 (Fig. 3a), overestimations of "RF-IOE" driven by lithology (Fig. 3b), overestimations of "IOE-FR" driven by altitude (Fig. 3c), and overestimations of "CF-FR" driven by altitude (Fig. 3d). In each figure, the yellow represents the imbalanced class(es), and the other classes are in gray. Almost all underestimation pixels are at class 1 of altitude, and the percentage is 98.56% marked in Fig. 3a. In Fig. 3b, there is a clear trend for overestimation pixels, and 63.16% of the pixels are in lithology classes 1, 2, and 6. All the overestimation pixels are located in class 1 of altitude in Fig. 3c. For CF-FR comparison map (Fig. 3d), 98.04% of overestimation pixels are located in class 2 of altitude.
The analysis of underestimations and overestimations, as defined in Table 3, characterizes the distribution of extreme values, while to represent the overall situation in each class, mean values are more useful. Starting from the comparison maps (Fig. 2), simple statistical properties of the differences in susceptibility values and their distribution across each critical class is calculated and listed in Table 7; moreover, the differences in susceptibility values were averaged for each class and histograms were built (Fig.  4). The brown in the histogram is the class with the most pixels of underestimation, and the purple column is the class with a distinct concentration of overestimation pixels. In general, the distribution of mean values is driven by the same classes that drive the extreme underestimation and overestimations. In Fig. 4a, the value of class 1 is negative, while the values of other five classes are similar and positive, this highlighting the clear underestimation trend associated to that class. Concerning lithology in Fig. 4b, all classes are characterized by positive (or very small negative) values, but classes 1, 2, and 6 stand out for their higher values. In Fig. 4c, only class1 is positive, and the other five categories are negative and significantly lower than class 1. The value of class 2 is the highest in Fig. 4d, but not much different from the other five classes, which is in accordance with the proportion of extreme pixel in the comparison map. The percentage of overestimation in CF-FR map is only 0.28%, while the percentage of underestimation in RF-IOE, overestimation in RF-IOE, and overestimation in IOE-FR is 0.93%, 2.42%, and 0.72%, respectively. Table 3 we observed that extreme overestimations and underestimations are limited in number (the overall discrepancies range from 1.57% of the total area in RF-FR to 3.35% in RF-IOE), the impact on AUC values is great (AUC values of IOE, FR, and CF are lower than RF by about 6-7%). This is because overestimations and underestimations are not randomly distributed, but some spatial patterns are present. These patterns, in turn, are clearly related to systematic errors in the susceptibility assessment that can be linked to a limited effectiveness of certain models to account for specific features of the test area.  Concerning group 1 (RF model used as a benchmark in comparison with all other models), extreme values of both overestimations and underestimations are present and both show a spatial pattern that can be related to some morphological or geological feature.

Even if in
Our analyses showed that for all the three comparisons, overestimations have a similar trend which is driven by the same geological factors. Indeed, the classes 1, 2, and 6 of lithology clearly stand out as the most important for all the comparisons. We could not identify some spatial, geological, or structural feature that is associated to these three classes and at the same time differentiate them from the remaining nine classes (see also Fig. 1 and Table 1). We therefore concluded that the reason is more likely related to a higher effectiveness of the random forest algorithm in analyzing a relatively high number of geological classes (namely, 12) and using them for a more reliable susceptibility assessment. As a consequence, it would be possible that in this study lithology has a more significant contribution in RF model when calculating the susceptibility. To verify this hypothesis, we produced a new susceptibility map based on RF model using only 12 factors (lithology was not taken into account): its AUC was only 55%, much lower than the 80% value obtained with the full configuration. This proves the conjecture that lithology occupies a vital importance in RF model and that in this study RF could exploit a complex lithological information better than the other models. Table 5 Statistics on underestimation pixels and overestimation pixels in two representative factors (group 2)

Factors
Classes

IOE-CF None
Concerning the spatial distribution of underestimations, different trends were identified for the three comparisons of group 1. For "RF-FR" comparison, our analyses showed that they are clearly driven by class 1 of altitude ("B − A" = 53.64%), class 1 of slope ("B − A" = 43.73%), and class 1 of aspect ("B − A" = 36.27%). The three morphological classes have a high degree of spatial overlap and basically represent low elevation and flat areas alongside the Yangtze River. This is confirmed also by the values characterizing class 6 of land use (areas besides water bodies-"B − A" = 45.68%) and by the planar and profile curvature factors (see supplementary material), for which the classes representing planar morphologies are imbalanced toward underestimation as well. Similar flat morphologies are typical of the study area and have been traditionally explained with two main geomorphological reasons (Li et al. 2001;Jian et al. 2009;Huang 2012): differences in weathering of horizontal geological formations (sandstone with interbedded mudstone) and presence of relict fluvial terraces formed by the Yangtze River and its tributaries.
For "RF-CF" comparison map, there are six important classes and no one is more prominent than others. "RF-CF" map has fewer extreme underestimations than "RF-FR" map, which may be due to the different distribution of CF and FR value. In the landslide susceptibility evaluation, the RF value of each factor ranges from 0 to 3.63, and the CF value ranges from − 1 to 0.75. The range of CF values is smaller, and the distribution is more even than FR values.
For "RF-IOE" comparison map, class 1 of altitude is the only imbalanced class. IOE model has a weight coefficient for each factor, and altitude in the study area is the largest in weight and significantly larger than the other factors. At the same time, the contribution of this class is the largest among the six classes of altitude. This is why the susceptibility map performed by IOE model has a too large value in class 1 of altitude.
To sum up, the application of the proposed approach in group 1 allowed to identify systematic errors in some of the models and to relate them to a combination of computational characteristics of the algorithm and morphological features of the study area.
Even if landslides are geomorphological processes, landslide susceptibility assessments usually rely on advanced statistical approaches that, although providing good results, completely bypass the geotechnical triggering mechanism and neglect the geomorphological features of the study area. The models used in this comparison make no exception, but the comparison procedure proposed in this manuscript can lead to identify and explain some systematic errors. In addition, the outcomes of the statistical analysis described in this work can be used as a starting point to address further and more thorough geomorphological investigations on the study area. For instance, the comparison highlighted a possible involvement of relict fluvial terraces in controlling the landslide susceptibility of the area in low altitude and low gradient areas. Nowadays, terraces are not present everywhere and have been partially masked or erased by natural processes and human modifications of the territory (Li et al. 2001;Yang et al. 2017). However, the presence of many systematic errors concentrated in the areas once occupied by terraces may be symptomatic of the fact that landslides could have been one of the geomorphological processes involved in the disruption of the relict terraces. Of course, this is only a hypothesis in the attempt of giving a physical interpretation to the results of the statistical models used in the susceptibility assessment. Nevertheless, as stated in the test site description, landslides can be found on fluvial terraces and can be activated by water level fluctuations in the reservoir (Yang et al. 2017). Simpler models (FR, CF, and IOE) fail in addressing such complex concepts given the simple inputs available for the statistical correlation (morphological attributes and thematic maps). In contrast, RF model indirectly "recognizes" the geomorphological fluvial terrace feature and the mechanism of the water level fluctuation by combining some simple features (low gradient, flat morphology, elevation close to the river course); thus, it calculates a more appropriate susceptibility level.
The number of imbalanced pixels cannot be closely related to the AUC, as we observed that the statistical model with the lowest AUC (IOE) is the least imbalanced one and the most imbalanced one has the highest AUC (FR) ( Table 3). In this case of study, this outcome has a relative importance because the differences in AUC between FR, CF, and IOE are very contained; nevertheless, it represents an additional argument stressing the importance of the need of comparison procedures more advanced than just a comparison between AUC values.
Concerning group 2 (comparison among IOE, CF, and FR), the models provide similar results, leading to similar AUC values and absence of extreme underestimations and overestimations. In addition, the spatial distribution of the small differences in susceptibility values is quite random, meaning that no class can imprint a spatial pattern.
In the "IOE-CF" comparison, there is not a single class or a group of classes that clearly dominates the spatial distribution of underestimations, while overestimations are nearly absent.
In the "IOE-FR," the underestimations are negligible, while the overestimations, although small in absolute value, show an interesting trend: they are all located in class 1 of altitude. The weight coefficient is the only difference between IOE and FR model. Altitude has a largest weight coefficient and class 1 contributes most among classes, so class 1 of altitude is the area where the value of IOE susceptibility map is much larger than FR susceptibility map.
A similar trend can be observed in "CF-FR": the underestimations are nearly absent, while overestimations are mainly found in class 2 of altitude which may be due only to their different computational characteristics of the algorithm. However, the differences of "CF-FR" are very small, from 0.09 to 0.30. The AUC of susceptibility map based on FR and CF models are 72.7% and 72.9%, respectively. It indicates that the two susceptibility maps based on FR and CF models are very similar, both in terms of AUC values and spatial distribution of susceptibility values.
It should be stressed that our methodology proposes a new approach to make a more thorough comparison among susceptibility maps; therefore, the definition of overestimation and underestimation should not be considered in an absolute way: we defined them by means of a comparison among different models and all of them are representations of reality, not reality itself. Among the four models, the one with the highest AUC value is selected as a benchmark, but after that, a procedure is developed to expand the comparison beyond the simple AUC values.  Rossi et al. (2010) showed that the best AUC does not necessarily means best predictive power, and they suggested that a more effective susceptibility assessment can be obtained merging different models. Our findings are partially in line with these statements and lead them to a further step, where the geotechnical and geomorphological reasons of differences should be ascertained, and systematic errors should be identified and excluded from the generalization procedure to improve the hazard assessment. However, in our test site, a merging or averaging of the four available susceptibility maps would not be advisable: the analysis of group 2 highlighted that IOE, CF, and FR provided very similar maps; thus, their joint use would lead to include redundant information in the final hazard assessment; more importantly, this information would be affected by systematic errors caused by a limited effectiveness of the models to correctly take into account some morphologic and lithological parameters.

Conclusion
This paper shows that the classical scheme of comparison between susceptibility models in terms of AUC can be expanded and can be enriched with substantial insights connected to the physical features of the study area. Working on a test site in Wanzhou County, a quantitative comparison of the susceptibility values estimated on a pixel-by-pixel basis was carried out to reveal systematic spatial patterns in the differences among four susceptibility maps obtained with four models (RF, IOE, FR, and CF), then the identified patterns were put in relation with all the explanatory variables used in the susceptibility assessments. This procedure showed where the most relevant differences among the susceptibility maps were located and the morphological and lithological features of the study area that are typically associated to underestimations and overestimations of susceptibility could be identified. In our case of study, the main differences in susceptibility maps were clearly driven by some geological formations and by nearly flat low altitude areas near the main rivers. The results shed a new light on the susceptibility models identifying systematic errors that could be possibly associated to distinctive geomorphological features of the test site.
Compared to traditional methods for the comparative analysis of different assessments of landslide susceptibility (mainly based only on AUC calculation and visual inspection of the maps), we believe that the proposed procedure has some advantages: -It is more thorough, since maps are compared pixel by pixel instead of summarizing their overall performance in a single parameter (AUC) and comparing that single value; -It is more detailed and refined, leading to outcomes that can be useful to gain new perspectives and insights on the susceptibility assessments, possibly identifying weak points of the models used; -It holds a more complete geomorphological meaning: landslides are complex processes, and differences in predictive performances of the models and in the resulting mapped values could be related to an inadequate parameterization of the geomorphological features of the study area; -It constitutes a step forward toward the recognition of the importance of geomorphology in interpreting the results of landslide susceptibility assessments: advanced approaches based on statistics or machine-learning algorithms usually completely bypass the geomorphological implications of the problem at hand, and geomorphology should be used to interpret the results obtained; -It provides more substantial conclusions: averaging different susceptibility assessments or comparing maps and taking the one with the highest AUC as the "ground truth" could lead to biased hazard assessments. The methodology of comparison proposed in this paper could be used to ascertain localized and generalized underestimations and overestimations of susceptibility, thus reducing uncertainties and identifying hotspots worth of further inspection.