1 Introduction

Landslides can inflict large-scale human casualties and economic losses (Petley 2012) and, as a result, they are widely researched. Continuing geoscientific advances have established routines to estimate where landslides may occur, a concept referred to as susceptibility (Reichenbach et al. 2018). The standard definition of susceptibility corresponds to the relative likelihood of landslide occurrence within a specific mapping unit, under the control of a set of predisposing factors (Fell et al. 2008; Tanyaş et al. 2021). For over three decades, this definition has been addressed through practices and techniques that have evolved alongside technological advances. From the early 1970s until the 1980s, drawing susceptible slopes on a map was the result of a combination of field surveys and manual geomorphological mapping (Verstappen 1983). As GIS made its appearance, bivariate statistical models followed closely and became commonplace (Naranjo et al. 1994), until they were superseded by their multivariate counterparts (Süzen and Doyuran 2004). The latter, in the form of binomial Generalized Linear Models (GLM), then became the standard and are still the most widely used technique in the landslide susceptibility literature (Budimir et al. 2015; Reichenbach et al. 2018). However, the hypothesis that landslide occurrences are linearly related to a number of predisposing factors is a strong assumption that may not be supported by evidence. Because of this, in recent years, a number of models capable of estimating nonlinear relations have been proposed as an improvement over the simpler GLM framework. Some of these have simply translated the GLM context into a more flexible Generalized Additive Model (GAM) structure (Goetz et al. 2011; Lombardo and Tanyas 2020; Steger et al. 2021; Titti et al. 2021). Others have taken the route of machine learning. Among these, support vector machines (e.g., Marjanović et al. 2011), decision trees (e.g., Lombardo et al. 2015), maxent (Di Napoli et al. 2020) and neural networks (e.g., Lee et al. 2004; James et al. 2021; Anderson-Bell et al. 2021; Schillaci et al. 2021) have gained the spotlight within the geoscientific community.

However, irrespective of the algorithmic architecture, all these models have kept answering the same scientific question over the decades. They have sought to estimate where landslides may occur, while remaining blind to how large landslides may become once they are triggered at a given location. This problem is typically addressed via physically-based approaches (Li et al. 2012; Bout et al. 2018; Van den Bout et al. 2021), although their application over large regions is commonly hindered by the limited availability of geotechnical data. Data-driven approaches, by contrast, can bypass this issue by using proxies instead of geotechnical parameters. And yet, only a single contribution currently exists where landslide sizes (or, to be precise, their aggregated measure per slope unit) are statistically estimated over territories whose scale is not suited to physically-based modeling. This work, authored by Lombardo et al. (2021), makes use of a GAM framework, under the assumption that landslide sizes (measured as planimetric areas) behave over space according to a log-Gaussian probability distribution. However, this work failed to address a fundamental problem: the use of a Gaussian likelihood implies that the bulk of the distribution is well estimated but the tails are inevitably misrepresented. As a result, large landslides belonging to the right tail of the size distribution are underestimated, and so is the hazard to which local communities and infrastructure are exposed.

In this work, we take a different route, while still trying to combine some essential elements of the research progress described above. Firstly, we opted for a Neural Network as our reference model structure. Such a framework is particularly appealing thanks to its widespread support and use in Python (Van Rossum and Drake Jr 1995), whose libraries make codes easily shareable and the resulting analyses easily reproducible. Secondly, we modeled the traditional landslide susceptibility together with the landslide sizes. However, differently from Lombardo et al. (2021), here we introduce the structure of a multi-label classifier rather than regressing the continuous distribution of landslide planimetric areas. Specifically, we run a Hierarchical Neural Network (HNN), where a binary element returns the landslide susceptibility and a multi-label element returns the probability of a specific landslide size class. This is one of the major strengths of a Neural Network, as it allows hierarchical models to be implemented with relative ease, whereas they would otherwise require complex joint probability models in statistics (Pimont et al. 2021).

We do so by testing our modeling protocol over the co-seismic landslides triggered by the Gorkha earthquake (25th April 2015). The landslide inventory generated by Roback et al. (2018) was accessed at the repository built by Schmitt et al. (2017) and Tanyaş et al. (2017). The mapping unit we chose corresponds to Slope Units (Carrara 1988).

The manuscript is structured as follows: Sect. 2 describes the area where landslides occurred in response to the Gorkha earthquake, followed by a description of the mapping unit and predictors we chose (Sects. 3 and 4) and of the preprocessing steps (Sect. 5). In Sect. 6 we explain the Neural Network structure and how we used it in this work. Sections 7 and 8 present and discuss the results, respectively. Finally, Sect. 9 summarizes the novelty of the work we present and anticipates future extensions.

We stress here that, to promote similar applications, we have shared data and codes in a GitHub repository accessible at link here.

2 Study area and landslide inventory

On April 25th 2015, the 7.8 \(\hbox {M}_W\) Gorkha earthquake (Elliott et al. 2016) struck the area in the proximity of Kathmandu, Nepal (see Fig. 1a, b). The resulting losses amounted to approximately 9000 victims which, together with the infrastructural damage, made this event the worst natural disaster in Nepal since 1934 (Kargel et al. 2016). The earthquake-affected region covered an area of about 500 by 200 kilometers. A large portion of this area also suffered from widespread landslides (see Fig. 1c), with 24,915 landslides (24,903 of which were triggered by the 25th of April mainshock) and a total of 86.6 \(\hbox {km}^2\) of planimetric landslide area (mean = 0.003 \(\hbox {km}^2\); standard deviation = 1.74 \(\hbox {km}^2\); maximum = 1.72 \(\hbox {km}^2\)). Figure 1c shows both the spatial distribution of the mapped landslides and the boundary of the area surveyed by Roback et al. (2017) to generate the coseismic inventory. Their inventory is currently considered one of the most accurate and complete (Tanyaş and Lombardo 2020) among the available global coseismic inventories (Schmitt et al. 2017; Tanyaş et al. 2017).

Fig. 1 Panel a shows the large-scale geographic context; panel b zooms in to highlight the ground motion induced by the Gorkha earthquake; panel c depicts the spatial distribution of the coseismic landslides, shown in terms of spatial densities; panel d zooms further in to provide an overview of the slope unit partition we generated and its consistency with the underlying aspect distribution

3 Mapping unit

We partition the study area into Slope Units (SUs). These correspond to a refinement of the SUs already generated and presented in Tanyaş et al. (2019); further details on their computation are provided below. We recall here that a slope unit corresponds to the space bounded between ridges and streamlines of catchments, which can be defined across different scales. Therefore, SUs efficiently depict the morphodynamics of a landslide and also provide a spatial partition in which a single unit is independent from the others in terms of initial failure mechanism (Alvioli et al. 2020; Amato et al. 2019). We generated the SUs by using r.slopeunits, a software developed and made available by Alvioli et al. (2016). The software optimizes the SU generation by maximizing the within-unit homogeneity and the between-unit heterogeneity of the slope aspect (Alvioli et al. 2018), in a deterministic framework where the output is mainly controlled by four parameters (Alvioli et al. 2016; Castro Camilo et al. 2017):

  1. Circular variance (cv), the parameter which regulates how strictly r.slopeunits should define homogeneity. The possible values are bound between 0 and 1: values close to 0 allow only a small variance of the aspect, whereas values close to 1 allow a large aspect variance.

  2. Flow accumulation threshold (FAthres), the parameter that initializes the search. It typically represents the starting size of the spatial partition, from which r.slopeunits further dissects the landscape into smaller units.

  3. Minimum SU area (areamin), a parameter representing the lower bound r.slopeunits tries to optimize for the SU delineation.

  4. Cleaning SU area (cleansize), a parameter representing the size below which r.slopeunits applies a merging routine where small units are fused into larger adjacent ones.

Here we start from the r.slopeunits parameterization used in Tanyaş et al. (2019). However, as their work was carried out at a global scale, the SUs they delineated were too coarse for a site-specific assessment. Therefore, we slightly modified the required parameterization and set the cv to 0.4, FAthres to 500,000 \(\hbox {m}^2\), areamin to 80,000 \(\hbox {m}^2\), and cleansize to 50,000 \(\hbox {m}^2\). Figure 1d shows the details of the SUs we generated, superimposed on the aspect for clarity. Overall, the resulting SUs have an average area of 0.22 \(\hbox {km}^2\) and a standard deviation of 0.21 \(\hbox {km}^2\); the fact that these two statistics are of similar magnitude is already a clear indication of how rough the landscape is at the transition between the Lesser and Greater Himalayas (Burbank 2005).

4 Predictors

The predictors we use are reported in Table 1. There we report ten predictors, nine of which (all but \(\hbox {SU}_A\)) have been calculated at a much finer spatial resolution compared to the SUs. As our model is expressed at the SU scale, we summarized the distribution of each of these predictors at this level, via their respective mean and standard deviation values. As a result, the model we build features 18 predictors rather than nine, plus \(\hbox {SU}_A\), for a total of 19.
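The aggregation described above can be illustrated with a minimal sketch. It assumes hypothetical pixel-level records of the form (SU identifier, predictor values); the function name `aggregate_per_su` and the predictor names `slope` and `pga` are illustrative choices, not taken from the shared code, and the use of the population standard deviation is likewise an assumption.

```python
from collections import defaultdict
from statistics import mean, pstdev


def aggregate_per_su(pixels, predictor_names):
    """Summarize fine-resolution predictors per slope unit (SU).

    pixels: iterable of (su_id, {predictor_name: value}) records.
    Returns {su_id: {"<name>_mean": ..., "<name>_std": ...}}.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for su_id, values in pixels:
        for name in predictor_names:
            grouped[su_id][name].append(values[name])

    summary = {}
    for su_id, columns in grouped.items():
        row = {}
        for name in predictor_names:
            row[f"{name}_mean"] = mean(columns[name])
            # Population standard deviation (assumption); a sample
            # estimator would be an equally plausible choice.
            row[f"{name}_std"] = pstdev(columns[name])
        summary[su_id] = row
    return summary


# Hypothetical toy data: two SUs, two pixels each, two predictors.
pixels = [
    (1, {"slope": 10.0, "pga": 0.2}),
    (1, {"slope": 20.0, "pga": 0.4}),
    (2, {"slope": 30.0, "pga": 0.6}),
    (2, {"slope": 30.0, "pga": 0.8}),
]
features = aggregate_per_su(pixels, ["slope", "pga"])
```

Each fine-resolution predictor thus contributes two columns per SU, which is how nine predictors become 18 model inputs.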

Table 1 Predictors’ summary table

Overall, each of the predictor variables is normally distributed, so we standardized them (mean of 0 and standard deviation of 1; Ali et al. 2014) to bring all variables to a common scale. We also tried a min-max normalization (see Al Shalabi et al. 2006), but standardization produced better performance (the results of these preliminary tests are not reported here), so we chose to standardize the predictors rather than normalize them.
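For reference, the two rescaling options compared above can be sketched as follows. This is a minimal illustration on a hypothetical predictor column (`slope_mean`), not the preprocessing code used in the study.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale to zero mean and unit standard deviation (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Rescale linearly to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical SU-level values of one predictor.
slope_mean = [12.0, 18.0, 25.0, 31.0, 44.0]
z = standardize(slope_mean)
m = min_max(slope_mean)
```

Standardization keeps the relative spread of each variable around its mean, whereas min-max normalization ties the scale to the two most extreme observations.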

5 Preprocessing

There is currently no standard for classifying landslide size in terms of planimetric area. One of the most important reasons to divide landslides into classes based on area is to investigate how certain predisposing factors affect each class, as this may shed light on how small to large landslides are catalyzed. To optimally divide the landslide area (aggregated as the sum of all landslide planimetric areas per SU) into classes, we used the Fisher-Jenks algorithm to derive the class boundaries (see Jenks 1967). The Fisher-Jenks algorithm determines the optimal break points by choosing those that minimize the variance within each category, though it requires the number of breaks to be specified in advance. To determine the optimal number of breaks, various break counts were trialed in ascending order, starting with two, and the Goodness of Variance Fit (GVF) was measured at each iteration. The GVF is a value between 0 and 1 that indicates how well the categories produced by the Fisher-Jenks algorithm reflect the “natural breaks” in the data (Khamis et al. 2018). This iterative process continues until the number of breaks yields a GVF value beyond a chosen threshold, in this case 0.85. Class 0 corresponds to SUs with a landslide area of 0, indicating that no landslide occurred. The remaining non-zero values were broken into 5 classes using the Fisher-Jenks method described above, resulting in the six classes described in Table 2.
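The iterative procedure described above can be sketched as follows. This is a minimal dynamic-programming version of the optimal one-dimensional partition, written under two assumptions: it counts classes rather than breaks, and it stops as soon as the GVF reaches the threshold. The function names and the toy values are hypothetical, and the shared code may rely on a dedicated library instead.

```python
def _ssd_lookup(sorted_vals):
    """Prefix sums giving the sum of squared deviations of any slice in O(1)."""
    s1, s2 = [0.0], [0.0]
    for v in sorted_vals:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def ssd(i, j):
        # Squared deviations from the mean of sorted_vals[i:j], with j > i.
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    return ssd


def fisher_jenks(values, n_classes):
    """Return (classes, within-class SSD) of the optimal 1-D partition."""
    x = sorted(values)
    n = len(x)
    ssd = _ssd_lookup(x)
    inf = float("inf")
    # cost[k][j]: minimal within-class SSD when x[:j] is split into k classes.
    cost = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    cut = [[0] * (n + 1) for _ in range(n_classes + 1)]
    cost[0][0] = 0.0
    for k in range(1, n_classes + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = cost[k - 1][i] + ssd(i, j)
                if c < cost[k][j]:
                    cost[k][j], cut[k][j] = c, i
    # Backtrack the class boundaries from the last class to the first.
    classes, j = [], n
    for k in range(n_classes, 0, -1):
        i = cut[k][j]
        classes.append(x[i:j])
        j = i
    return classes[::-1], cost[n_classes][n]


def gvf(values, n_classes):
    """Goodness of Variance Fit: 1 - within-class SSD / total SSD."""
    x = sorted(values)
    total = _ssd_lookup(x)(0, len(x))
    _, within = fisher_jenks(values, n_classes)
    return 1.0 - within / total


def choose_classes(values, threshold=0.85):
    """Increase the class count, starting from two, until the GVF reaches the threshold."""
    n_distinct = len(set(values))
    for k in range(2, n_distinct + 1):
        score = gvf(values, k)
        if score >= threshold:
            return k, score
    return n_distinct, 1.0


# Hypothetical aggregated landslide areas per SU (non-zero values only).
areas = [1, 2, 3, 10, 11, 12, 20, 21, 22]
k, score = choose_classes(areas)
classes, _ = fisher_jenks(areas, k)
```

The cubic cost of this sketch is acceptable for illustration only; for thousands of SUs an optimized implementation would be preferable.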

Table 2 Preliminary landslide class structure

However, two issues arose at this stage. Firstly, the data are heavily imbalanced. This is not uncommon when dealing with classification problems, and there are several ways of remedying it. In this case, we have opted to apply oversampling, which uses existing datapoints to interpolate artificial values, assigned to artificial records that closely resemble existing ones. The process of generating artificial data points continues until each class has the same number of records, albeit with a proportion of them being synthetic (Chawla et al. 2002).

The second issue is that, for oversampling to take place, a minimum number of existing datapoints is required in each category to allow for interpolation. As seen in Table 2, class 5 has only one record, which is insufficient to apply oversampling. We have overcome this issue by combining class 5 with class 4. We have also merged class 2 into class 1 to reduce the number of tunable parameters by 33% overall, thus decreasing training time substantially and increasing performance as a result (more information available in Sect. 6.2). This produced the final classes detailed in Table 3.
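The oversampling step and its minimum-record requirement can be illustrated with a simplified interpolation scheme in the spirit of Chawla et al. (2002). This is a sketch under stated assumptions, not the implementation used in the study: the function names, the number of neighbours `k`, and the toy feature vectors are all hypothetical.

```python
import random


def smote_like(samples, n_new, k=2, seed=0):
    """Create n_new synthetic records by interpolating between a record
    and one of its k nearest neighbours from the same class."""
    if n_new > 0 and len(samples) < 2:
        raise ValueError("interpolation needs at least two records in the class")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(samples))
        base = samples[i]
        others = samples[:i] + samples[i + 1:]
        # Sort the remaining records by squared Euclidean distance to the base.
        others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


def oversample(classes, k=2, seed=0):
    """Pad every class with synthetic records up to the size of the largest one."""
    target = max(len(rows) for rows in classes.values())
    return {
        label: list(rows) + smote_like(list(rows), target - len(rows), k, seed)
        for label, rows in classes.items()
    }


# Hypothetical two-feature records for three imbalanced classes.
toy_classes = {
    0: [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.1, 0.4), (0.4, 0.2)],
    1: [(1.0, 1.0), (1.2, 1.1), (1.1, 1.3)],
    2: [(2.0, 2.0), (2.2, 2.1)],
}
balanced = oversample(toy_classes)
```

A class holding a single record makes `smote_like` raise an error, which mirrors the reason why a one-record class has to be merged with a neighbouring class before oversampling.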

Table 3 Final landslide class structure

The next step was to create binary landslide presence/absence values. This was done by making a copy of the landslide class data and setting all non-class 0 values to 1 and all class 0 values to 0 (as numerical data). The final preprocessing step applied to the dependent variables was to encode the landslide class data so that it could be used in a multi-class classifier model.
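The two target variables described above can be sketched as follows. The one-hot scheme is an assumption, chosen because it is the usual encoding paired with a categorical cross-entropy loss; the source does not specify the encoding, and the function names and labels are hypothetical.

```python
def to_binary(size_classes):
    """Presence/absence target: 0 for class 0, 1 for any landslide class."""
    return [0 if c == 0 else 1 for c in size_classes]


def one_hot(size_classes, n_classes):
    """Encode each class label as a vector with a single 1 (assumed encoding)."""
    return [[1 if c == k else 0 for k in range(n_classes)] for c in size_classes]


# Hypothetical size-class labels for six SUs, using the four final classes.
size_class = [0, 1, 0, 3, 2, 1]
binary = to_binary(size_class)
encoded = one_hot(size_class, n_classes=4)
```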

The result of these operations can be seen in Fig. 2, where panel a shows the traditional presence/absence landslide data, whereas panel b provides further insight into the classification scheme we adopted to run landslide-size-oriented prediction models.

Fig. 2 Upper panel highlights the distribution of SUs labeled as stable or unstable; the lower panel provides additional graphical information on the classification scheme we opted for to predict specific ranges of landslide size classes, aggregated per SU

5.1 Exploratory data analysis

Figure 2a shows that the spatial distribution of landslide occurrence is not uniform across the study area. Rather, the overwhelming majority of SUs with landslides are located in the Northern-Central sector. Small clusters of landslides also appear on the Southern border, although they are generally fewer in number, and even fewer are present in the Eastern and Western peripheries. Table 4 shows that the pattern described above results in a significant global spatial autocorrelation. This implies that spatial processes influence the landslide susceptibility of the SUs. This observation can inform the interpretation of the results, as good results should also reflect this spatial autocorrelation, which is likely induced by local amplification of the ground motion.

Table 4 Spatial autocorrelation of binary landslide occurrence

Figure 2b shows the spatial distribution of landslide classes, using the categories defined in Sect. 5. It further elucidates that the Northern sector features more slope units with landslide occurrences and that, as expected, the number of SUs associated with Class 1 is predominant compared to Class 2, which is in turn more numerous than Class 3. This is a natural consequence of the landslide size distribution and it follows the Frequency Area Distribution summarized in Fan et al. (2019) for the Gorkha case.

6 Hierarchical deep neural network for landslide class prediction

The HNN implemented in this study is designed to predict the likelihood of landslide occurrence and size in spatial units, across the study area.

Hence, for each SU (see Sect. 3) and by using the variables introduced in Sect. 4, the model predicts the likelihood of landslide occurrence and, if a landslide is likely to occur, it also estimates which landslide-size-class is expected.

We have chosen to build a two-output hierarchical neural network to guide the model’s training (i.e., to inductively bias the model) to focus first on the binary occurrence of landslides and subsequently on their size. As shown in Sect. 7.3, this improves the accuracy of the size-class output. Another merit of having both outputs is that it enables a more straightforward comparison to existing approaches. The output predicting the landslide susceptibility can be used to compare our HNN to other models that perform the same task. This is the best way of comparing the model to existing approaches because there is no widely used or accepted model that predicts size, and it is therefore not possible to compare the landslide-size-class prediction with an equivalent baseline.

6.1 Model architecture

We stress here that we implemented a single model. For simplicity, however, in the rest of the manuscript we will refer separately to its two outputs as binary and class. The former represents the traditional notion of landslide susceptibility, or the likelihood of landslide occurrence. The latter refers instead to the estimated likelihood of a landslide belonging to a specific size class, as reported in Table 3.

Fig. 3 Conceptual framework of the proposed model

The proposed neural network architecture consists of a sequence of layers starting with an input layer, which is followed by a Multi-Layer Perceptron (MLP). The aim of this MLP is to encode the input and provide the first classification, the binary one. This first MLP is then followed by a second MLP, which takes as input the binary prediction together with the input as encoded by the first MLP, and outputs the class prediction. For clarity, we have graphically translated the architecture described above in Fig. 3.
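The data flow described above can be sketched as a single forward pass in plain Python. Everything beyond the two-MLP structure is an assumption made for illustration: the layer sizes, the ReLU, sigmoid and softmax activations, the random initialization, and all names are hypothetical, since the actual settings are reported in Appendix A.

```python
import math
import random


def dense(x, weights, bias):
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def softmax(x):
    peak = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, params):
    """Return (binary probability, class probabilities) for one slope unit."""
    encoded = relu(dense(x, *params["encoder"]))               # first MLP encodes the input
    p_binary = sigmoid(dense(encoded, *params["binary"])[0])   # first output: occurrence
    joint = encoded + [p_binary]                               # encoded input + binary prediction
    hidden = relu(dense(joint, *params["class_hidden"]))       # second MLP
    p_class = softmax(dense(hidden, *params["class_out"]))     # second output: size class
    return p_binary, p_class


rng = random.Random(0)
N_FEATURES, N_HIDDEN, N_CLASSES = 19, 8, 4  # hidden size is an arbitrary choice
params = {
    "encoder": init_layer(N_FEATURES, N_HIDDEN, rng),
    "binary": init_layer(N_HIDDEN, 1, rng),
    "class_hidden": init_layer(N_HIDDEN + 1, N_HIDDEN, rng),
    "class_out": init_layer(N_HIDDEN, N_CLASSES, rng),
}
x = [rng.gauss(0.0, 1.0) for _ in range(N_FEATURES)]  # one standardized SU record
p_binary, p_class = forward(x, params)
```

The key design point is the concatenation step: the second MLP sees both the learned encoding and the binary prediction, so the class output is conditioned on the susceptibility estimate.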

The loss functions for the binary and class outputs are the binary cross entropy and the categorical cross entropy respectively (details in Appendix A). The overall loss of the model is calculated as the linear combination of both losses. The parameter of the linear combination weights the importance of one classification vs the other. This loss equation is formally defined as:

$$\begin{aligned} {\mathcal {L}}_{\text {tot}} = \alpha \cdot {\mathcal {L}}_{\text {BCE}} + (1 - \alpha ) \cdot {\mathcal {L}}_{\text {CCE}}, \end{aligned}$$
(1)

where \(\alpha \in [0,1]\) is a hyper-parameter.
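As a worked illustration of Eq. 1, the sketch below evaluates the combined loss for a single slope unit. It assumes the standard textbook forms of the binary and categorical cross-entropy; the function names and the example probabilities are hypothetical.

```python
import math

EPS = 1e-12  # guards against log(0)


def binary_cross_entropy(y_true, p):
    p = min(max(p, EPS), 1.0 - EPS)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def categorical_cross_entropy(one_hot, probs):
    return -sum(t * math.log(max(q, EPS)) for t, q in zip(one_hot, probs))


def total_loss(y_bin, p_bin, y_class, p_class, alpha):
    """Eq. 1: alpha * L_BCE + (1 - alpha) * L_CCE."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    bce = binary_cross_entropy(y_bin, p_bin)
    cce = categorical_cross_entropy(y_class, p_class)
    return alpha * bce + (1.0 - alpha) * cce


# Hypothetical SU with a landslide of size class 2.
y_bin, p_bin = 1, 0.8
y_class, p_class = [0, 0, 1, 0], [0.1, 0.2, 0.6, 0.1]
loss_balanced = total_loss(y_bin, p_bin, y_class, p_class, alpha=0.5)
```

Setting `alpha` to 1 reduces the objective to the binary loss alone, and setting it to 0 leaves only the class loss.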

Further information on the model architecture and activation functions is also provided in Appendix A.

6.2 Performance metrics

Throughout the landslide literature, the most common evaluation metric for landslide susceptibility is the Area Under the Receiver Operating Characteristic (ROC) curve, or AUC (Reichenbach et al. 2018). For consistency, the AUC will be used as the evaluation metric to compare our binary model with a binomial GLM baseline. The same applies to the evaluation of the class output, as we will compute a ROC curve for each class. Moreover, there are additional formats of the AUC that average the performance across classes and allow a single value to reflect the performance of the model on all classes. In this study, we will use both the one-vs-one weighted average and the one-vs-rest weighted and unweighted average AUC to evaluate the performance of the class output. The one-vs-one approach generates a ROC curve for each pair of classes, where one class is considered as the positive example and the other as the negative example. Then, it computes their AUCs and averages them, weighting each value based on the total number of examples belonging to the class pair.

The one-vs-rest approach generates a ROC curve for each class, where one class is considered as the positive example and the rest of the classes as the negative examples. Then, it computes their AUCs and averages them. In the weighted version, the weights are computed based on the total number of examples belonging to the selected class. The weighted version of these metrics is important because it is less affected by the class imbalance present in the dataset.
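The one-vs-rest averages described above can be sketched as follows. The AUC is computed here through its rank interpretation (the probability that a positive example scores higher than a negative one), which is equivalent to the area under the ROC curve; the function names and the toy predictions are hypothetical, and class labels are assumed to be the column indices of the probability table.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true, probs, weighted=False):
    """Average the per-class AUC, optionally weighting by class frequency."""
    classes = sorted(set(y_true))
    aucs, counts = [], []
    for c in classes:
        labels = [1 if y == c else 0 for y in y_true]
        aucs.append(auc(labels, [row[c] for row in probs]))
        counts.append(sum(labels))
    if weighted:
        return sum(a * n for a, n in zip(aucs, counts)) / sum(counts)
    return sum(aucs) / len(aucs)


# Hypothetical labels and predicted class probabilities for six SUs.
y_true = [0, 0, 0, 1, 1, 2]
probs = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.6, 0.2],
    [0.5, 0.3, 0.2],
    [0.1, 0.2, 0.7],
]
macro_auc = one_vs_rest_auc(y_true, probs)
weighted_auc = one_vs_rest_auc(y_true, probs, weighted=True)
```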

These types of metrics are slight modifications of the common AUC metric already used in the geomorphological literature. We opted to use them to provide a more complete description of the model performance, and further details on this suite of performance metrics are reported in Appendix A.

6.3 Influence of the binary and the class components

The performance assessment we opted for also includes a loss weight ratio test. The aim of loss weight ratio testing is to be able to better understand the impact that the presence of the first output has on the second, and vice versa, as well as to find the optimal combination of losses in a Neural Network.

By prescribing loss weights in a multi-output model (adjusting \(\alpha\)), it is possible to specify the extent to which each output’s loss contributes to the overall loss of the model. During loss weight ratio testing, 11 combinations of weights at equally spaced intervals are tested, with \(\alpha\) going from 0 to 1 in steps of 0.1 (equivalently, integer weights from 0 to 10 for the binary output, with the complementary weight assigned to the class output). More information on this test is provided in Appendix A.
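The grid of weight combinations described above can be sketched as follows. The helper name is hypothetical, and the retraining step is only indicated by a comment because it depends on the actual model code.

```python
def loss_weight_grid(n_steps=10):
    """Return (alpha, binary weight, class weight) triplets.

    Integer weights sum to n_steps; alpha is derived by division so that
    repeated additions of 0.1 do not accumulate floating-point error.
    """
    return [(w / n_steps, w, n_steps - w) for w in range(n_steps + 1)]


grid = loss_weight_grid()
for alpha, w_binary, w_class in grid:
    # A real experiment would retrain the network here with the loss
    # alpha * L_BCE + (1 - alpha) * L_CCE and record both AUC values.
    pass
```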

6.4 Ablation analysis

To understand the importance of each feature or set of features, we perform an ablation analysis. This essentially consists of removing predictors from the dataset and retraining the model. The difference in performance observed with and without a given predictor can be used as a measure of its importance or explanatory power. In our dataset, we have 19 predictors. Eighteen of these can be thought of as nine distinguishable signals that we convey to the model in a dual form, by using their mean and standard deviation values per SU. We recall here that the area of the SU does not have a dual representation (refer to Table 1). It can therefore also be said that there are 10 independent categories of predictors, each one reflecting distinct attributes of the slope units. When conducting the ablation study, rather than removing a single component of the dual signal and leaving its counterpart in the dataset, we opted to remove both. The results will be shown in Sects. 7.1.1 and 7.2.1, for the binary and class models, respectively.
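The grouped removal described above can be sketched as follows. The column naming scheme (`_mean` and `_std` suffixes), the predictor names, and the stand-in scoring function are all hypothetical; in particular, `toy_score` simply returns the fraction of columns retained and does not represent any model result.

```python
def predictor_groups(columns):
    """Group SU-level columns by predictor, e.g. 'slope_mean' and 'slope_std' -> 'slope'."""
    groups = {}
    for col in columns:
        base = col.rsplit("_", 1)[0] if col.endswith(("_mean", "_std")) else col
        groups.setdefault(base, []).append(col)
    return groups


def ablation(columns, train_and_score):
    """Drop one predictor group at a time and report the drop in score."""
    reference = train_and_score(columns)
    drops = {}
    for base, members in predictor_groups(columns).items():
        kept = [c for c in columns if c not in members]
        drops[base] = reference - train_and_score(kept)
    return drops


# Hypothetical subset of columns: two dual predictors plus the SU area.
columns = ["pga_mean", "pga_std", "slope_mean", "slope_std", "su_area"]


def toy_score(cols):
    # Stand-in for "retrain the model and return its AUC": purely illustrative.
    return len(cols) / len(columns)


groups = predictor_groups(columns)
drops = ablation(columns, toy_score)
```

In an actual run, `train_and_score` would retrain the network on the retained columns and return the AUC, so that each entry of `drops` measures the performance lost by removing one predictor category.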

7 Results

7.1 Binary benchmark

The first stage of our modeling protocol features a benchmark step where the binary component of our HNN is compared to a traditional binomial GLM (Atkinson and Massari 1998; Lombardo and Mai 2018), as the latter represents the most common modeling routine used to assess landslide susceptibility (Budimir et al. 2015; Reichenbach et al. 2018).

Fig. 4 Panel a summarizes the evolution of the binary model through the epochs. Panel b offers a performance comparison between our binary model based on NN and a binary baseline based on a binomial GLM

Figure 4a shows the tests we have run to select the best binary model. The abscissa reports the number of epochs we tested our model for, up to 500 iterations. The loss measured across epochs reaches its minimum at the 27th epoch, which we selected as our best model. Figure 4b shows the difference in performance between this best model and the baseline. Specifically, our binary NN (AUC = 0.88) performs better than the binomial GLM (AUC = 0.82), although both fall in the excellent performance class proposed by Hosmer and Lemeshow (2000). This slight difference is noticeable in Fig. 5 where, in the central portion of the study area, the co-seismic susceptibility estimated by the baseline is slightly higher. However, the baseline also produces high susceptibility in the peripheries (e.g., north-western and south-eastern sectors) where no landslides occurred, which explains the difference in performance between the two models.

Fig. 5 Baseline (a) vs. NN (b) predictive models translated into map form. The two susceptibilities capture the overall pattern, with the main differences being due to how smoothly or abruptly the probabilities transition to the adjacent SUs

7.1.1 Binary ablation tests

Figure 6 provides an overview of the predictor importance for the binary model. This operation essentially corresponds to a Jackknife test (see Lombardo et al. 2016; Shrestha and Kang 2019), where one variable is removed at a time and the drop in performance is measured as the model loses supporting information. Therefore, important predictors induce a large performance drop. This is clearly the case for the PGA, whose removal implies a median decrease in performance of approximately 0.07 AUC, leading the model from outstanding to excellent performance. This is not surprising because the PGA represents the role of ground shaking, which is the most important variable controlling the spatial distribution of earthquake-induced landslides (e.g., Lombardo et al. 2019; Tanyaş et al. 2019). The remaining predictors do not differ significantly from one another, as they induce performance drops of similar magnitude.

Fig. 6 Ablation test run for the binary model. The AUC reported on the y-axis is measured each time a predictor variable is removed from the initial set. We stress here that the removal of a predictor implies taking away both the corresponding mean and standard deviation values per slope unit. The green line corresponds to the best binary model shown in Fig. 4a, obtained at the 27th epoch

7.2 Class HNN component

Our HNN concatenates the binary model reported above to the four landslide-size-classes. To assess whether this component performs well, as done for the binary model, we monitored the HNN model as it converged to its final form. We stress here that our HNN sequentially links the output of the binary case to the model component that probabilistically distinguishes the four landslide size classes. This is shown in Fig. 7a, where the loss estimated for the class model converged to the best solution at the 26th epoch. In Fig. 7b, this performance is further described in a separate form, with one ROC curve for each of the classes under consideration. There, the effect of the sample size can be seen across the landslide-size-classes. Class 0 and Class 1 returned smooth ROC curves, as they are constructed by using large data samples. The curve already becomes rougher for Class 2, due to the reduced number of landslides of intermediate size, and this effect is further exacerbated for the ROC curve of Class 3, where landslides of extreme size are significantly less frequent.

Fig. 7 Panel a summarizes the evolution of the class model through the epochs. Panel b offers a performance overview among the four classes

Irrespective of the specific class, our HNN suitably recognizes the four landslide size groups, with outstanding performance according to Hosmer and Lemeshow (2000).

The resulting susceptibility estimates are translated into map form in Fig. 8. There, it is possible to appreciate the different information our model provides. The probability map for Class 0 (not having a landslide) is essentially the inverse of the probability map for Class 1 (having a small landslide). This is expected because small landslides constitute the vast majority of the landslide size distribution. As a result, the two maps appear to depict inverse spatial patterns. As for the probability maps of Class 2 and Class 3, they provide the complementary information we sought when we devised this experiment. In fact, only specific slope units are highlighted with a high probability of hosting landslides of the corresponding size, and their number drastically decreases from one class to the next, as also empirically expected.

Fig. 8 Probability of landslide-class-size occurrence

7.2.1 Class ablation tests

In analogy to the analyses run for the binary component of our HNN model, we also ran a series of ablation tests for the class model. These are shown in Fig. 9, where the effect of removing each predictor is measured for the four landslide size classes separately. As in the binary ablation tests, for Classes 0 and 1, taking away the PGA (both its mean and standard deviation per SU) induced the largest drop in performance among all the predictors. Interestingly, as the landslide size class increased, the drop induced by removing the PGA drastically reduced. For instance, in the case of Class 2, removing the PGA still induces the largest drop in performance among the predictors, but the drop is almost in line with that of removing any of the others. As for Class 3, the PGA is no longer the most relevant predictor. In fact, removing any single predictor does not cause a significant drop in the modelling performance. In other words, in Class 3, no predictor dominates the contribution of the others. We believe this behavior to be linked to the fact that extremely large landslides may not be linked only to the ground motion, but may rather initiate because of very localized landscape characteristics. For instance, structural features have long been recognized as factors that control the occurrence of large landslides (e.g., Chigira and Yagi 2006; Peart 1991; Tanyas et al. 2021). This is because weakness planes can make hillslopes kinematically susceptible even under conditions of strong material properties and/or low ground shaking (Goodman et al. 1989). This makes the identification of weakness planes quite important, although it is mostly missing in regional-scale landslide susceptibility assessments (Ling and Chigira 2020), including this study, as such surveys are practically impossible to carry out over large areas. This means that, in the case of Class 3, none of the variables can individually replace the role of that missing component. However, the high modelling performance also shows that the compound effect of all variables acts as a proxy to identify those mapping units associated with the largest landslides.

In fact, a closer look at the effect of predictor removal shows that, for instance, VRM is one of the variables responsible for the largest drop in performance. We recall here that the VRM expresses how rough the terrain is, which several studies have linked to the rock mass strength (e.g., Tanyaş et al. 2019). In other words, when the roughness is high, the landscape is likely made of rock masses with higher strength, whereas low roughness may imply soft materials that drape over the landscape, accommodating the deformation induced by local tectonic regimes in the form of hills rather than steep scarps. Considerations of this type may explain the role of VRM with respect to extremely large landslides, as their failure initiation may be linked to weakness planes such as fractures or fissures dissecting rocky outcrops. Conversely, a landscape characterized by low VRM values mostly corresponds to gentle slopes, where the tectonic response likely gives rise to ductile rather than brittle deformations.

Fig. 9 Ablation test run for each landslide size class estimated via the class model. For each class, the AUC reported on the y-axis is measured each time a predictor variable is removed from the initial set. We stress here that the removal of a predictor implies taking away both the corresponding mean and standard deviation values per slope unit. The horizontal lines correspond to the best class model shown in Fig. 7a, obtained at the 26th epoch

7.3 Loss ratio consideration

To complete the description of the proposed HNN, we report the results of additional tests run to assess how the binary and class components relate to each other in terms of performance.

Unsurprisingly, as shown in Fig. 10a, the AUC value produced by an output when it is turned off is substantially below the value produced when it is turned on. In other words, we limited the influence of one model (binary) on the other (class) while checking the resulting performance drop, and vice versa. For instance, the AUC value of the binary output was initially less than 0.5, meaning that it performed worse than a random selection. The remaining results are more similar to one another and fall in the 0.86 to 0.88 range. However, as we zoom in towards the upper bound of the AUC distribution, Fig. 10b unveils an interesting behavior. The maximum performance of the binary output occurred when the weightings were 10 and 0 for the binary and class outputs, respectively. This means that the presence of the class output only hindered the performance of the binary output. More interestingly, this was not the case for the class output. The second-worst performance of the class output occurred when the binary output was turned off, better only than when the class output itself was turned off. In fact, as the class output's weight decreases over subsequent iterations, its performance generally improves, reaching a maximum on the antepenultimate iteration of the eleven trials, when the weights are 8 and 2 for the binary and class outputs, respectively.

This demonstrates that the presence of the binary output can be leveraged to improve the performance of the class output. The implications of this finding are that the model should be tuned differently depending on the desired output. When making predictions using the binary output alone, the class output should be turned off, whereas when making predictions with the class output, the setting should be considered much more carefully.
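The weighting scheme behind these trials can be sketched as follows. This is an illustrative reading of the text rather than the training code of this study: it assumes that the two weights always sum to 10, that the eleven trials step the binary weight from 0 to 10, and that the total loss is a weighted sum of the two output losses. The function names are hypothetical.

```python
from typing import List, Tuple


def weight_schedule(total: int = 10) -> List[Tuple[int, int]]:
    """Return (binary_weight, class_weight) pairs for total + 1 trials.

    Assumption: the class weight decreases by one at each trial while the
    binary weight increases by one, so that both always sum to `total`.
    """
    return [(w_binary, total - w_binary) for w_binary in range(total + 1)]


def combined_loss(binary_loss: float, class_loss: float,
                  w_binary: float, w_class: float) -> float:
    """Weighted sum of the two output losses; a weight of 0 turns an output off."""
    return w_binary * binary_loss + w_class * class_loss


schedule = weight_schedule()
first_trial = schedule[0]             # binary output turned off
antepenultimate_trial = schedule[-3]  # trial where the class output peaked
last_trial = schedule[-1]             # class output turned off
```

Under these assumptions, the antepenultimate trial carries weights 8 and 2 and the last trial carries weights 10 and 0, matching the two configurations singled out in the text. In frameworks that expose per-output loss weights, the same schedule could be passed to the corresponding option, although the mechanism actually used is not stated here.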

Fig. 10

a General overview of how the binary and class models interacted with each other; b detail of the same overview in the top AUC range

8 Discussion

The model we propose is the first of its kind in the literature, for it combines the prediction of which slopes are susceptible to fail in response to a ground motion disturbance with the prediction of which size class the landslides are expected to fall into, per slope unit. When we originally thought of implementing this model, we envisioned the landslide size component to be modeled along the lines of Gallen et al. (2017) and Lombardo et al. (2021). Gallen and co-authors proposed the first model able to spatially predict landslide sizes via Newmark sliding block analysis, a physically-based model. Lombardo and co-authors proposed something analogous but framed in the context of statistical modeling. In this work we took inspiration from the second example, where the landslide size was modeled by using the continuous representation of the planimetric areas transformed into the logarithmic scale. Here, we initially envisioned a Hierarchical Neural Network capable of jointly modeling landslide occurrences and sizes, further improving on previous research by featuring a hierarchical model and also by keeping the planimetric area in its original square-meter scale. With this idea in mind we ran a number of tests but, likely due to the heavy-tailed distribution of the landslide sizes, our HNN struggled to reproduce the whole distribution. The binary component was never an issue, as also demonstrated by the number of successful ANN applications in the context of landslide modeling (Conforti et al. 2014; Amato et al. 2021a, b). As for the landslide size component, due to the long-tailed shape of the landslide planimetric distribution, we opted for a solution with a dual advantage. We used a common solution to this type of problem, slicing the distribution into four bins (Goel et al. 2010) and adopting a multi-label classifier to take on the task of probabilistically recognizing each bin.
In addition to the fact that this type of solution is particularly common, we also exploited the added value brought by the Neural Network architecture itself. In fact, hierarchical models of this type are well studied within the artificial intelligence community, allowing us to bring years of accumulated experience to a problem that had not yet been addressed within the geomorphological community.
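The binning step mentioned above can be sketched in a few lines of Python. The bin edges below are placeholders chosen only for illustration and are not the thresholds adopted in this study; the function names are likewise hypothetical, and the class indices 0 to 3 are assumed to follow increasing landslide size.

```python
from bisect import bisect_right
from typing import List, Sequence

# Hypothetical upper bin edges in square meters (NOT the thresholds of this study).
EXAMPLE_EDGES_M2 = [1_000.0, 10_000.0, 100_000.0]


def size_class(area_m2: float, edges: Sequence[float] = EXAMPLE_EDGES_M2) -> int:
    """Map a planimetric area to a class index in 0..len(edges).

    Three sorted edges slice the heavy-tailed distribution into four bins;
    an area equal to an edge falls into the higher class.
    """
    return bisect_right(edges, area_m2)


def one_hot(class_index: int, n_classes: int = 4) -> List[int]:
    """Encode a class index as a target vector for the classifier."""
    return [1 if i == class_index else 0 for i in range(n_classes)]


# Invented areas for four slope units, spanning several orders of magnitude.
example_areas = [500.0, 1_000.0, 50_000.0, 500_000.0]
example_classes = [size_class(a) for a in example_areas]
example_targets = [one_hot(c) for c in example_classes]
```

Replacing the continuous, long-tailed target with a small number of ordered bins means that the rare, extremely large landslides form a class of their own instead of appearing as outliers in a regression.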

This structure produced outstanding results across the two model components (binary and class), as shown in Figs. 4 and 7. This observation raises the question of whether models of this type should represent the next direction to take when modeling natural hazards. In fact, especially within the geomorphological community, the vast majority of applications are confined to predicting only landslide occurrences (i.e., susceptibility assessments). Much more rarely, the community extends its reach to assessing the hazard associated with a potential landslide occurrence. This happens because the definition of landslide hazard features temporal and size components, which are not easily modeled. The temporal element is rarely accounted for because, until the appearance of semi-automated mapping routines, monitoring a given area and manually mapping multi-temporal inventories was a very difficult and expensive task (see explanation in Guzzetti et al. 2012). As for the dimension of the landslides, for decades the community has focused on estimating the landslide event magnitude (Malamud et al. 2004; Tanyaş et al. 2018), a single measure valid for a large population of landslides. Only recently, Lombardo et al. (2021) have proposed a model that spatially estimates how prone slopes may be to release landslides of specific sizes. The combination of the landslide susceptibility and size elements proposed in this work belongs to currently uncharted territory, and its use is not featured in any international guidelines. However, we believe that the type of information compressed in Fig. 8 can be really important to support decision makers and, more generally, territorial management practices. Knowing, or at least expecting, which slope is more likely to release an extremely large landslide is information that could, at the very least, save money when investing in slope stabilization. At its best, it could help minimize the risk to human lives.

It is fair to point out that models of this type are still in their infancy within the natural hazard community. Therefore, their potential success relies on further testing before even considering making them part of any well-established procedure. For this reason, we are sharing the data as well as the codes we wrote during this experiment. In this way, we intend to promote repeatability and reproducibility of analogous experiments. For instance, we have so far developed this tool in the framework of co-seismic landslides. However, it is still unclear whether the same HNN is applicable to event-based rainfall-induced landslides. Likewise, further tests need to be run on multi-temporal inventories.

9 Concluding remarks

We present the first hierarchical data-driven model able to simultaneously estimate where landslides may occur and to which landslide size class they may belong. We believe this model to be particularly suited for landslide hazard assessment. We recognize that the proposed model does not yet yield a full landslide hazard assessment, as it is blind to the expected temporal frequency of such landslides. It should also be recognized, however, that the proposed model provides significantly more information than existing landslide susceptibility models. What makes this model particularly appealing is that its architecture hierarchically links the binary and the class elements. In other words, all the results reported in this manuscript are produced via a single data-driven model.

In addition to future tests of the very same model, which we hope to stimulate by sharing data and codes with the readers in a GitHub repository, we expect to explore additional variations of the present scheme. For instance, the simplest iterations we envision would reproduce similar tests on rainfall-induced landslides or on geomorphological inventories that cover a large time span rather than being linked to a single event. We are also already planning to extend the complexity even further. In fact, within the very same hierarchical structure, we could also add the type of landslides in between the binary and the class components. This could provide a complete description of the landslide process, which would need to be “just” extended in time to fully solve any requirement reported in the standard landslide hazard definition.

10 Software and data availability

In the hope of promoting similar applications, we have shared data and codes in a GitHub repository accessible at: click here.

The original landslide inventory can also be downloaded from the global repository of earthquake-induced landslides, accessible at: click here.