Comparing methods for determining landslide early warning thresholds: potential use of non-triggering rainfall for locations with scarce landslide data availability

Rainfall intensity-duration landslide-triggering thresholds have become widespread in the development of landslide early warning systems. In principle, thresholds can be determined from rainfall event datasets of three types: (a) rainfall events associated with landslides (triggering rainfall) only, (b) rainfall events not associated with landslides (non-triggering rainfall) only, and (c) both triggering and non-triggering rainfall. In this paper, through Monte Carlo simulation, we compare these three possible approaches with respect to three statistical properties: robustness, sampling variation, and performance. We find that methods based only on triggering rainfall can be the worst with respect to all three properties. Methods based on both triggering and non-triggering rainfall perform best, as they can be built to provide the best trade-off between correct and wrong predictions; they are also robust, but still require a fairly large sample to sufficiently limit the sampling variation of the threshold parameters. On the other hand, methods based on non-triggering rainfall only, which are mostly overlooked in the literature, offer good robustness, low sampling variation, and performances that can often be acceptable and better than those of thresholds derived from triggering events only. Using solely triggering rainfall, which is the most common practice in the literature, yields thresholds with the worst statistical properties, except when there is a clear separation between triggering and non-triggering events. Based on these results, methods based only on non-triggering rainfall deserve wider attention. They also have a practical advantage: they can in principle be used where limited information on landslide occurrence is available (e.g., newly instrumented areas).
The fact that relatively large samples (about 200 landslide-triggering events) are needed for a sufficiently precise estimation of threshold parameters when using triggering rainfall suggests that threshold determination in future applications may start by identifying thresholds from non-triggering events only, and then move to methods that also consider triggering events as more landslide information becomes available.


Introduction
Landslides induced by rainfall are quite frequent in mountainous areas and represent a threat to property and population (Froude and Petley 2018). Landslide early warning systems (LEWS) are a mitigation measure that, if coupled with adequate emergency plans, can save lives (Guzzetti et al. 2020). The analysis of the conditions, in terms of precursor characteristics, under which landslides are foreseen is essential for LEWS. It is customary to consider only precipitation as a precursor and to derive rainfall thresholds based on two characteristics: rainfall event duration (D) and (mean) intensity (I) or cumulated depth, also known as event rainfall (E = I×D) (Caine 1980; Guzzetti et al. 2008). Most works consider a power-law threshold, derived from the analysis of rainfall events associated with past landslide occurrence (Brunetti et al. 2010; Segoni et al. 2018). Some more recent works consider both triggering and non-triggering events, in order to have a complete assessment of the performances of the derived threshold in terms of the so-called confusion matrix, i.e., the count of true/false positives/negatives (Berti et al. 2012; Staley et al. 2013; Peres and Cancelliere 2014; Postance et al. 2018; Marino et al. 2020). A third possibility, which is indeed seldom considered, is to use the non-triggering events only. To the knowledge of the authors, this method has been discussed only in the gray literature (http://www.cswcs.org.tw/AllDataPos/DownLoadPos/Root3/1.3-Glade_121106-print.pdf, last accessed 16 April 2021) or just as a way to provide an upper bound of the uncertainty range related to threshold identification (Zêzere et al. 2015). This method may be particularly useful when an area is potentially prone to landslides but only a few cases have been recorded. Prone areas with little or no recorded landslide events may also emerge as a result of susceptibility analysis (Reichenbach et al. 2018).
In fact, a landslide may occur in an area where adequate rainfall records started to be collected only a few decades ago, covering a period in which likely no other landslides have been recorded. In this case, the only available information on landslide occurrence is that, presumably, none of the previously recorded rainfall events has triggered landslides. Still, protecting the area with an early warning system may be useful as a landslide mitigation measure and/or be requested by those responsible for managing landslide risk (e.g., civil protection agencies).
In this paper, we analyze the three mentioned methods and compare them with respect to three statistical properties: robustness, sampling variation, and performance in terms of the confusion matrix. Robustness measures how the determination of the threshold is affected by the presence of errors. To this end, we consider errors induced by exchanging triggering events with non-triggering ones and vice versa, as this may occur due to the incompleteness of landslide inventories or the inappropriate attribution of a given rainfall event to a landslide (Gariano et al. 2015). It should be mentioned that many other sources of uncertainty exist in threshold determination, which have been the subject of other studies (Nikolopoulos et al. 2014, 2015; Marra et al. 2016, 2017; Destro et al. 2017; Rossi et al. 2017; Marra 2019; Gariano et al. 2020), whereas the erroneous attribution illustrated here may be considered not quantified so far.
Sampling variation here refers to the fact that threshold parameters are usually estimated from a limited sample, while the true values of the threshold parameters are unknown in the real world. Monte Carlo simulation allows simulating repeated sampling at a given sample size and computing the sampling distribution. The standard deviation of the sampling distribution, known as the standard error, can then be plotted as a function of sample size, to investigate how sensitive a threshold is to the given sample at hand. In other terms, sampling variation provides the precision of an estimate of the threshold parameters for a given sample size. This may help answer the question: how many landslide-triggering storms constitute a statistically significant sample for threshold determination?
Regarding performances, it is customary to use the confusion matrix or receiver operating characteristics (ROC) analysis, i.e., to evaluate the quality of the predictions related to a threshold in terms of the corresponding frequency of false/true positives/negatives, often summarized by metrics such as the true skill statistic (Peirce 1884).
Our analysis is based on synthetic rainfall-landslide data generation through a Monte Carlo simulation approach, combining stochastic generation of rainfall events with physically based infiltration and slope stability models, presented in our previous work (Peres and Cancelliere 2014, 2018) and calibrated and validated for typical hillslopes prone to landslides in the Peloritani Mountains area, Sicily, Italy.

Methods and data
Threshold estimation methods

We refer to the case of power-law intensity-duration thresholds:

I = aD^b

where I and D are rainfall event intensity and duration, respectively, while a and b are the two parameters to be estimated from the data. In our scheme, as usual, we consider that an alarm for possible landslide occurrence is issued when I ≥ aD^b. The parameter estimation methods that we compare are the following (Fig. 1):

1. Frequentist P, i.e., the so-called frequentist method (Brunetti et al. 2010; Peruccacci et al. 2012) applied to rainfall events associated with landslides (triggering rainfall only, "positives")
2. Frequentist N, the frequentist method applied to rainfall events not associated with landslides (non-triggering rainfall only, "negatives")
3. Frequentist PN, namely a method using both triggering and non-triggering events (positives and negatives) and an optimization procedure applied to an indicator based on the confusion matrix

The Frequentist P method is widely described and used in previous works. It is the most commonly adopted method for threshold identification (Piciullo et al. 2017), and software has been made available to apply it to any suitable case study (Melillo et al. 2018). Here, we recall that it basically consists in (i) linear regression of the data in the double-logarithmic plane, where only triggering events are considered, and (ii) parallel translation of the regression line so that the probability of non-exceedance equals a fixed (low) value. In performing the analysis, the residuals of the regression are considered to be normally distributed with zero mean (cf. Fig. 1). For Frequentist P, it is customary to choose a 5% probability of non-exceedance by triggering events. Conversely, for Frequentist N, we consider a threshold with a 5% probability of exceedance by non-triggering events.
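The two single-class procedures just described can be sketched as follows. This is a minimal Python illustration (not the authors' MATLAB code); `numpy` is assumed and the function name `frequentist_threshold` is hypothetical:

```python
import numpy as np

def frequentist_threshold(D, I, exceed_prob=0.05, side="below"):
    """Fit I = a * D**b by least squares in log-log space, then shift the
    intercept so that only `exceed_prob` of the points lie on the given
    side: "below" for triggering events (Frequentist P), "above" for
    non-triggering events (Frequentist N)."""
    logD, logI = np.log10(D), np.log10(I)
    b, log_a = np.polyfit(logD, logI, 1)       # regression slope and intercept
    residuals = logI - (log_a + b * logD)      # zero-mean regression residuals
    if side == "below":                        # 5% non-exceedance by positives
        shift = np.quantile(residuals, exceed_prob)
    else:                                      # 5% exceedance by negatives
        shift = np.quantile(residuals, 1.0 - exceed_prob)
    a = 10.0 ** (log_a + shift)
    return a, b
```

For Frequentist P the quantile shift lowers the regression line until only 5% of triggering points lie below it; for Frequentist N it raises the line until only 5% of non-triggering points lie above it.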
For the third method, we search for the threshold that maximizes a given function of the confusion matrix, representing the trade-off between correct and wrong predictions. For this purpose, we apply the particle swarm optimization algorithm implemented within MATLAB®, so as to find the global maximum of the true skill statistic (Wilks 2006):

T_S = F(I > aD^b | P) − F(I > aD^b | N)

where F(I > aD^b | P) is the true positive rate or ROC sensitivity, an estimate of the conditional probability of predicting a landslide (a positive, P) given that one occurred, and F(I > aD^b | N) is the false positive rate, an estimate of the conditional probability of predicting a landslide given that no landslide (a negative, N) has been observed. For a threshold that perfectly separates triggering and non-triggering rainfall events, T_S = 1, while T_S = 0 for a threshold performing no better than a random guess. Indeed, the three illustrated methods can all be framed in terms of the last one (Frequentist PN), as special cases where only one column of the confusion matrix is taken into account (see Fig. 2). This consideration already provides some insight into the possible performances of the methods: Frequentist P aims at obtaining a fixed (low) probability of false negatives (missed alarms), while Frequentist N aims at obtaining a fixed (low) probability of false positives (false alarms). Hence, there is no control of the false positives and of the false negatives, respectively; this research thus also aims at understanding which tendencies can be derived for the part of the confusion matrix that is not explicitly taken into account (see the "Results and discussion" section). On the other hand, by construction, the threshold from Frequentist PN, built using all elements of the confusion matrix, takes into account the trade-off between false and true predictions.
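The T_S computation and its maximization can be sketched as follows, as a hedged Python illustration in which a brute-force grid search stands in for the particle swarm optimization used in the paper; the function names are ours:

```python
import numpy as np

def true_skill_statistic(a, b, D, I, triggered):
    """T_S = true positive rate - false positive rate for the alarm rule
    I >= a * D**b; `triggered` is a boolean array of landslide labels."""
    alarm = I >= a * D ** b
    tpr = np.mean(alarm[triggered])      # F(I > a D^b | P), ROC sensitivity
    fpr = np.mean(alarm[~triggered])     # F(I > a D^b | N), false alarm rate
    return tpr - fpr

def maximize_ts(D, I, triggered, a_grid, b_grid):
    """Exhaustive search for the (a, b) pair with the highest T_S;
    a simple stand-in for a global optimizer such as particle swarm."""
    best = (None, None, -np.inf)
    for a in a_grid:
        for b in b_grid:
            ts = true_skill_statistic(a, b, D, I, triggered)
            if ts > best[2]:
                best = (a, b, ts)
    return best
```

Note that several (a, b) pairs may reach the same T_S when the classes are well separated, so the returned pair need not be unique.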
Another noteworthy aspect is that the number of negatives (non-triggering rainfall events) is generally an order of magnitude greater than the number of positives (see Table 2). This has an impact on the investigated properties of the three analyzed methods.

Assessment of statistical properties
Triggering and non-triggering events are generated via the Monte Carlo simulation approach developed in our previous work, as cited in the "Introduction"; some more details are recalled in the "Synthetic data generation" section. What is relevant here is that Monte Carlo simulation allows generating a virtual sequence of rainfall events as long as desired, and then dichotomizing the events into triggering and non-triggering based on slope stability analysis. The generated events can then be analyzed according to the three methodologies illustrated in the previous section. For each sample size M (from 2000 to 100,000 rainfall events), n = 30 realizations are drawn, and thresholds are determined for each sample, so as to explore sampling variation. This variation consists in the dispersion of threshold parameter estimates obtained from different samples of the same size (Everitt and Skrondal 2010). Sampling variation is then visualized by plotting the thirty thresholds for each sample size, as well as the normalized standard errors of threshold intercept and slope as functions of sample size. The normalization is made with respect to the reference threshold parameter values obtained using the entire dataset of 3,000,000 generated rainfall events (see the "Synthetic data generation" section).
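The sampling-variation experiment can be sketched as follows, as a Python illustration under our reading of the procedure. Here `events` is an array of (D, I) pairs for one event class, `estimate_threshold` is any fitting procedure returning (a, b), and all names are illustrative, not the authors':

```python
import numpy as np

def normalized_standard_errors(events, estimate_threshold, ref_params,
                               sample_sizes, n_realizations=30, seed=0):
    """For each sample size M, draw n realizations without replacement from
    the synthetic event pool, estimate (a, b) on each, and return the
    standard errors of the estimates normalized by the reference values
    obtained from the full dataset."""
    rng = np.random.default_rng(seed)
    a_ref, b_ref = ref_params
    out = {}
    for M in sample_sizes:
        params = np.array([
            estimate_threshold(*events[rng.choice(len(events), size=M,
                                                  replace=False)].T)
            for _ in range(n_realizations)])
        out[M] = (params[:, 0].std(ddof=1) / a_ref,         # SE of a, normalized
                  params[:, 1].std(ddof=1) / abs(b_ref))    # SE of b, normalized
    return out
```

Plotting the returned values against M reproduces the standard-error-versus-sample-size curves discussed above, with the expected roughly 1/sqrt(M) decay.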
Concerning robustness, it must be acknowledged that somewhat different meanings are attached to this concept, so it is useful to recall a standard definition. According to the Oxford Dictionary of Statistics (Upton and Cook 2008), robustness measures how "the outcome of a statistical procedure is affected by the presence of a small number of unusual or incorrect values." Here, to measure robustness, we consider a "polluted" dataset in which p % of the triggering rainfall events are randomly reassigned as non-triggering. In this way, we mimic the effect of landslide events missing from inventories. At the same time, an identical number of non-triggering rainfall events is randomly reassigned as triggering, to emulate the inappropriate attribution of rainfall events to a landslide, which may occur mainly because the nearest available rain gauge records a rainfall event that differs from the one at the known location of the landslide, or because of imprecise knowledge of the triggering instant. Here, we consider the case of p = 10%. This scheme keeps the number of triggering and non-triggering events unchanged for a given simulation. For clarity, note that it would have been unrealistic to take p % of the non-triggering rainfall events and attribute them to landslides, as the number of erroneous triggering events would then generally exceed the number of true landslide events (cf. the end of the "Threshold estimation methods" section and Table 2).
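The pollution scheme can be sketched as follows; this is a minimal Python illustration of the label-swapping described above, with a hypothetical function name:

```python
import numpy as np

def pollute_labels(triggered, p=0.10, seed=0):
    """Reassign a fraction p of the triggering events as non-triggering
    (mimicking incomplete inventories) and an equal number of non-triggering
    events as triggering (mimicking wrong rainfall-landslide attribution).
    Class counts are left unchanged by construction."""
    rng = np.random.default_rng(seed)
    polluted = triggered.copy()
    pos = np.flatnonzero(triggered)          # indices of triggering events
    neg = np.flatnonzero(~triggered)         # indices of non-triggering events
    n_swap = int(round(p * pos.size))
    polluted[rng.choice(pos, n_swap, replace=False)] = False
    polluted[rng.choice(neg, n_swap, replace=False)] = True
    return polluted
```

Thresholds re-estimated on the polluted labels can then be compared with those from the error-free labels, as in Fig. 5.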
Regarding the performance of thresholds, as discussed in the "Threshold estimation methods" section, some entries of the confusion matrix are fixed by construction by the methods (with respect to their own calibration sample). Nevertheless, for the Frequentist P and Frequentist N methods, the false positive rate and the true positive rate, respectively, remain unassigned, so it is still useful to plot the true skill statistic T_S, which can be used as a single performance metric combining the four entries of the confusion matrix. For a more rigorous assessment of this aspect, T_S is computed using the thresholds estimated on each sample size against the reference complete dataset of 3,000,000 rainfall events. In fact, the value of T_S computed in this way, given that the scatter plot of both triggering and non-triggering events remains the same, is a measure of where the threshold is placed within the I-D plane. This usefully complements the information given by the variation of the standard error of the threshold parameters with sample size. It is known that two thresholds with different a and b may nevertheless be located quite close to each other in the I-D plane, as a difference in the intercept can be compensated by a difference in the slope. Also, due to the logarithmic scale, thresholds whose intercept parameters differ quite significantly can be very close in the log(I)-log(D) plane (see for instance Fig. 4 and Table 2, PN vs. P threshold for the case of τ = 0).

Synthetic data generation
The stochastic rainfall model and the infiltration and slope stability model are calibrated based on climate, geomorphological, hydrological, and geotechnical characteristics of hillslopes in the Peloritani Mountains area, Sicily, Italy (Fig. 3). The area has been described in many other studies (De Guidi and Scudero 2013; Schilirò et al. 2015; Stancanelli et al. 2017; Cama et al. 2017), and a detailed description of how the models have been calibrated and validated is presented in our previous work (Peres and Cancelliere 2014). Hence, we illustrate the approach for data generation only briefly; the reader is referred to the cited study for more details.
The stochastic model generates rainfall events that have a duration D and a constant intensity I = E/D. Total event depth E and duration D are generated through a bivariate distribution obtained via a copula approach, while storm interarrivals U are generated from a separate, independent probability distribution. Generated rainfall events are separated by at least 24 h of no rain. The generated rainfall data are input to the infiltration model, an extension of the Iverson (2000) diffusive model to a finite soil depth, based on a linearization of Richards' vertical infiltration equation. This solution is implemented within TRIGRS v.1 (Baum et al. 2010); we developed our own MATLAB® code to better couple the stochastic model with the infiltration model. The infiltration model provides the pressure head used to compute the factor of safety according to the infinite slope equation. The initial conditions for each rainfall event are computed by a drainage model that takes the maximum pressure head computed for the previous rainfall event and makes it decay according to a negative exponential law with time constant τ, resembling the linear reservoir model for sub-horizontal drainage. More complex models could have been applied, such as those that consider variably saturated conditions during infiltration, as done by the more recent TRIGRS v.2 software and its parallel implementation (Alvioli and Baum 2016; Palazzolo et al. 2021). However, the diffusive model based on TRIGRS v.1 applied here has proven sufficiently reliable for many cases, including the area considered in this study, and allowed us to run the high number of simulations required in a reasonable computation time.
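The structure of the event generator can be sketched as follows. This Python illustration uses a Gaussian copula with lognormal marginals and exponential interarrivals purely as an example; the distribution families, parameter values, and function name are illustrative assumptions, not the calibrated model of the paper:

```python
import numpy as np

def generate_events(n, rho=0.6, seed=0):
    """Sketch of the stochastic event generator: duration D and depth E are
    drawn from lognormal marginals linked by a Gaussian copula with
    correlation rho; interarrivals U are independent and exponential, and
    events are kept at least 24 h apart. All parameters are illustrative."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    D = np.exp(2.0 + 0.8 * z[:, 0])            # event duration (h)
    E = np.exp(3.0 + 1.0 * z[:, 1])            # event depth (mm)
    U = rng.exponential(scale=180.0, size=n)   # interarrival time (h)
    U = np.maximum(U, 24.0)                    # enforce >= 24 h of no rain
    return D, E / D, U                         # duration, mean intensity, interarrival
```

Each generated (D, I) pair would then be fed to the infiltration and slope stability model, with U controlling the drainage of antecedent conditions between events.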
The stochastic rainfall model has been calibrated on the hourly rainfall data observed at Fiumedinisi over a period of about 9 years (from 5 January 2002 to 23 February 2011). This rain gauge is near the Giampilieri area, where on 1 October 2009 a regional slide-to-flow event occurred, killing 37 people. Rainfall model calibration has been performed separately for three statistically homogeneous periods of the year (JFMA, MJJA, SOND), leading to the distributions and parameters reported in Table 2 of the cited study. The average number of rainfall events per year is 45.36, which means that 1000 events correspond to about 22 years of recordings.
Regarding the hydrologic and geomechanical properties for the landslide model, these are shown in Table 1.
The value of τ for the drainage model can also be varied, so as to explore the possible relevance of the antecedent-rainfall memory of the hillslope. The smaller τ is, the faster the soil drains during dry intervals between rainfall events, and the less important antecedent rainfall is for landslide occurrence. In the limit case of τ = 0, antecedent rainfall has no importance and I and D are the only factors influencing landslide triggering (ideal situation). We have considered three cases, τ = 0, τ = τ0 = 2.75 days, and τ = 2τ0 = 5.49 days, performing a sensitivity analysis with respect to the structural uncertainty of the I-D model, as this parameter controls the degree of confusion between triggering and non-triggering rainfall events described in terms of the considered predictors I and D. In the case of τ = 0, there is a curve that perfectly separates triggering and non-triggering events, though it differs slightly from a power law (a straight line in the log D-log I plane). Figure 4 compares the thresholds for a dataset of 3,000,000 rainfall events, which are assumed to be the true thresholds for each method and are used to compute the normalized standard error of the threshold parameters for varying sample size (cf. the "Methods and data" section). Table 2 shows, for each value of the recession constant τ, the ratio between the number of landslides and the number of rainfall events, and the thresholds corresponding to the entire datasets.
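The role of τ in setting the antecedent conditions can be made concrete with a small sketch. The function below implements the negative-exponential decay described above in Python (the function name and the hour-based interface are our illustrative choices):

```python
import numpy as np

def initial_pressure_head(psi_max_prev, dry_hours, tau_days):
    """Antecedent-condition model: the maximum pressure head of the previous
    event decays exponentially over the dry interval with time constant tau,
    resembling a linear reservoir for sub-horizontal drainage. tau = 0 is the
    ideal limit of complete drainage (no antecedent-rainfall memory)."""
    if tau_days == 0:
        return 0.0                       # soil fully drained before each event
    return psi_max_prev * np.exp(-dry_hours / (24.0 * tau_days))
```

With τ = τ0 = 2.75 days, a dry interval of 2.75 days leaves a fraction e⁻¹ ≈ 37% of the previous maximum pressure head; doubling τ roughly doubles the memory of antecedent rainfall and hence the degree of confusion between triggering and non-triggering events in the I-D plane.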

Main case: intermediate structural uncertainty (Peloritani Mountains)
We present the results for the case of τ = τ0, which represents the intermediate level of structural uncertainty and the value that can be deemed representative of the Peloritani Mountains case study. From Fig. 5, it can be seen that the Frequentist P method is the least robust. Errors tend to lead to an underestimation of the threshold parameters, which would further increase the already high false alarm rates typical of this method. The Frequentist PN is more robust, but less so than the Frequentist N. This happens because the former method is still affected by the errors in the triggering rainfall data, while the latter is not significantly affected, as the impact of errors on non-triggering rainfall is lower in a relative sense. The same figure also provides some insight into sampling variation (the width of the area spanned by the black lines). The low sampling variation of the Frequentist N is evident. However, for a better comparison of the Frequentist P and PN methods, it is still useful to see how the estimation of the individual threshold parameters is affected by sample size (Fig. 6). As can be seen from the plots of normalized standard error versus sample size, for both parameters the Frequentist PN is generally much less variable than the Frequentist P. Starting from a sample size of 30,000 rainfall events, both methods have very close normalized standard errors for the intercept parameter.
Regarding the performances, as already mentioned, the PN method performs best, by construction (Fig. 7). What is interesting to see is that the Frequentist N tends to provide a performance closer to the best trade-off between correct and wrong predictions than the Frequentist P. It may be true that this last method is cautionary, as it reduces false negatives (potentially missed alarms) as much as possible, but it is also true that its false positive rate is so high that it can induce distrust in an early warning system eventually built on a so-determined threshold (cry-wolf syndrome). This plot also gives additional information on the sampling variation of the thresholds; with respect to Fig. 6, it reflects the variability of the position of the threshold taking both threshold parameters into account simultaneously. In fact, as already mentioned at the end of the "Assessment of statistical properties" section, for a given set of parameters, another threshold with a higher intercept may have a steeper (negative) slope that compensates the intercept and makes it quite similar to the given threshold. The T_S variability does not suffer from this issue and reflects the actual position of the threshold in the I-D plane. It can be seen that performances become stable for the PN method starting from 15,000 rainfall events (about 100 landslide-triggering storms), while for the P method the variability of the threshold position starts to reduce only from 30,000 rainfall events (about 200 landslide-triggering storms, cf. Table 2). These minimum data availability requirements, which moreover refer to an area with homogeneous landslide phenomenology, are not met by several past studies focused on threshold determination: to give just one example, the Caine (1980) global threshold was based on only 73 landslide-triggering storms (non-triggering events were not considered).

Sensitivity analysis for different levels of structural uncertainty
The results presented vary with the level of structural uncertainty, as can be shown from analogous plots for τ = 0 and τ = 2τ0. The first case (Figs. 8, 9, and 10) is an "ideal" situation, corresponding to negligible structural uncertainty for the I-D model; both the Frequentist PN and Frequentist N methods are very robust, unlike the Frequentist P, which again proves too sensitive to the presence of errors in the dataset (Fig. 8). These errors once more induce a very large underestimation of the threshold, which may cause a dramatic increase in false positives. Regarding the sampling variation of the parameters, Frequentist PN and Frequentist P have a low sampling variation (Fig. 9). However, when looking at the performances (Fig. 10), clearly due to the low structural uncertainty of the I-D model in this case, the Frequentist P method performs as well as the Frequentist PN, while the Frequentist N has slightly lower performances. This last method still performs well and represents a cautionary option, as it does not needlessly leave any triggering events below the threshold (no false negatives), which for the Frequentist P remain, by construction, 5% of the total positives.
In the case of an increased level of structural uncertainty (recession constant τ = 2τ0), robustness improves for both the Frequentist P and PN methods (Fig. 11). This occurs when, as in this case, the impact of the errors is comparable with the level of structural uncertainty. On the other hand, and differently from the two cases analyzed above, the sampling variation of the threshold intercept for the Frequentist P method is lower than that for the Frequentist PN, while for the threshold slope the normalized standard errors of the two methods have similar values (Fig. 12). Nevertheless, when looking at the plot of the true skill statistic versus sample size (Fig. 13), it can be seen that for the PN method the dispersion of T_S is much lower than for the P method. This again means that, while individual parameters may vary widely for the PN method, the position of the thresholds remains quite stable in this case, and thus the sampling variation of the threshold as a whole is dramatically lower for the PN and N methods than for the P method. The same plot also shows that the performances of the P method are dramatically lower than those of the other two methods.

Conclusions
In this paper, we have analyzed some relevant statistical properties of rainfall intensity-duration thresholds for landslide early warning determined by three different methods, based on the analysis of (i) triggering rainfall events only (Frequentist P), (ii) non-triggering rainfall events only (Frequentist N), and (iii) both triggering and non-triggering events (Frequentist PN). The first approach is the most commonly applied in the literature, while the second has been mentioned only by a few scholars, mainly as a way to provide an upper bound of the uncertainty range related to threshold identification. The third approach has been applied in more recent studies, and it presents the advantage of allowing the estimation of all entries of the confusion matrix, and thus the optimization of performances in terms of both true positive and false positive rates, which makes it the approach that performs best in terms of the skill in separating triggering from non-triggering events. The results show that it is also quite robust. On the other hand, the Frequentist P method is poorly robust, can potentially lead to a high number of false alarms, and appears quite unstable with respect to the specific sample at hand. For a reasonable application of the Frequentist P method, at least 200 landslide-triggering storms should be available for a given geomorphologically homogeneous area, as otherwise the estimation of the threshold parameters would be too imprecise (high sampling variation). For the Frequentist PN method, the position of the threshold can be deemed stable starting from a sample with one hundred triggering events (half of that required for the Frequentist P).
Regarding the Frequentist N method, the performed analysis shows that it is the most robust and least variable. It also seems to have greater potential than the Frequentist P method to provide performances that are a better compromise between false alarms and missed alarms. All these considerations lead to the conclusion that the Frequentist N has not received the attention it deserves as a method for determining thresholds in its own right. A practical implication of the good properties of the Frequentist N method is that it could be a valid way to derive rainfall thresholds in locations where only a few landslides have been recorded. Another point is that, given the low robustness of the Frequentist P method, very low data points in the I-D plane must be carefully checked for errors and thus possibly removed. If they are not errors in the dataset, the presence of low-intensity triggering points is a sign that intensity and duration alone may not be sufficient to obtain a threshold with the necessary reliability, and that approaches taking into account predisposing hydrological factors should be investigated (Bogaard and Greco 2016, 2018; Thomas et al. 2019, 2020; Marino et al. 2020). As an overall conclusion, the analysis we have carried out suggests that future studies on threshold determination for newly investigated areas may start using the Frequentist N method first, and then move to the Frequentist PN as more data become available, as this method delivers globally the best results of the techniques explored herein.

Fig. 5 Robustness test: for the three methods investigated, black lines represent thresholds derived from the error-free dataset, while red lines represent those derived from a polluted dataset (p = 10% of triggering events exchanged with non-triggering events, and vice versa), for sample sizes of 5000 (a) and 50,000 (b). The thresholds for the Frequentist N method vary so little that the differences are not visible in the plot.

European Commission, and the Programma Operativo Nazionale Governance e Capacità Istituzionale 2014-2020 -Programma per il supporto al rafforzamento della governance in materia di riduzione del rischio ai fini di protezione civile CUP (Program to support the strengthening of governance in the field of risk reduction for civil protection purposes). David J. Peres was supported by the post-doctoral grant on "Sviluppo di modelli per la valutazione di strategie innovative di gestione delle risorse idriche in un contesto di cambiamenti climatici" (Development of models for the evaluation of new strategies for water resources management in a changing climate).
Availability of data and material Rainfall data used in this paper can be downloaded upon request at http://www.sias.regione.sicilia.it/.
Code availability By request to the authors.

Funding Open access funding provided by Università degli Studi di Catania within the CRUI-CARE Agreement.

Conflict of interest The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.