Standards for the performance assessment of territorial landslide early warning systems

Landslide early warning systems (LEWS) can be categorized into two groups: territorial and local systems. Territorial landslide early warning systems (Te-LEWS) deal with the occurrence of several landslides in wide areas, at municipal, regional, or national scale. The aim of such systems is to forecast the increased probability of landslide occurrence in a given warning zone. The performance evaluation of such systems is often overlooked, and a standardized procedure is still missing. This paper describes a new user-friendly Excel tool for the application of the EDuMaP method, originally proposed by Calvello and Piciullo (2016). A description of indicators used for the performance evaluation of different Te-LEWS is provided, and the most useful ones have been selected and implemented into the tool. The EDuMaP tool has been used for the performance evaluation of the “SMART” warning model operating in the Piemonte region, Italy. The analysis highlights the warning zones with the highest performance and the ones that need threshold refinement. A comparison of the performance of the SMART model with other models operating in different Te-LEWS has also been carried out, highlighting critical issues and positive aspects. Lastly, the SMART performance has been evaluated with both the EDuMaP method and a standard 2 × 2 contingency table for comparison purposes. The result highlights that the latter approach can lead to an imprecise and insufficiently detailed assessment of the warning model, because it cannot differentiate among the levels of warning and the variable number of landslides that may occur in a time interval.


Introduction
Operational landslide early warning systems (LEWS) aim at reducing the loss-of-life probability by inviting stakeholders (e.g., civil protection agents, administrators, lay people) to act properly in populated areas characterized, at specific times, by an intolerable level of landslide hazard (Calvello 2017). LEWS widely differ depending on the type of landslide they address and the scale of operation, which is related to the size of the area covered by the system. Two categories of LEWS can be defined on the basis of the scale of operation (e.g., Bazin 2012): (i) local LEWS (Lo-LEWS), dealing with a single landslide system at slope scale; and (ii) territorial LEWS (Te-LEWS), dealing with multiple landslides at regional scale. The adjective "territorial" is herein preferred over the most commonly used adjective "regional" to provide a more general name for all the LEWS employed over a wide area, e.g., a nation, a region, a municipal territory, a river catchment (Piciullo et al. 2018).
In the literature, there are several proposals schematizing the structure of LEWS and highlighting the importance of the relations among different system components, as well as the role played by the actors involved in designing and managing these systems. Di Biagio and Kjekstad (2007) employ a flow chart to outline four main sequential activities for such systems: monitoring, analysis and forecasting, warning, and response. Intrieri et al. (2013), elaborating on the well-known four-elements scheme of people-centered early warning systems proposed by the UNISDR (2006), describe LEWS as the balanced combination of four different components: design, monitoring, forecasting, and education. Calvello et al. (2015) state that the objectives of LEWS should be defined by considering the scale of analysis and the type of landslides, and they represent the process of designing and managing LEWS by a wheel with four concentric rings identifying the following: the necessary skills, the activities to be performed, the means to be used, and the basic elements of the system. Calvello (2017) illustrates the components of early warning systems for weather-induced landslides within a scheme based on a clear distinction among landslide models, warning models, and warning systems, wherein a landslide model is one of the components of a warning model and the latter is one of the components of a warning system. All these schematizations highlight the fact that all the identified system components are essential for LEWS to be effective, as the failure of any component means the failure of the whole system. Indeed, early warning systems are only as good as their weakest link, as they can, and frequently do, fail for a number of reasons (Maskrey 1997).
The Hyogo Framework for Action "priority for action 2" (i.e., identify, assess and monitor disaster risks, and enhance early warning) identifies as a key activity the establishment of institutional capacities to ensure that early warning systems are subject to regular system testing and performance assessments (HFA 2005). The scientific literature reports many studies on LEWS, either addressing a single landslide at slope scale (Lo-LEWS, e.g., Pecoraro et al. 2019 and references therein) or concurrent phenomena over wide areas at municipal/regional/national scale (Te-LEWS, e.g., Piciullo et al. 2018 and references therein), yet the performance evaluation of the warning models employed within LEWS is often overlooked by system managers and researchers. Particularly for Te-LEWS, model performance is often assessed neglecting some important aspects peculiar to these systems, among which are the occurrence of concurrent multiple landslides in the warning zone; the issued warning level in relation to the landslide spatial density in the warning zone; and the relative importance attributed, by system managers, to different types of errors (Calvello and Piciullo 2016). Indeed, in the literature, only a few systems are described whose performance has been thoroughly assessed (Cheung et al. 2006; Restrepo et al. 2008; Martelloni et al. 2012; Lagomarsino et al. 2013; Calvello and Piciullo 2016; Piciullo et al. 2017a; Piciullo et al. 2017b).
A selection of indicators, available in the literature, to quantify the performance of both rainfall thresholds and EWS is presented in the following section. The paper aims at identifying the most useful ones for the performance evaluation of Te-LEWS. Moreover, the paper describes the results for the performance evaluation of the warning model adopted by the Te-LEWS operating in Piemonte, Italy (Tiranti and Rabuffetti 2010). The evaluation is based on the application of the EDuMaP method (Calvello and Piciullo 2016), considering landslides and warnings recorded in the different warning zones of the system from 2008 to 2016. The results of the performance assessment, carried out with the EDuMaP method, have been compared with the ones obtained using a 2 × 2 contingency table.
Performance assessment of territorial landslide early warning systems
Rainfall threshold validation and performance of Te-LEWS
In the last decades, rainfall thresholds for landslide occurrence were thoroughly investigated, producing several different test cases and relevant technical and scientific advances. A recent literature review on rainfall thresholds (Segoni et al. 2018a), covering the scientific articles published in journals indexed in Scopus or ISI Web of Knowledge in the period 2008-2016, highlights significant advances as well as critical issues about this topic. The main concern is the validation process, which is seldom carried out. Regrettably, only 38 papers out of 115 (33%) presented a correct validation analysis performed with an independent dataset, while 31 thresholds (27.0%) were validated using the same dataset used for calibration and 46 thresholds (40.0%) were published without any evaluation of their predictive capability. About 34% and 17% of the investigated rainfall thresholds are employed for early warning purposes in LEWS, respectively in prototype and operational systems; for 58% of such thresholds, a performance analysis has been carried out. The most adopted validation criterion is the compilation of a contingency matrix and the evaluation of performance indicators derived from that matrix. The contingency matrix is almost always computed as a 2 × 2 matrix, considering landslide and warning as dichotomous variables, neglecting both the different warning levels that can be issued by a LEWS and the multiple landslides that can occur simultaneously. Piciullo et al. (2018) and Pecoraro et al. (2019) show, respectively, that the majority of Te- and Lo-LEWS employ more than two warning levels (usually 4). In this circumstance, a performance analysis considering a 2 × 2 contingency table can lead to incomplete or wrong performance evaluations.
To solve this issue, Calvello and Piciullo (2016) proposed a method, called EDuMaP, for the performance analysis of a warning model, based on the computation of a duration matrix, to be used in place of a contingency matrix. Performance criteria and different performance indicators are applied to the computed duration matrix to evaluate the performance of the warning model.
Indicators used for rainfall threshold validation and performance evaluation of Te-LEWS
Piciullo et al. (2018) and Segoni et al. (2018a) show that the contingency matrix is the most used method for both rainfall threshold validation and performance evaluation of Te-LEWS. Many performance indicators can be derived from a contingency matrix. Table 1 summarizes the indicators employed by at least two authors for either rainfall threshold validation or performance evaluation of Te-LEWS. Indicators employing the same formula are grouped together, providing the different names used in the literature and the related references. The formulas have been homogenized, for comparative purposes, adopting the following terms to define the four elements of the contingency table: correct alert, CA; true negative, TN; false alert, FA; missed alert, MA. The efficiency index_(1), also called critical success index or threat score, differs from the efficiency index_(2) because it does not consider TN. Therefore, the values of these two indicators can be considerably different. The same is true for the odds ratio, which evaluates the ratio between positive and negative predictions and can be computed with or without considering TN (respectively identified as odds ratio_(2) and odds ratio_(1) in Table 1). In the validation process of rainfall thresholds, as well as in the performance assessment of LEWS, the number of TN, which represent the absence of both warnings and landslides, is typically orders of magnitude higher than the other terms of the contingency table. Thus, considering TN in the performance indicator can lead to an overestimation of the (computed) efficiency of the system. For this reason, the efficiency index and the odds ratio computed without TN are to be preferred in validation and performance analyses.
The efficiency index_(1) and the odds ratio_(1) are related by the following expression: 1/EI − 1/OR = 1, so it could be sufficient to select one of them in performance analyses. Furthermore, the hit rate and the missed alert rate are complementary, as are the positive predictive power and the false alert rate. Among the indicators used to quantify errors, it is worth mentioning the missed and false alerts balance, which defines the fraction of MA among the erroneous predictions and thus ranges between 0 and 1. From the perspective of reducing the number of MA, which may cause more severe negative consequences than FA, missed and false alerts balance values should be as low as possible.
The considerations above have led to the selection of 2 main performance indicators, for the alert classification criterion (criterion A in the following): (i) efficiency index_(1) and (ii) missed and false alerts balance.
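The effect of including TN, and the relation between the two indicators computed without TN, can be illustrated with a minimal sketch. Table 1 is not reproduced here, so the formulas below are assumptions inferred from the text: EI_(1) = CA/(CA+FA+MA), EI_(2) = (CA+TN)/(CA+TN+FA+MA), OR_(1) = CA/(FA+MA), and MFB = MA/(MA+FA). The function name and the counts are hypothetical.

```python
def contingency_indicators(ca, fa, ma, tn):
    """Indicators from the four cells of a 2 x 2 contingency table.

    Formulas are assumptions inferred from the text, not copied from Table 1.
    """
    errors = fa + ma
    if ca <= 0 or errors <= 0:
        raise ValueError("this sketch needs at least one CA and one error")
    return {
        "EI_1": ca / (ca + errors),              # efficiency index without TN
        "EI_2": (ca + tn) / (ca + tn + errors),  # efficiency index with TN
        "OR_1": ca / errors,                     # odds ratio without TN
        "MFB": ma / errors,                      # fraction of MA among errors
    }


# Hypothetical counts: same alerts and errors, with and without many TN.
without_tn = contingency_indicators(ca=20, fa=30, ma=10, tn=0)
with_tn = contingency_indicators(ca=20, fa=30, ma=10, tn=3000)

# The relation 1/EI - 1/OR = 1 holds for the TN-free indicators.
identity = 1 / without_tn["EI_1"] - 1 / without_tn["OR_1"]
```

With these made-up counts, EI_(1) stays at one-third regardless of TN, while EI_(2) rises close to 1 once the 3000 TN are included, which is the overestimation discussed above.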
A tool for the application of the EDuMaP method
EDuMaP is a method for the performance analysis of a warning model, based on the computation of a duration matrix, to be used in place of a contingency matrix. Performance criteria and different performance indicators are applied to the computed duration matrix to evaluate the performance of the warning model. The method is fully described in Calvello and Piciullo (2016). An Excel tool for the application of the EDuMaP method has been recently programmed in Visual Basic for Applications. The Excel spreadsheet comprises an initial "home" page and some other tabs. The left side of the home page is set to define the input data for the performance analysis and to run different subroutines, following the main structure of the EDuMaP method (Fig. 1). The right side of the home page presents the chosen performance criteria, the computed duration matrix, and the final results of the analysis in terms of performance indicators. The values of the 10 input parameters (i.e., warning levels, landslide density criterion, lead time, landslide typology, minimum interval between landslide events, over time, area of analysis, spatial discretization adopted for warnings, time frame of analysis, temporal discretization of analysis), as well as the landslide and warning datasets for the period of analysis, are defined in separate tabs. Once the datasets are inserted, it is possible to generate landslide and warning events, i.e., to group landslides and warnings on the basis of the values of the input parameters. After that, the elements of the duration matrix, d_ij, can be computed. Then, two sets of performance criteria need to be defined. Figure 1 reports the two performance criteria that will be used for the analysis presented in this paper. They are named, respectively, alert classification (criterion A) and grade of correctness (criterion B).
The first criterion (A) employs an alert classification that groups together some elements of the matrix to identify: correct alerts, CA; false alerts, FA; missed alerts, MA; true negatives, TN. The second criterion (B) assigns a color code to the elements of the matrix in relation to their grade of correctness, herein classified in four classes as follows: green, Gre, for the elements which are assumed to be representative of the best model response; yellow, Yel, for elements representative of minor model errors; red, Red, for elements representative of significant model errors; purple, Pur, for elements representative of the worst model errors. Once the two performance criteria are defined, the performance indicators can be computed, and the results are shown in both tabular and graphical formats.
The performance indicators employed in the Excel tool, and adopted in this paper, are a revised and reduced version of what has been proposed by Calvello and Piciullo (2016) and Piciullo et al. (2017a). They refer to the alert classification criterion (A), the grade of correctness criterion (B), and a mix of the two (A+B). The indicators adopted herein for criterion A have been discussed in the "Indicators used for rainfall threshold validation and performance evaluation of Te-LEWS" section. For the reasons described in that section, the computation of all the performance indicators does not include the element d_11, which represents the amount of time associated with the simultaneous absence of warning and landslide events. Table 2 shows the indicators used, their formulas, and the reference to the performance criterion considered.
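The computation of a duration matrix and its grouping under criterion A can be sketched as follows. This is not the VBA code of the Excel tool: it is a minimal illustration assuming that one warning level and one landslide event class are available per time unit, that both are coded 1 to 4 (1 meaning no warning or no landslides), and that d_ij counts the time units with warning level i and landslide class j. The function names, the input series, and above all the `CRITERION_A` mapping are hypothetical; the actual mapping is the one reported in Fig. 1.

```python
import numpy as np


def duration_matrix(warning_levels, landslide_classes, n_levels=4, n_classes=4):
    """Count time units per (warning level i, landslide class j) pair."""
    w = np.asarray(warning_levels, dtype=int)
    c = np.asarray(landslide_classes, dtype=int)
    if w.shape != c.shape:
        raise ValueError("both series must cover the same time units")
    if w.min() < 1 or w.max() > n_levels or c.min() < 1 or c.max() > n_classes:
        raise ValueError("levels and classes must be coded from 1 upwards")
    d = np.zeros((n_levels, n_classes), dtype=int)
    np.add.at(d, (w - 1, c - 1), 1)  # unbuffered add, handles repeated pairs
    return d


def aggregate(d, labels):
    """Sum the matrix elements that share the same criterion label."""
    return {str(k): int(d[labels == k].sum()) for k in np.unique(labels)}


# Hypothetical alert classification (rows: warning level, columns: class).
CRITERION_A = np.array([
    ["TN", "MA", "MA", "MA"],
    ["FA", "CA", "CA", "MA"],
    ["FA", "CA", "CA", "CA"],
    ["FA", "FA", "CA", "CA"],
])

# Seven hypothetical days of warnings and landslide event classes.
warnings = [1, 1, 2, 3, 1, 4, 2]
landslides = [1, 1, 2, 1, 3, 4, 1]

d = duration_matrix(warnings, landslides)
counts = aggregate(d, CRITERION_A)

# Efficiency index for criterion A; d_11 (the TN cell) is left out.
ei_a = counts["CA"] / (counts["CA"] + counts["FA"] + counts["MA"])
```

A grade of correctness criterion could be handled the same way by passing a second label matrix (Gre, Yel, Red, Pur) to `aggregate`.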

Case study
Piemonte "SMART" LEWS
Arpa Piemonte (the Regional Agency of Environmental Protection of Piemonte) developed its first shallow landslide early warning system in 2008 (Tiranti and Rabuffetti 2010). The LEWS, called SMART (Shallow landslides Movement Announced through Rainfall Thresholds), is based on an empirical intensity-duration (ID) model where the thresholds have been identified by back analysis, considering the relationship between historical widespread shallow landslide events that occurred in 1990 and 2002 and rainfall data recorded by the regional rain gauge network (more than 400 rain gauges distributed over an area of 25,873 km²).
SMART operates both in real-time and in forecasting mode, coherently with the setup of the Regional Warning System for Geohydrological and Flooding Risk in Piemonte (RWS) (Rabuffetti et al. 2003; Rabuffetti and Barbero 2005). SMART operates in two macro-zones of the Piemonte region, called "homogeneous zones": (i) Alps and Apennines and (ii) the hilly environment, including the Tertiary Piedmont Basin (TPB) and Torino Hill (Fig. 2a). The two zones are characterized by two different sets of thresholds (Eqs. 1 and 2), developed considering the rain gauge locations within the two zones.
Landslides 17 & (2020) 2535
Thresholds become operative for rainfall duration exceeding 12 h. Indeed, for rainfall lasting less than 12 h, threshold exceedance may indicate the probability of occurrence of other phenomena commonly triggered by short and intense rainstorms, such as accelerated soil erosion due to widespread surface runoff or channelized debris flows in small Alpine catchments. The intersection between the two homogeneous zones of Fig. 2a and the 11 warning zones of the RWS produces the warning zones (Fig. 2b) for the prediction of shallow landslides.
SMART does not employ a probabilistic approach, and therefore, an issued warning has the same degree of severity whether the threshold value is just reached or whether it is exceeded by a considerable amount. However, three levels of warnings are defined, based on an indirect estimation of the expected landslides, estimated as a function of the number of rain gauges for which the rainfall threshold is exceeded in real-time or in forecasting mode.
In addition to the no-warning condition, corresponding to a negligible probability of shallow landslide occurrences, the other warning levels are as follows: (1) yellow (isolated triggering of shallow landslides); (2) orange (diffuse but not widespread triggering of shallow landslides, equivalent to fewer than 10 landslides in a warning area); (3) red (widespread triggering of shallow landslides, equivalent to more than 10 landslides in a warning area).

Performance analysis
The performance of the warning model employed in the SMART system has been evaluated adopting the EDuMaP method (Calvello and Piciullo 2016) using the Excel tool described in the "A tool for the application of the EDuMaP method" section. The analysis was performed considering the values of the 10 input parameters reported in Table 3. Landslide events (LE) are defined, according to Calvello and Piciullo (2016), as a series of landslides grouped together based on their spatial and temporal characteristics. The performance assessment was conducted considering the landslide events (LE) and the warning events (WE) registered in Piemonte between January 2008 and December 2016 (Table 4) in 10 warning zones (from A to L). One warning area (M, see Fig. 2) was not considered since no landslides occurred in that area during the period of analysis. Figure 3a and b show the results obtained for the ten warning areas, reporting the number of elements of the 10 duration matrices for the two performance criteria reported in Fig. 1, i.e., the alert classification criterion (herein called criterion A) and the grade of correctness criterion (herein called criterion B). The time unit considered in the duration matrix is the day, consistently with the temporal discretization available for the considered data sets (Δt = 1 day). Therefore, considering the time frame of the analysis (ΔT = 9 years), the total number of elements for each duration matrix is 3287 days. Figure 4a and b show the results in terms of performance indicators for the ten warning zones. Comparing the efficiency indexes (EI_A, EI_B, and EI_A+B), the highest values are reached for EI_B, due to the significant number of Yel elements observed in all the warning zones. This means that most of the MA and FA observed in the period of analysis are associated with minor errors of the model. The results provided by these three indicators generally agree in pointing out that the best-performing models are those adopted for zones A and C.
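The size of each duration matrix can be checked against the calendar with a short sketch. The exact first and last day of the analysis are not stated in the text, so 1 January 2008 and 31 December 2016 are assumptions, as are the variable names.

```python
from datetime import date

# Assumed endpoints of the period of analysis (only month and year are given).
start = date(2008, 1, 1)
end = date(2016, 12, 31)

elapsed_days = (end - start).days  # difference between the two endpoints
calendar_days = elapsed_days + 1   # count when both endpoints are included
```

Under these assumptions the difference between the endpoints is 3287 days, matching the total reported above, while counting both endpoints gives 3288. Which convention the tool adopts is not specified in the text.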
It is worth mentioning that in 6 cases out of 10, the EI_A is lower than 50%, indicating that the sum of MA and FA is higher than the number of CA (especially for zones I and L).
Among the error indicators, the probability of serious missed alerts indicator (P_SM-MA) is higher than 10% in 6 cases out of 10, pointing out that the majority of severe model errors are related to missed alerts of very large landslide events. This can be explained, as discussed in Stoffel et al. (2014), considering that temperature changes cause important modifications of the slopes' hydrological cycle, as well as of the precipitation type and behavior, such as a shortening of snow cover persistence during spring. The accelerated snowmelt contributes significantly to the triggering of shallow landslides, even in the presence of spring rainfall of moderate intensity, because water deriving from snow melting completely infiltrates the ground. On the other hand, although the probability of serious false alerts indicator (P_SM-FA) is equal to zero in 7 cases out of 10, in the remaining 3 cases, more than 30% of the FA are Pur errors. Regarding the missed and false alert balance (MFB), which represents the ratio of MA over the sum of MA and FA, Piciullo et al. (2017a) recommended values lower than 0.25 for operational Te-LEWS to be considered efficient (i.e., the duration of MA should be less than one-third of the duration of FA). This condition is respected only in 1 case out of 10, while MFB is equal to 1 for warning zones E, I, and L. Figure 5 reports a detailed analysis of the grade of severity of MA and FA and the grade of correctness of CA, distinguishing among Pur, Red, and Yel for the former and between Gre and Yel for the latter.
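The equivalence between the recommended MFB limit and the one-third ratio quoted in parentheses follows directly from the definition of MFB given above, as the short derivation below shows (it assumes MA + FA > 0, i.e., at least one erroneous prediction):

```latex
\begin{align*}
\mathrm{MFB} = \frac{MA}{MA + FA} < \frac{1}{4}
  &\iff 4\,MA < MA + FA \\
  &\iff MA < \frac{FA}{3}.
\end{align*}
```

By the same definition, MFB = 1 can only occur when FA = 0, that is, when all the recorded errors are missed alerts.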
In all the warning zones, some LE are missed in the period of analysis and in 7 cases out of 10, several Pur errors occur (i.e., a LE of class 4 missed). However, in almost all the cases (8 out of 10), most errors are Yel errors. The exceptions are represented by G and L, the two warning zones characterized by the highest numbers of missed alerts (15 and 12, respectively). The presence of a significant number of Pur and Red errors is probably due to adopted rainfall thresholds that are inadequately high for these warning zones. The number of FA is generally lower than the number of MA (except for zone H). Besides, in 3 cases out of 10, only MA and no FA are observed in the period of analysis. It should also be mentioned that, when FA occur, most of them are characterized by Pur and Red errors, revealing that in many cases, warning levels (WL) 3 and 4 were issued without the occurrence of large LE. Finally, the warning model was able to correctly predict the occurrence of several landslide events in all the warning zones, especially in A, B, and C. However, as already noted, a relatively small number of correct alerts are associated with the best performance of the model (i.e., Gre elements) and in three warning zones (E, H, and L), only Yel elements were observed.

Discussion
Metrics of success and error for Te-LEWS
Different performance indicators are available in the scientific literature for rainfall threshold validation and performance of LEWS (see the "Performance assessment of territorial landslide early warning systems" section). The following three indicators (see Table 2) are herein used for the comparison of the performance of the SMART model with different models adopted in other Te-LEWS: efficiency index_(1), performance criterion A (EI_A); and missed and false alerts balance (MFB). These values are discussed in relation to the values provided in the literature by different authors.
In the LEWS operating in Hong Kong, two warning models currently coexist. Indeed, a SWIRLS Landslip Alert (SLA) model was developed and added to the system to provide some lead time (up to 3 h) to the warnings before the standard landslip warning criteria are exceeded. The SLA model considers the rolling 21 h of measured rainfall plus a 3-h rainfall forecast, whereas the standard landslip warning model is based on the measured 24-h accumulated rainfall. The performance for the period 2001-2004 of both models has been reported in Cheung et al. (2006). The EI_A was equal to 61% and 78% for the SLA and the landslip warning models, respectively. Based on these values, the authors stated that both the SLA and landslip warning were found to be generally effective. To compare the performance of these models with the performance evaluation carried out in this manuscript for the SMART system, the data provided in the paper by Cheung et al. (2006) have been used to compute the missed and false alert balance. The SLA and landslip warning models in the period 2001-2004 showed MFB values equal to 33% and zero, respectively.
In Restrepo et al. (2008), a performance analysis of the prototype debris flow warning system for recently burned areas in Southern California has been carried out for the winter of 2005/2006 (first year of operation). In this case, the probability of detection (92%) and the false alert rate (72%) (see Table 1) have been evaluated. Considering the same database, for comparative purposes, the EI_A and the missed and false alert balance have been computed. The values are quite low for the success indicator EI_A: 28%. The [...] Fig. 6.
The performance analysis of the SIGMA model, employed in the LEWS operational in the Emilia Romagna region, Italy, and described in Martelloni et al. (2012) and Lagomarsino et al. (2013), reports very high values for the odds ratio and the efficiency index, which prompted the authors of those papers to highlight the very good predictive power of the model. However, those performance indicators have been computed including TN, and they are significantly influenced by the very high number of TN in the period of analysis. The analysis performed for the SMART system purposefully excludes TN elements from the computation of the performance indicators, for the reasons described in the "Indicators used for rainfall threshold validation and performance evaluation of Te-LEWS" section. For comparative purposes, the data provided in the papers by Martelloni et al. (2012) and Lagomarsino et al. (2013) have been used to compute new values of the performance indicators not including TN. The results for the three selected indicators are, respectively, for the two papers: EI_A equal to 15.9%, 14%; 0.17; missed and false alert balance equal to 6.9%, 5%. Since the values of the success indicators are quite low, also in this case the indicators have not been included in the comparison (Fig. 6). However, some updates to increase the performance of the SIGMA model have been recently implemented and published in Segoni et al. (2018b).
Calvello and Piciullo (2016) reported the first application of the EDuMaP method. They applied it to the municipal early warning system operating in Rio de Janeiro, Brazil, for which they carried out a parametric analysis. They also presented a list of indicators for the performance evaluation of LEWS. Among them, the efficiency index was evaluated in the same way as EI_A of Table 1 (considering how criterion A was applied); thus, its values are directly comparable with the results obtained for the SMART model. The two performance indicators were evaluated for two warning zones (out of 4) of the municipality: baia de Guanabara and zona Sul. In these two zones, the authors report values of EI_A equal to 75% (baia de Guanabara) and 66% (zona Sul). The MFB, herein calculated considering the data provided in the paper by Calvello and Piciullo (2016), has the following values: 14.5% for baia de Guanabara and 3.4% for zona Sul.
To compare the performance evaluation of the SMART model with the literature case studies previously mentioned, a radar chart is used (Fig. 6) (Piciullo et al. 2017a). Figure 6 clearly shows that the application of the performance criterion A is the most conservative (see blue markers) and that criterion B is the one providing the highest values of the indicators (see red markers). Zones A, B, C, E, F, and G have high values of EI_A compared with the references from the literature. Piciullo et al. (2017a) recommended a value of MFB lower than 25% for a warning model within an operational Te-LEWS to be considered efficient (i.e., only one wrong alert out of four is a MA). This condition is respected for 3 zones out of 10 in our analyses: B, C, and H. This comparison shows that the SMART model employed in zone C is giving the best performance. On the contrary, zones L and I should be considered for threshold improvement, since their performance is quite poor.

Comparison with simpler validation techniques
The validation of the thresholds used in the SMART LEWS was conducted in 2008 using a 2 × 2 contingency table (Tiranti and Rabuffetti 2010), evaluating the joint distribution of "yes"/"no" and "landslide forecast"/"occurrence." The performance evaluation was conducted considering the whole set of widespread shallow landslide events that occurred between 1990 and 2002. The indicators considered for the analysis were hit rate (HR), false alert rate (FAR), and efficiency index_(1) (EI). Formulas are reported in Table 1. The results showed the following values for the three analyzed indicators: HR = 0.83, FAR = 0.45, and EI = 0.49 (Tiranti and Rabuffetti 2010). Among all the single landslides that occurred in the period of analysis, 83% were correctly predicted. Yet, the high number of false alerts produced a rather low value of EI and a high value of FAR. As already mentioned, when employing a 2 × 2 contingency table for the performance evaluation of LEWS, it is possible neither to distinguish among different warning levels nor to consider the number of landslides. Consequently, it is not possible to identify the warning levels that cause false alerts. Usually, when a low warning level is issued, one or few landslides are expected. However, it is not always possible to record all the landslides that occur in a warning zone, since the area covered is always very wide and a significant part of it is not urbanized. Consequently, one should judge with care the assessment of the non-occurrence of one or few landslides when a low warning level has been issued. Table 5 reports the results of a performance evaluation of the SMART model conducted adopting the same validation technique (i.e., 2 × 2 contingency table) adopted in Tiranti and Rabuffetti (2010), using the database described in the "Performance analysis" section, for the period 2008-2016.
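The three indicators reported for the 2 × 2 contingency table are not independent, which allows a quick consistency check. The sketch below assumes, since Table 1 is not reproduced here, that HR = CA/(CA+MA), FAR = FA/(CA+FA), and EI = CA/(CA+FA+MA); under these assumptions 1/EI = 1/HR + 1/(1 − FAR) − 1. The function name is hypothetical.

```python
def efficiency_index_from_rates(hr, far):
    """Efficiency index (without TN) implied by a hit rate and a false alert rate.

    Assumes HR = CA/(CA+MA) and FAR = FA/(CA+FA), so that
    1/EI = (CA+MA)/CA + (CA+FA)/CA - 1.
    """
    if not 0 < hr <= 1:
        raise ValueError("hit rate must be in (0, 1]")
    if not 0 <= far < 1:
        raise ValueError("false alert rate must be in [0, 1)")
    return 1.0 / (1.0 / hr + 1.0 / (1.0 - far) - 1.0)


# Values reported for the 1990-2002 validation of the SMART thresholds.
ei_1990_2002 = efficiency_index_from_rates(hr=0.83, far=0.45)
```

With HR = 0.83 and FAR = 0.45 the implied efficiency index rounds to 0.49, in agreement with the value reported above, which supports the assumed definitions.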
Two different comparisons can be derived from the results of this new analysis: (i) comparison of the performance of the SMART model in two different time periods, using a simple validation technique, and (ii) comparison of the results obtained conducting the performance assessment in two different ways, i.e., by employing the EDuMaP method and a simpler validation technique. Concerning the first issue, i.e., comparison of the performance of the SMART model in two different time periods, the results clearly highlight a decrease of EI, whose values change from 0.49 (period 1990-2002) to 0.14 (period 2008-2016), demonstrating that the general accuracy of the SMART model significantly decreased in a relatively short period of time. This is also confirmed by the low value of HR (0.28), mainly due to the large number of missed landslide events (70 out of 97). Besides, an increasing number of false alarms can also be observed, as the value of FAR is equal to 0.79. The very different performance of the SMART model in the two periods could be associated with the non-stationarity of the rainfall characteristics in the two periods. The behavior of shallow landslide events in Piemonte changed between 1960 and 2016, as shown by the data reported in Table 6. Landslide events until around 2000 were characterized by an average return period of about 5 years, a high number of phenomena (from 1000 to more than 10,000) during a single event, and a higher frequency of occurrence in the fall season (September-November). After the year 2000, the frequency of the landslide events has increased (about one event per year), the main season of occurrence became spring (March-June), and the events are typically characterized by a lower number of landslides (from 50 to about 2000), as already reported by Stoffel et al. (2014) and subsequently updated by Tiranti et al. (2019).
All things considered, the performance of the SMART model has most likely also been influenced by the significant changes in the weather pattern that have been occurring in the area in a relatively short time (Cremonini and Tiranti 2018). In fact, the SMART model was calibrated considering landslide events that occurred between 1990 and 2002, thus practically using landslide data that predate the recorded (almost abrupt) change in the temporal and spatial distribution of widespread shallow landslide events.
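The indicator values quoted above for the two periods can be cross-checked with a short sketch. The formulas below are an assumption, since Table 1 is not reproduced here: HR = TP/(TP+FN), FAR = FP/(TP+FP), and EI = TP/(TP+FP+FN), where TP, FP, and FN are the hits, false alerts, and missed alerts of the 2 × 2 contingency table. The function names are hypothetical. Under these assumed definitions, EI follows from HR and FAR alone, so the reported values can be checked for internal consistency without knowing the raw counts.

```python
def hit_rate(tp: float, fn: float) -> float:
    """Assumed definition: fraction of occurred landslides that were forecast."""
    return tp / (tp + fn)


def false_alert_rate(tp: float, fp: float) -> float:
    """Assumed definition: fraction of issued alerts that were false."""
    return fp / (tp + fp)


def efficiency_index(tp: float, fp: float, fn: float) -> float:
    """Assumed definition: hits over all cases except true negatives."""
    return tp / (tp + fp + fn)


def ei_from_hr_far(hr: float, far: float) -> float:
    """EI implied by HR and FAR under the assumed definitions.

    From the three formulas: 1/EI = 1/HR + 1/(1 - FAR) - 1.
    """
    return 1.0 / (1.0 / hr + 1.0 / (1.0 - far) - 1.0)


if __name__ == "__main__":
    # Values reported in the text for the two periods.
    print(round(ei_from_hr_far(0.83, 0.45), 2))  # period 1990-2002
    print(round(ei_from_hr_far(0.28, 0.79), 2))  # period 2008-2016
    # 70 missed events out of 97 means 27 hits.
    print(round(hit_rate(27, 70), 2))
```

With these assumed formulas, the implied EI values round to 0.49 and 0.14, and the hit rate for 27 hits out of 97 events rounds to 0.28, in agreement with the figures reported for the two periods.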
Concerning the second issue, i.e., comparison of results obtained conducting the performance assessment in two different ways, the performance computed with the simpler validation technique is generally poor for all the warning zones (especially for E, H, and L), as highlighted by the very low values of EI (Table 5). Looking at HR, it is worth mentioning that in almost all the cases (apart from B and C) more than half of the landslides that occurred were not forecast by the model. Besides, the high values of FAR suggest that, for all the warning zones, most of the warnings issued are false alarms. On the other hand, the performance evaluation carried out with the EDuMaP method highlighted a relatively good model performance in several warning zones (especially in A, B, and C). This can be explained considering that the EDuMaP method allows for a more detailed analysis of the severity of the errors and the correctness of the predictions. The performance analyses carried out with the two methods also indicate different warning zones as the best-performing ones: A and C using the EDuMaP method; D using the simpler validation technique. This difference can be related again to the possibility of a more detailed assessment of the model performance when the EDuMaP method is used. In this case, this highlights that the large majority of MA and FA, in some warning zones, are not severe errors of the warning model.

Fig. 6a, b Comparison between the performance indicators of the SMART model, for all the 10 warning zones of the Piemonte region, and literature case studies. The computed values of the three versions of the efficiency index (EI_A, EI_B, EI_A+B) are compared with the following values: 61% (SLA from Cheung 2006); 78% (landslip warning from Cheung 2006); and 51% (Staley et al. 2013). The computed values of the missed and false alert balance (MFB) are compared with the following values: 33% (SLA from Cheung 2006) and 25% (Piciullo et al. 2017a)

Conclusions
The performance evaluation of LEWS is often overlooked; however, different indicators are available in the literature and can be employed for this task. These indicators have been homogenized and presented in the "Indicators used for rainfall threshold validation and performance evaluation of Te-LEWS" section. A few of them have been judged by the authors to be essential for describing the performance of a LEWS (see Table 2). The most important indicators that can give a general overview of the system performance are the efficiency index (EI) and the missed and false alert balance (MFB). The first can be used to evaluate the general success rate of a Te-LEWS; the latter can be used to evaluate the percentage of missed alerts among the wrong predictions (sum of false and missed alerts). Then, to gain a more detailed understanding of the severity of the missed and false alerts (i.e., wrong predictions that belong to the purple cells), it is relevant to evaluate and analyze the probability of serious missed alerts and the probability of serious false alerts (P_SM-MA and P_SM-FA). They quantify, respectively, the percentage of serious no-warning mistakes (i.e., missed alerts of a high LE class) and of serious no-landslide mistakes (i.e., false alerts with high levels of warning issued).
According to the results of these four indicators, it is possible to fully evaluate the system performance and to identify the warning levels and, consequently, the thresholds that need to be refined. Concerning the use of the efficiency index to evaluate different criteria (A, B, A+B), it is possible to state that the values of the EI for the criteria A and B correspond, respectively, to a lower and an upper bound (Fig. 6a). The use of EI for the combined criterion (A+B) is, however, to be preferred for the performance analysis. By comparing the performance of different Te-LEWS, it is possible to state that a system should have an EI higher than 60%. However, EI (A+B) in an efficient system should exceed 80%, as is the case for zones A and C in our application. Concerning MFB, it should not be greater than 20%, and preferably lower than 10%. For evaluating the indicators, the element d_11 (i.e., the TN values) of the duration matrix has been purposefully neglected, to avoid an overestimation of the performance (see the "Metrics of success and error for Te-LEWS" section).
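The reference values suggested above can be turned into a simple screening check. The following is a minimal sketch under stated assumptions: MFB is computed as MA/(MA+FA), following its description as the share of missed alerts among the wrong predictions, where MA and FA are the missed-alert and false-alert totals taken from the duration matrix. The function names and the verbal labels are hypothetical and are not part of the EDuMaP tool.

```python
def missed_false_balance(ma: float, fa: float) -> float:
    """Share of missed alerts among all wrong predictions (MA + FA)."""
    wrong = ma + fa
    if wrong == 0:
        raise ValueError("MFB is undefined when there are no wrong predictions")
    return ma / wrong


def rate_te_lews(ei_ab: float, mfb: float) -> tuple[str, str]:
    """Compare EI (A+B) and MFB, given as fractions, with the suggested values.

    EI: above 0.80 for an efficient system, above 0.60 as a minimum.
    MFB: not greater than 0.20, preferably lower than 0.10.
    The returned labels are illustrative only.
    """
    if ei_ab > 0.80:
        ei_label = "efficient"
    elif ei_ab > 0.60:
        ei_label = "acceptable"
    else:
        ei_label = "below minimum"

    if mfb < 0.10:
        mfb_label = "preferred"
    elif mfb <= 0.20:
        mfb_label = "acceptable"
    else:
        mfb_label = "too high"

    return ei_label, mfb_label


if __name__ == "__main__":
    # Made-up inputs, used only to show the call pattern.
    mfb = missed_false_balance(ma=1.0, fa=4.0)
    print(mfb, rate_te_lews(ei_ab=0.85, mfb=mfb))
```

Keeping the two indicators separate in the output, rather than merging them into one score, mirrors the way they are used in the text: EI summarizes the general success rate, while MFB describes how the errors are split between missed and false alerts.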
The definition of the landslide events deserves some remarks. It is influenced by a series of choices the analyst needs to make in selecting and grouping landslides (Calvello and Piciullo 2016). The definition of limit values to differentiate among k classes of landslide events (see Table 3) has been discussed at length with the SMART system managers. Standard or commonly used procedures do not exist in the literature since the classification of landslide events varies, as it should, as a function of the LEWS under investigation. Indeed, this classification depends on how the warning levels, and their thresholds, have been defined, as well as on the expected number of landslides associated with each warning level. For these reasons, it is of great importance that this parameter, as well as the definition of the performance criteria (see the "A tool for the application of the EDuMaP method" section), is defined by the analyst in agreement with the system managers. A parametric analysis carried out in Piciullo et al. (2017b) shows how the performance evaluation can differ as a function of the landslide criterion and how its definition is crucial to obtaining a correct performance evaluation of the warning model.
Performance assessment of Te-LEWS is fundamental to running an efficient warning model. Often the performance analysis of Te-LEWS is carried out considering a 2 × 2 contingency matrix. Yet, with this method, it is not possible to differentiate among different warning levels and among different numbers of landslides occurring in a given time interval. For instance, a missed alert of just 1 landslide is judged in the same way as a missed alert of many landslides. Moreover, the error associated with the highest level of warning issued when no landslides occurred is judged in the same way as the one associated with any lower level of warning issued with no landslides. To overcome these issues, an advanced method for the performance evaluation of LEWS should be used.
An Excel-based tool (freely available on request), programmed in VBA, has recently been released to broaden and speed up the application of the EDuMaP method. In this paper, EDuMaP has been applied, using the Excel tool, to the SMART warning model operational in the LEWS of Piemonte region, Italy. The results highlight that the SMART model performs well in some warning zones: A, B, and C (see Fig. 4). Detailed insights emerge by analyzing the results of the performance evaluation carried out with this method (see Figs. 4 and 5 and Table 6). The same considerations and analyses could not be carried out with simpler methods (see the "Comparison with simpler validation techniques" section). For instance, the EDuMaP method allows for a more detailed assessment of the seriousness of the errors and of the correctness of the predictions. In the specific case of the LEWS operating in Piemonte, the EDuMaP method highlighted that the large majority of MA and FA, in some warning zones, were not severe errors. As expected, the warning zones showing the highest performance differ when different performance evaluations are carried out: A and C using the EDuMaP method; D using the simpler validation technique (see the "Comparison with simpler validation techniques" section). Finally, it is worth mentioning that, after being operational for almost 20 years, the SMART model will soon be replaced by a new model, named SLOPS (Tiranti et al. 2019), that addresses some weaknesses of the previous model. In the near future, the performance of the SLOPS model in the early prediction of landslides will be evaluated with an advanced performance evaluation method and compared with that of the SMART model.