Introduction

Landslides are extremely widespread in the Italian territory and, along with floods, they are the most frequent natural hazard, causing the greatest number of fatalities and the greatest damage to property and infrastructure (Guzzetti 2000). Over time, landslide risk has increased because of the growing anthropization of the territory (ISPRA 2020a), even close to unstable areas. In Italy, the estimated annual losses caused by landslides are 3.9 billion Euros (Klose et al. 2016).

Landslide research chiefly relies on landslide inventories for a multitude of spatial, temporal or process analyses (Van Den Eeckhaut and Hervás 2012; Kirschbaum et al. 2015; Klose et al. 2015). These inventories can be created with several methods, such as photo-interpretation, field surveys (Brunsden 1985), remote sensing (Soeters and Van Westen 1996; McKean and Roering 2003; Lu et al. 2012; Bianchini et al. 2018; Solari et al. 2020), retrieval of data from technical reports and/or newspapers (Kirschbaum et al. 2010; Görüm and Fidan 2021; Guzzetti et al. 2008; Klimeš et al. 2017; Vennari et al. 2014; Rosi et al. 2019), or a combination of them (Dikau et al. 1996; Rosi et al. 2012; Rosser et al. 2017).

All these traditional approaches are usually quite accurate, but they are also very demanding, even at detailed scales (Brunsden 1985; Santangelo et al. 2010), and hence time consuming.

Mass media are generally the first and primary source of information about hazards for the public (Fischer 1994). Literature studies indicate that social sensors, such as tweets and posts on other social media websites, report a natural disaster much faster than observatories (Goswami et al. 2016). The data generated on social media provide a unique opportunity to capture disaster situations with a relatively high temporal and spatial resolution and to map different events across various locations (Fan et al. 2018; Rachunok et al. 2019; Saltelli et al. 2020). There are several ways of using social media in disaster management, including data collection, analytic workflow, narrative construction, disaster relevant information extraction, geo-localization pattern/text/image analytics and the broadcasting of information through social media platforms (Carley et al. 2016). Because social media are widely used for various purposes, vast amounts of user-generated data exist and can be made available for data mining (Gundecha and Liu 2012). Data mining research has successfully produced numerous methods, tools and algorithms for handling large amounts of data to solve real-world problems (Gundecha and Liu 2012). In recent years, artificial neural networks (ANNs) (Asheghi et al. 2020), support-vector machines (SVMs) and decision trees have been used extensively as data mining models (Goswami et al. 2016). Systems using automated or real-time updates are still uncommon and are only used for some types of natural hazards (Battistini et al. 2013, 2017; Calvello and Pecoraro 2018), mainly earthquakes, floods and wildfires, while creating a complete and updated database is more difficult for landslides (Galli et al. 2008; Santangelo et al. 2010). The methodology of Battistini et al. (2013, 2017) and Kreuzer and Damm (2020) allows the landslide database to be updated in near real time by applying data mining techniques to online newspaper articles.

Newspaper articles can represent a relevant source of data for landslide scholars, and several authors have used them to collect information about landslide events. Early works (Guzzetti et al. 1994; Cuesta et al. 1999; Devoli et al. 2007) were based on the manual search and collection of newspaper articles, while more recent works (Kirschbaum et al. 2010, 2015; Taylor et al. 2015; Klimeš et al. 2017; Görüm and Fidan 2021) used automated procedures to identify landslide-related news.

For example, Kirschbaum et al. (2010, 2015) and Klimeš et al. (2017) used Google Alerts with appropriate keywords to identify the news items, and Taylor et al. (2015) used a set of Boolean search terms to query the Nexis UK newspaper archive.

Even when the news items were gathered by automated procedures, the literature review revealed that the collected data were usually analysed manually to identify the landslide location and the date of the event.

The newspaper articles used in this work have been harvested by a data mining algorithm named SECaGN (Semantic Engine to Classify and Geotagging News, Battistini et al. 2013). The data mining takes place within Google News because it covers national and local newspapers more completely. The identified data are automatically dated, located and arranged by the system and filed in a geodatabase in near real time. This source of information allows continuous feedback from the real world, and the news related to landslides can be collected rapidly (the system is set to scan Google News every 15 min) and used in much shorter times than with traditional techniques (Battistini et al. 2017). In addition, it allows a more complete landslide database to be defined, which also includes events with lower social impact and less catastrophic effects.

The objective of this work is to obtain the spatial and temporal distribution of landslide phenomena in Italy through online news harvested by SECaGN. To achieve this goal, the articles retrieved from online newspapers were first validated and then classified into 3 classes: (i) articles related to recent landslide events, where the landslide triggering date and its approximate location can be identified; (ii) articles related to landslides but without information about the triggering date or with a low location accuracy (province, region or geographical zone); and (iii) articles not related to landslides, which were removed from the database. Landslide-related articles were analysed to assess their spatial and temporal distribution and then compared with the landslide hazard map and the map of the population living in landslide risk zones. This work explores the possibility of using web data mining to create a landslide database over a large area; it also shows that this approach is not suitable for very detailed landslide inventories, since some technical data, such as landslide type, volume or exact location, cannot always be retrieved from newspapers.

Study area

Italy covers almost 300,000 km² and is divided into 107 provinces and 7926 municipalities, most of them affected by landslide hazard (Figure 1A). Much of Italy consists of hilly and mountainous terrain subject to landslides of different types and sizes (Guzzetti 2000). Currently, the IFFI database (the Italian Landslide Inventory) reports 470,000 landslides, involving an area of 20,000 km², which represents 6.6% of the national territory. The most common movement types are rotational and translational slides (slump and slide), debris flows and complex movements (as defined in Cruden and Varnes 1996).

Figure 1

A Regions of Italy and B digital elevation model with the names of cited cities

The main mountain ranges are the Alps, which span from East to West along the northern border of the country, and the Apennines, which cross the country from North to South (Figure 1B). In the alpine area, which is formed mainly of metamorphic rocks (Vai and Martini 2001; Salvatici et al. 2018), the most frequent phenomena are rock falls and debris avalanches (Agliardi and Crosta 2003; Panizza et al. 2011), while in the Apennines, which are formed mainly of arenaceous flysch (Vai and Martini 2001; Agostini et al. 2014; Rosi et al. 2018, 2021), the most common landslides are rotational and translational slides, both shallow and deep-seated.

The climate of Italy is mainly Mediterranean, with dry and warm summers and mild and wet winters; during winter, snowfall is frequent both on the Alps and on the Apennines, and the consequent snowmelt in the springtime often leads to the mobilization of landslides.

Materials and methods

Semantic Engine to Classify and Geotagging News (SECaGN)

SECaGN is an algorithm based on a mechanism of acquisition, management and publishing of online articles related to natural hazards (landslides, floods and earthquakes). It aims to obtain information about the spatial and temporal distribution of the events. The automatic search for newspaper articles is performed by combining primary words, synonyms, and singular and plural forms (keywords) in the Italian language related to the landslide topic. The data mining is applied inside Google News. After the acquisition process, a data filtering procedure is applied to separate non-relevant information from pertinent items. The data filtering takes place through the geotagging and the cataloguing of articles using three scores (Battistini et al. 2013):

  • Place score: a score value is assigned to evaluate the reliability of the geotag.

  • Event score: index of the probability that the news item actually concerns the topic event.

  • Time score: estimated days between the time of occurrence of the event and the time of publication of the article.

All the newspaper articles that reach a minimum score are then filed in a geodatabase and their location can be viewed in a dedicated WebGIS (Figure 2A). The whole process is repeated every 15 min.
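
The filtering step can be pictured with a minimal sketch. The field names, the score ranges and the threshold values below are assumptions made only for illustration; the actual scoring rules are those described in Battistini et al. (2013).

```python
from dataclasses import dataclass


@dataclass
class Article:
    headline: str
    place_score: float  # reliability of the geotag (assumed range 0-1)
    event_score: float  # probability that the item concerns a landslide (assumed range 0-1)
    time_score: int     # estimated days between the event and the publication


# Hypothetical thresholds, not the values used by SECaGN.
MIN_PLACE_SCORE = 0.5
MIN_EVENT_SCORE = 0.5
MAX_DAYS_SINCE_EVENT = 30


def passes_filter(article: Article) -> bool:
    """Return True when the article satisfies every assumed threshold."""
    return (
        article.place_score >= MIN_PLACE_SCORE
        and article.event_score >= MIN_EVENT_SCORE
        and article.time_score <= MAX_DAYS_SINCE_EVENT
    )


# Made-up articles: the second uses "frana" in a figurative (sport) sense.
articles = [
    Article("Frana sulla strada provinciale", 0.9, 0.8, 1),
    Article("La squadra frana nel secondo tempo", 0.2, 0.1, 0),
]
kept = [a for a in articles if passes_filter(a)]
```

In this sketch only the first article would be filed in the geodatabase.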

Figure 2

Workflow of the work: A data mining and geotagging procedure; B news analysis

This data mining methodology was calibrated and tested in Italy during a test period of 2 years (November 2009–November 2011). The process is completely automated and scalable. It can also be applied in other countries after a specific tuning of the keywords used by the data mining algorithm.
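
As an illustration of how a keyword set could be assembled and later tuned for another language, the sketch below combines primary words and synonyms with their singular and plural forms into a single search string. Only the word "frana" comes from this paper; the synonym, the table layout and the OR-separated query syntax are assumptions, not the SECaGN implementation.

```python
# Hypothetical keyword table: each primary word or synonym maps to its
# singular and plural forms. The synonym is an illustrative assumption.
KEYWORD_FORMS = {
    "frana": ("frana", "frane"),
    "smottamento": ("smottamento", "smottamenti"),
}


def build_query(keyword_forms: dict) -> str:
    """Join every form of every keyword into one OR-separated query."""
    terms = []
    for forms in keyword_forms.values():
        for form in forms:
            if form not in terms:  # keep insertion order, drop duplicates
                terms.append(form)
    return " OR ".join(terms)


query = build_query(KEYWORD_FORMS)
```

Adapting the search to another country would then amount to replacing the keyword table.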

Manual supervision

The SECaGN algorithm identified 184322 newspaper articles about landslide events from 2010 to 2019. The retrieved articles refer to 32525 generic events or “news”.

It should be noted that each landslide event can be reported by one or more newspapers, depending on its impact or on the relevance of the affected area; for example, small landslides involving a major road or an important city can have a vast media echo, while landslides involving minor roads or small villages are reported only by local newspapers.

In this way, a landslide event can be reported in several newspaper articles, which are grouped into a single news item that hence refers to a specific landslide event.
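
The relation between articles and news can be sketched as a grouping operation. The grouping key used here (same municipality and same event date) and the records are assumptions made for illustration; the text does not specify the rule applied by SECaGN.

```python
from collections import defaultdict
from datetime import date

# Made-up articles: (municipality, event date, newspaper).
articles = [
    ("Genova", date(2014, 10, 10), "national daily"),
    ("Genova", date(2014, 10, 10), "local daily"),
    ("Sondrio", date(2014, 10, 11), "local daily"),
]


def group_into_news(items):
    """Group articles that share the assumed key into one news item."""
    news = defaultdict(list)
    for municipality, event_date, newspaper in items:
        news[(municipality, event_date)].append(newspaper)
    return dict(news)


news = group_into_news(articles)
```

Here three articles collapse into two news items, one of which is covered by two newspapers.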

Even though the SECaGN results were already tested in previous papers (Battistini et al. 2013, 2017), in this work the news items underwent a manual verification and a classification based on their relevance, localization accuracy and time of publication. This classification (Figure 2B) makes it possible to identify the most relevant news, in terms of temporal and spatial accuracy of landslide event identification.

For the classification, 3 classes have been defined (Figure 3, Table 1):

  1. Class 1:

    “Near real time news”. This category contains all the news referring to ongoing or very recent landslide events (same day or a couple of days before). These news items are also characterized by a high level of spatial accuracy (at least the municipality must be identified), with an approximation of a few kilometres. Some news items with high temporal precision but low spatial accuracy have been manually relocated (where possible) based on the article text, to reach the required level of approximation. The news in this class can be used for further analyses or modelling (Battistini et al. 2017).

  2. Class 2:

    “News generically referred to landslides”. This category stores the news referring to past landslides with an unknown triggering date (e.g. “the initiation (or finishing) of works aimed to risk reduction or to landslide remediation”). News with a low spatial accuracy (referring to provinces/cities or geographical areas) is classified in Class 2 as well. This kind of news is useful to identify areas that have been affected by landslides in the past and for hazard/risk zoning.

  3. Class 3:

    “News not related to landslides”. News not related to the landslide topic but whose semantic association led to a misclassification. After this work, these news items were removed from the database.
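
The three class definitions can be summarized as a decision rule. In this work the classification was carried out manually, so the function below is only an illustrative sketch: the 2-day limit used for "a couple of days" and the reduction of spatial accuracy to a single flag are assumptions.

```python
from enum import IntEnum
from typing import Optional


class NewsClass(IntEnum):
    NEAR_REAL_TIME = 1  # recent event, municipality identified
    GENERIC = 2         # landslide-related, but undated or poorly located
    NOT_RELATED = 3     # misclassified by semantic association


# Assumed reading of "same day or a couple of days before".
MAX_EVENT_AGE_DAYS = 2


def classify(is_landslide: bool,
             event_age_days: Optional[int],
             municipality_known: bool) -> NewsClass:
    """Apply the class definitions; event_age_days is None if undated."""
    if not is_landslide:
        return NewsClass.NOT_RELATED
    recent = event_age_days is not None and event_age_days <= MAX_EVENT_AGE_DAYS
    if recent and municipality_known:
        return NewsClass.NEAR_REAL_TIME
    return NewsClass.GENERIC
```

For example, a dated and recent item located only at province level falls into class 2 unless it can be relocated from the article text.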

Figure 3

General distribution of the used news in Italy

Table 1 Description of the 3 classes used to group the news.

The news classified in classes 1 and 2 was then used to explore the landslide distribution in Italy, at both regional and provincial levels (in Italy, each region is divided into provinces, and each province is divided into several municipalities), as well as to explore the temporal distribution of the news.

Headline text analysis

The headlines of the articles have been analysed using Natural Language Processing (NLP) techniques (Liddy 2001).

NLP is a computerized approach to textual analysis, and it provides several techniques to model textual data. In this work, the word frequency technique has been used with the aim of identifying the most common word associations for both “good” and “bad” news. The results of this analysis can help to improve the data mining algorithm.
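
A word frequency count of this kind can be sketched with the Python standard library. The tokenization rule, the stopword list and the example headlines are assumptions for illustration and do not reproduce the tool used in this work.

```python
import re
from collections import Counter


def word_frequencies(headlines, stopwords=frozenset()):
    """Count lower-cased words across headlines, skipping stopwords."""
    counts = Counter()
    for headline in headlines:
        words = re.findall(r"\w+", headline.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts


# Made-up Italian headlines.
headlines = [
    "Frana sulla strada provinciale",
    "Strada chiusa per frana",
]
freq = word_frequencies(headlines, stopwords={"sulla", "per"})
```

Running the same count separately on the headlines of each class gives the per-class word rankings.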

Results

From 2010 to 2019, 32525 news items were gathered by the data mining algorithm. Among them, 13275 had useful information about the geo-localization and the date of the landslide event; 1400 were corrected by attributing a more appropriate localization based on the text of the article.

According to the adopted classification criteria, the identified news has been classified as follows:

  • Class 1: 13275 news (41%)

  • Class 2: 18603 news (57%)

  • Class 3: 647 news (2%)
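
The percentages above follow directly from the class counts, as the short check below shows; it uses only the numbers reported in this section.

```python
# Number of news items per class, as reported in the text.
counts = {1: 13275, 2: 18603, 3: 647}

total = sum(counts.values())
# Share of each class, rounded to the nearest whole percent.
shares = {cls: round(100 * n / total) for cls, n in counts.items()}
```

The three counts add up to the 32525 news items gathered, and the rounded shares are 41%, 57% and 2%.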

This classification made it possible to identify the “true news” (classes 1 and 2) and to reject the inappropriate data (class 3), reducing the data to be processed. About 41% of the news reported information about recent landslides, and only a small percentage of the database is made up of wrong news (2%) (Figure 4A). A textual analysis was conducted to retrieve the frequency of words in the headlines. Figure 4B, C and D report the most frequent words in the headlines of class 1, 2 and 3 news, respectively. The term “landslide” is the most frequently used word in all categories: it occurs 8021 times in class 1, 10457 times in class 2 and 271 times in class 3.

Figure 4

A Overall landslide news classification. B Word frequency in the headlines of class 1. C Word frequency in the headlines of class 2. D Word frequency in the headlines of class 3

After the word frequency analysis, the spatial distribution of the data was explored, as described below.

The data mining algorithm cannot identify the exact location of a landslide, since it is not usually reported in newspapers; therefore, the data have been grouped on a regional basis (Figure 5A) and on a provincial basis (Figure 5B) to identify the areas with a higher number of landslide news. Class 2 news has been used only for the regional-scale aggregation, since some of it does not provide an adequate localization accuracy for a more detailed analysis. According to the spatial distribution of the news, during the 10-year period analysed, 41.7% of the municipalities suffered at least one landslide.
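
The two aggregation levels can be sketched as simple counts. The records below are made up and the tuple layout is an assumption; the rule that class 2 enters only the regional count follows the text.

```python
from collections import Counter

# Made-up news records: (class, region, province or None if unknown).
news = [
    (1, "Liguria", "Genova"),
    (1, "Liguria", "Savona"),
    (2, "Liguria", None),
    (2, "Lombardia", "Sondrio"),
]

# Regional aggregation uses classes 1 and 2.
by_region = Counter(region for cls, region, _ in news if cls in (1, 2))

# Provincial aggregation uses class 1 only.
by_province = Counter(
    province for cls, _, province in news if cls == 1 and province
)
```

In this toy example the class 2 item from Sondrio contributes to the Lombardia total but not to any provincial count.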

Figure 5

Spatial distribution of landslide news: A Regional aggregation with overall news (classes 1, 2); B province with only news about recent landslides (class 1). Genova is the province most affected by landslides, followed by Salerno, Messina, Savona and Sondrio. The Puglia region and the provinces along the North-East coast show a lower number of landslide events

The regions most affected by landslides are mainly in the northern area of the country. Liguria and Lombardia are the regions with the highest number of news (classes 1 and 2) and therefore of published articles (articles referring to the same landslide event are grouped into a single “landslide news”). For example, Liguria has 36451 articles referring to 4318 landslide news (classes 1 and 2, Figure 5A); among them, 19844 articles refer to 1174 recent “landslide events” (class 1, Figure 6A), and in particular Genova is the province most affected by landslides (Figure 5B).

Figure 6

Comparison between the number of published articles and of landslide events. A Regional distribution, B temporal distribution. In both panels, the histogram represents the distribution of published articles about recent landslides (class 1), and the black and orange lines represent the number of landslide events; both refer to class 1 data

Besides the alpine area, several other provinces across the country showed a relevant number of news (Salerno, Messina, Savona, Sondrio). They are mainly located along the western coast (Tyrrhenian coast) and along the Apennine mountain belt (Figure 5B), which has historically been affected by landslides because of its geological origin and the high frequency of clayey slopes.

The Puglia region (Figure 5A) and the provinces along the North-East coast (Figure 5B) show a lower number of landslide news because they are mainly flat areas where fewer landslides are expected (Figure 6A); the same holds for the southern part of Lombardia and Veneto and the north-eastern part of the Emilia-Romagna region.

Figure 6A shows the distribution of class 1 news only (referring to recent landslide events) at regional scale; in this case too, Liguria is the region with the highest number of both articles and landslide events. Lombardia is the second region by number of landslide events, but with a lower number of articles, while Sicilia and Toscana are the second and third regions, respectively, in terms of published articles, although with a lower number of landslide events.

From a temporal point of view (Figure 6B), the year with the highest number of landslide-related articles (blue bars) is 2014, while the number of landslide events (orange line) showed a very sharp increase from 2017 (1243 events) to 2019 (2901 events).

Once a general overview of the spatial and temporal distribution of the news had been obtained, a more detailed analysis of class 1 news only was carried out.

Figure 7A displays the monthly distribution of the landslide events identified by the class 1 data; it shows that November, March and February are the months most affected by landslides.

Figure 7

Temporal distribution of class 1 news. A Monthly distribution of “landslides events”; B the number of days with at least 1 landslide reported from 2010 to 2019

Indeed, over the 10 years, November accounted for 2093 landslide events with 20142 published articles (multiple articles can refer to the same landslide event, as described in the previous section), while July, June and September are the months with the fewest events. For instance, 597 landslide events were reported by newspapers in July.

Class 1 news has been further analysed to identify the number of days with at least 1 landslide reported (Figure 7B).
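
Counting the days with at least one reported landslide amounts to counting distinct event dates per year, as in the following sketch. The event dates are made up and the function name is hypothetical.

```python
from collections import defaultdict
from datetime import date

# Made-up event dates: two events share the same day in 2019.
event_dates = [
    date(2019, 11, 3),
    date(2019, 11, 3),
    date(2019, 11, 24),
    date(2018, 3, 5),
]


def landslide_days_per_year(dates):
    """Return the number of distinct days with at least one event, per year."""
    days = defaultdict(set)
    for d in dates:
        days[d.year].add(d)  # a set keeps each calendar day only once
    return {year: len(unique_days) for year, unique_days in days.items()}


days_per_year = landslide_days_per_year(event_dates)
```

Several events on the same day therefore count as a single landslide day.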

The annual distribution (Figure 7B) shows a gradual increase in days with at least 1 landslide from 2015 to 2019; in this period, 8103 landslide events were collected, distributed over 1378 days, with an average of almost 5 landslides each day, while from 2010 to 2014, 5172 landslide events, distributed over 1236 days, were reported.
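
The value of a daily average depends on the denominator. The short calculation below uses only the counts reported above and computes the events per landslide day and, as an alternative reading, the events per calendar day; the text does not state which denominator underlies the rounded averages, so both are shown.

```python
from datetime import date

# Class 1 counts reported in the text:
# (events, landslide days, first day, last day of the period).
PERIODS = {
    "2010-2014": (5172, 1236, date(2010, 1, 1), date(2014, 12, 31)),
    "2015-2019": (8103, 1378, date(2015, 1, 1), date(2019, 12, 31)),
}


def daily_averages(events, landslide_days, start, end):
    """Return (events per landslide day, events per calendar day)."""
    calendar_days = (end - start).days + 1  # inclusive day count
    return events / landslide_days, events / calendar_days


averages = {label: daily_averages(*values) for label, values in PERIODS.items()}
for label, (per_landslide_day, per_calendar_day) in averages.items():
    print(f"{label}: {per_landslide_day:.1f} per landslide day, "
          f"{per_calendar_day:.1f} per calendar day")
```

With these counts, the averages are about 4.2 and 5.9 events per landslide day, or about 2.8 and 4.4 events per calendar day, for 2010–2014 and 2015–2019 respectively.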

The number of days with at least 1 reported landslide event (landslide day) is higher in the northern regions than in the southern ones, except for Sicilia, the southernmost region, where a high number of landslide days is present (Figure 8A). Overall, 5 regions out of 20 had at least 450 days with landslide events in the analysed period. Lombardia, Liguria, Campania, Sicilia and Toscana are the regions with the highest number of days characterized by landslides. In particular, 677 days with landslides have been identified in Lombardia, 572 in Liguria, 545 in Campania, 475 in Sicilia and 451 in Toscana (Figure 8A). The Puglia region has the lowest number of landslide days: in this region, 72 landslide events, distributed over 49 days, were reported.

Figure 8

Spatial distribution of days with reported landslides. A Regional distribution. B Provincial distribution

At a more detailed scale (Figure 8B), 4 provinces out of 107 have a high number of days with landslide events (180–301), while the average value is 23 days with landslides every year. For example, the Genova province is characterized by 915 landslide events, reported in 12942 articles, distributed over 301 days. The provinces with the fewest days with at least one landslide event are located along the North-East coast (Venezia, Rovigo, Ferrara and Ravenna).

In general, results show that Liguria, Lombardia, Campania, Toscana and Sicilia are the regions with the highest number of both “landslide events” and “landslide days”.

Comparison with existing datasets

In order to validate the quality of the results, mainly of the spatial distribution of landslide events, a comparison with existing datasets about landslides has been made. The landslide hazard map of Italy (Trigila et al. 2018) and the map of population living in landslide-risk areas (Trigila et al. 2018) have been used.

These 2 maps have been processed to extract the percentage of the area of each region affected by landslide hazard (Figure 9B) and to calculate the percentage of the population of each region living in zones affected by landslide risk (Figure 9C). This operation was needed to account for the differences in size and population between regions, which can vary greatly. Furthermore, some large regions (e.g., Lombardia, Veneto, Emilia-Romagna) are characterized by wide plain areas, which results in low percentages of territory affected by landslide hazard. The use of the population at risk (as a percentage of the total regional population) was aimed at overcoming this problem.
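
The normalization described above is a plain percentage of a regional total. The sketch below shows it for a hypothetical region; the areas and population figures are invented for illustration and are not taken from Trigila et al. (2018).

```python
def percentage(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


# Hypothetical region (illustrative values only).
hazard_area_km2 = 1500.0
region_area_km2 = 5000.0
people_at_risk = 40_000
region_population = 1_600_000

hazard_area_pct = percentage(hazard_area_km2, region_area_km2)
people_at_risk_pct = percentage(people_at_risk, region_population)
```

Expressing both quantities as percentages makes regions of very different size and population comparable.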

Figure 9

Comparison between the distribution of landslide news (classes 1 and 2, A), landslide hazard (B) and people at risk (C)

The comparison between the three maps in Figure 9 shows a good agreement between the distributions of landslide news, landslide hazard and people at risk for several regions, even if some anomalies can be identified. For instance, Valle d’Aosta shows a low number of landslide news but a very high portion of its territory subject to landslide hazard (94%), while Lombardia has a high number of news and a low percentage of its territory subject to landslide hazard.

Then, the number of landslide events was correlated with the aforementioned percentages to better verify the existence of a correspondence between these variables (landslide events, landslide hazard and population at risk). As shown in Figure 10A, there is a general correlation between the number of news (classes 1 + 2) and the areas affected by landslide hazard for each region, as well as with the population living at risk (Figure 10B). The distribution of the data shows some anomalies that are due to the morphology of the territory and the size of the regions. As stated above, some large regions (Emilia-Romagna, Lombardia, Piemonte, Veneto and Sicilia) are characterized by large plain areas, so the percentage of hazardous zones (for landslides) is low, but not negligible; this leads to a higher news/hazard ratio than in other regions, such as Liguria, Toscana or Trentino Alto Adige, where there are few plain areas.
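
One common way to quantify such a correspondence is the Pearson correlation coefficient. The text does not state which measure was used, so the choice of Pearson's r is an assumption, and the regional values below are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant sample has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Made-up regional values: number of news and hazard area percentage.
news_counts = [100, 200, 300, 400]
hazard_pct = [10.0, 20.0, 30.0, 40.0]
r = pearson(news_counts, hazard_pct)
```

A value of r close to 1 indicates that regions with more news also have a larger share of hazardous territory, while regions with wide plains would pull the coefficient down.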

Figure 10

Correlation between the number of landslide news (classes 1 + 2) and the percentage of landslide hazard area (A) and the percentage of people at risk (B)

Discussion

In this study, “landslide news” in Italy, automatically retrieved from web sources from 2010 to 2019, has been used to create a landslide database, which has been analysed to evaluate the spatial and temporal distribution of landslide events.

For the analysis, only newspaper articles reported inside the Google News aggregator have been considered, because it collects national and local newspapers, offering a better coverage of the data and hence a more complete landslide database.

Over 40% of the news reported useful information (geo-localization, date) about recent landslides (class 1), while 57% of the news can be used to identify an area affected by a landslide, but not the date of triggering (class 2); both can be useful to analyse the distribution of landslide events and hence for landslide hazard estimation.

In 10 years, in Italy, 184,322 articles related to landslides have been released by online newspapers; among them, 78550 articles referred to 13275 recent landslide events (class 1).

A textual analysis was conducted to obtain the frequency of words within headlines. In class 1 news, the majority of words are synonyms of the word “landslide”; in class 2, the most widely used words refer to hazard, alert, weather forecast or, in any case, to past or future events without useful information about recent landslide events, while the words in class 3 are the result of wrong word associations. However, in all categories (classes 1, 2 and 3), the most widely used word is “frana”, which is the Italian word for landslide but also has several figurative meanings, unrelated to landslide events, that are often used in other contexts such as sport or politics. The other commonly used word is “road”; this word occurs in classes 1 and 2 with a high frequency, since landslides involving roads lead to a higher media coverage (high number of class 1 news) than landslides involving uninhabited areas (e.g. forests), as does the remediation work needed to restore the damaged roads (high number of class 2 news).

The analysis of the landslide database made it possible to define the spatial and temporal distribution of landslide events. Considering only class 1 articles, the events are mainly located along the Alps and the Apennines.

The regions with more “landslide events” are mainly located in the northern part of Italy, where the geological, geomorphological and climatic context of the Alps, along with permafrost melting and freeze-thaw cycles, leads every year to several landslides (Giardino et al. 2004; Ratto et al. 2007; Cignetti et al. 2016).

Several areas along the Apennines are also highly affected by landslides.

Liguria is the region that shows the highest number of events and days associated with landslides; its territory, in fact, is characterized by steep slopes with few flat areas along the coast and in the valleys. These areas are highly urbanized, which, in combination with the land use, results in a geomorphological evolution characterized by frequent landslide and flood events.

The least affected areas are located along the North-East coast and in Puglia because they are mainly flat areas, more likely to be affected by other geo-hazards, such as floods.

The number of articles, and therefore of landslide events, increases from 2015 to 2019; the average number of landslides per day increased from 3 in the period 2010–2014 to 5 in the period 2015–2019. This increase could be due to several factors, such as the increasing number of high-intensity rainfall events or land use changes (Crozier 2010). Some authors have related global climate change to a rise of the global temperature with a more frequent occurrence of extreme events in general (Rebetez et al. 1997; Easterling et al. 2000; Rosenzweig et al. 2008; Knight and Harrison 2009; Keiler et al. 2010), such as intense and localized precipitation. Theoretically, all these climatic parameters may influence the pre-conditions and triggering mechanisms of landslides and hence may lead to an enhanced frequency of landslides in general (Beniston and Douglas 1996). Furthermore, inaccurate land use management can lead to an increase of mass movements across the whole Italian territory over time (ISPRA 2020a).

Landslide events have a certain seasonal distribution during the 10-year observation period. Indeed, during the wet season (from October to April), landslide events are more frequent, since autumn and winter are the rainiest periods of the year.

Conversely, the frequency is lower during the dry season (from May to September), even if isolated landslide events can be found, usually related to severe storms that strike small areas (a few tens of square kilometres).

These results are in agreement with previous works in which the seasonal distribution of landslides was investigated in the Campania (Cascini et al. 2014) and Toscana (Rosi et al. 2012) regions or at national level (Guzzetti et al. 2005; Calvello and Pecoraro 2018).

A relevant number of landslides has also been reported in February and March: these months coincide with the end of winter and a rise in temperature, associated with snowmelt, which is a well-known landslide triggering factor in Italy (Cardinali et al. 2000).

The year 2019 is the year with the highest number of “landslide events” and involved days; according to ISPRA (2020b), the mean cumulative rainfall of this year was 12% higher than the mean over the 1961–2019 period, and autumn and spring were 47% and 19% rainier than usual. The years 2013 and 2014 have a high number of days with landslide news because of several articles, distributed over a long time interval, about the Mont de La Saxe landslide in the Valle d’Aosta region. The Mont de La Saxe landslide is a rock fall type landslide threatening a valley with buildings, streets and a river (Giordan et al. 2015). It underwent several reactivations over time that caused damage or led to road closures; each time, new articles were published and more days with landslide news were recorded.

The distribution of news (classes 1 + 2) shows a certain correlation with the percentages of landslide hazard areas and of people at risk. Liguria has a high percentage of hazardous area (58%) and a very high percentage of people living in landslide risk zones. Lombardia, instead, has a very high number of news but low percentages of territory subject to landslide hazard and of people at risk; this is because landslides are concentrated along the Alpine arc (see Figure 5B), in the northern part of the region, which is less populated than the southern, plain part.

Even if the approach used made it possible to take a picture of the landslide distribution in Italy, it is worth noting that mass media attention is not uniformly distributed across disaster-affected areas (Fan et al. 2020). The classification was necessary since each landslide event can be reported by one or more newspapers depending on its impact or on the relevance of the affected area; for example, small landslides involving a major road or an important city can have a vast media echo, while landslides involving minor roads or small villages are reported only by local newspapers. In some cases, the presence/absence of news could be affected by other factors such as disruption in communication services, socio-demographic factors (events affecting socially vulnerable populations get less attention) and the absence of points of attraction. Furthermore, a landslide can undergo several reactivations over time (see the La Saxe landslide), and therefore more articles are published. These factors can alter the real distribution of landslide hazard, leading on the one hand to underestimate the presence of landslides in rural areas, forests or areas without journalistic relevance and, on the other hand, to overestimate the hazard in the areas that are most relevant from a journalistic point of view.

One last observation must be made on the spatial resolution of the data used. Since online newspapers are the source of information, the accuracy of some parameters can be low: newspaper articles often lack technical details such as the type of landslide, its dimension or volume, and the exact location of the landslide is also rarely available from this kind of source. For these reasons, newspaper articles may be useful for analyses over large areas, but not for creating detailed landslide inventories or for detailed analyses.

Conclusion

Newspaper articles on the Internet or on crowdsourcing platforms can be regarded as a constant and continuous source of information about recent landslides with a high impact and consequences in terms of loss of infrastructure and human lives.

In this work, the spatial and temporal distribution of landslide events in the Italian territory has been presented. These analyses have been carried out using online landslide news harvested by the SECaGN algorithm from 2010 to 2019. The news database was classified into three classes on the basis of news relevance, localization accuracy and time of publication, so as to distinguish the “news referred to recent landslide events” (class 1) from “news generically referred to landslides” (class 2) and “news not related to landslides” (class 3). This classification made it possible to define, at national scale, the areas and periods mainly affected by landslide events. Around 41% of the news reported information about recent landslide events, and only 2% of the database is made up of wrong news. Through a semantic analysis, it was possible to identify the words with the highest frequency in the newspaper headlines. This made it possible to define the principal words that describe the landslide events, which, in turn, can be used to tune the data mining algorithm, limiting the news with wrong word associations.

Based on the results of this work, it is possible to conclude that events and news increased from 2015 to 2019 across the whole of Italy. November is the month with the highest number of landslide events. Lombardia, Liguria and Campania are the regions with the highest number of days characterized by landslide phenomena and consequently the highest number of published news. Data also showed that ca. 42% of Italian municipalities were affected by landslides in the observed period.

More generally, this updated landslide inventory made it possible to obtain the overall number of landslide events from 2010 onwards. Finally, this inventory can be used to relate the more detailed news (class 1) to rainfall in order to define rainfall thresholds, but it can also be used for hazard and vulnerability assessment (class 1 and class 2 news).

This work showed that data mining is a reliable methodology for creating a good landslide inventory in a relatively short time, albeit with a coarser spatial accuracy than traditional inventories. Future developments could extend the approach to other hazards and other languages, once appropriate dictionaries for the semantic engine are available.