Introduction

The damming of rivers by landslides and its consequences pose great threats to people and facilities both locally and in downstream areas. Landslide dams, which are effectively the subset of landslides that happen to block rivers, have been reported in many areas around the world, almost exclusively in mountainous regions (e.g., Costa and Schuster 1988; Fan et al. 2020). A landslide dam (LDam) forms when the mass of one or more landslides from the adjacent hillslopes partially or completely blocks a fluvial channel (e.g., Costa and Schuster 1988; George et al. 2019). Hundreds of LDams can occur among the thousands of landslides generated by a single triggering event, such as an earthquake or intense rainfall (e.g., Fan et al. 2012). The global economic losses and loss of life from LDams are considerable, as a single event can cause losses of millions of US dollars and many casualties (Dai et al. 2005; Tacconi Stefanelli et al. 2016; Fan et al. 2017). Moreover, LDams play an important geomorphological role as an interface connecting hillslope and fluvial channel systems (e.g., Korup 2002).

The primary consequence of a LDam is the flood hazard that follows its formation. The potential flood impacts of LDams can be divided into two parts: (1) upstream consequences, namely backwater floods induced by the rise of the water level upstream of the LDam; and (2) downstream consequences, namely downstream floods, mostly caused by LDam failure, either by overtopping or breaching, within short timescales (hours to days) (Zhang et al. 2016). Failure depends on the dam material properties (volume, texture, angle of repose, sorting), the rates of seepage through the dam, and the rates of inflow to and outflow from the LDam lake (Costa and Schuster 1988; Korup 2004; Tacconi Stefanelli et al. 2018). In contrast to river flooding, which normally occurs after intense or prolonged precipitation or increasing snowmelt, floods caused by LDam failures are harder to predict because of the rapid release of water from the impounded lake when the dam fails and the complexity of LDam stability. Dam-breach floods can raise river flows to many times the typical flood flows of a river system, and their flowrates can even greatly exceed those of recorded flash floods (Perucca and Angillieri 2009). The impacts of dam-breach flooding can extend over broad areas, since the travel distances of dam-breach floods vary from 1 to more than 1000 km (Geertsema 2008; Evans et al. 2011; Macias et al. 2004). The flood wave generated when landslide debris enters the fluvial channel can also have a significant impact, such as a water level tens of meters above the mean water level (Wang et al. 2004) or a flood wave with a large peak discharge (e.g., 5900 m3/s reported by Dunning et al. 2006). Such waves have the potential to kill thousands of people in downstream areas (Barla and Paronuzzi 2013).

The typical approach to studying LDams is currently to establish a dataset of case studies and undertake analyses of the conditions prone to LDam formation using empirical relationships among geomorphological parameters from the LDam and geomorphological records (e.g., Fan et al. 2012; Tacconi Stefanelli et al. 2016, 2018). The physical processes are very hard to capture accurately because observational data are lacking. The formation of LDams is difficult to forecast, which makes systematic monitoring difficult. Even though LDam events are not uncommon, the records and datasets for this specific hazard are sparse relative to its frequency of occurrence, especially on a global scale. The establishment of local, regional, or global LDam datasets has begun in earnest in the last few decades by collecting records from literature, field investigations, or remote sensing data (e.g., Costa and Schuster 1991; Fan et al. 2012, 2020; Tacconi Stefanelli et al. 2016, 2018; Zhang et al. 2016). These previous studies have explored and proposed frameworks for the data fields to be recorded in LDam datasets, which are easy to use and update, even for non-experts, and have discussed LDam classifications, processes, origins, distributions, and stability on both regional and global scales. Most of the LDams recorded in these datasets were triggered by single events, mainly earthquakes (e.g., the 2008 Sichuan earthquake). Such events result in more comprehensive area coverage and data accessibility than a single LDam caused by an isolated landslide. There are some studies on the formation, stability, and short-term impacts of LDams that use these datasets (Ermini and Casagli 2003; Korup 2002; Fan et al. 2014), and most of them focus on the geomorphological features of hillslopes, landslides, and the LDams.

However, a systematic understanding of the quantitative relationships between LDam formation and related geomorphological parameters on the global scale is still lacking, owing to the limited accessibility of valid records and parameters and the large variety of local conditions. Most global LDam formation studies rely on descriptive analysis of case studies or on quantitative relationships based on regional LDam studies or global landslide studies (Larsen et al. 2010; Tacconi Stefanelli et al. 2016, 2018). In previous LDam datasets, some of the geomorphological parameters of the landslides and blocked river valleys came from empirical statistical relationships based on hydrological, geomorphological, or landslide studies (Evans et al. 2011), while others came from records or other accessible data sources (Tacconi Stefanelli et al. 2016, 2018; Fan et al. 2020).

To fill these gaps, in this paper we develop a new global scale dataset of LDams, abbreviated as RAGLAD (River Augmented Global Landslide Dams), using a unified spatial reference system and consistent measurement units to allow a better understanding of the spatial distribution and characteristics of LDams and of the geomorphological conditions that might affect their formation on the global scale in relation to fluvial information. The records are collated from a wide range of data sources in multiple languages and geolocated with precise and accurate spatial information, where possible. The parameters of these records are then extended by linking the locations with recently available global fluvial datasets. This study is the first to present a combination of landslide databases with fluvial datasets. RAGLAD focusses on a global scale and allows us to explore and better understand the spatial distribution and geomorphological characteristics of LDams. For example, relationships between geomorphological parameters, such as landslide volume and river width, are developed to explore the connections between these parameters and to reveal potential parameter thresholds for LDam formation from a global perspective.

Methods

Global landslide dam record collection

For more comprehensive coverage, the records in RAGLAD were collected from a wide range of sources: academic journal articles, government and institution reports, social media, and other available datasets in multiple languages (for more details, please see the supplementary materials in Appendix A). The languages used most systematically for searching for relevant records were English, Chinese, and Japanese, although other languages related to the location of collected LDam records were also used, where available. Our data collection focused particularly carefully on developing countries, where there was a distinct lack of records in previous studies. The geomorphological measurements were obtained directly from publications or extracted from published figures. An added difficulty is that existing LDam records may include several events in one record. To create a unique LDam record for each location when a LDam reference consists of multiple separate events and contains information from different data sources, we precisely geolocated each LDam whenever its approximate location was available. During data entry, when sources presented conflicting data for one record, data from field investigations and from the reports closest in time to the LDam formation were prioritized. We focused on more recent LDam events (in the last 1000 years) because records and data from earlier times are sparse and their reliability is much less clear.
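The prioritization rule for conflicting sources can be illustrated with a minimal sketch. It assumes that "most recently to the LDam formation time" means the report date closest to the formation date; the dictionary keys, the function name `pick_preferred`, and all values below are hypothetical and are not taken from RAGLAD.

```python
from datetime import date


def pick_preferred(observations, formation_date):
    """Pick one observation when sources disagree about a parameter.

    Field investigations rank first; ties are broken by the report date
    closest to the dam formation date. All keys are hypothetical.
    """
    def rank(obs):
        gap_days = abs((obs["reported"] - formation_date).days)
        # False sorts before True, so field investigations come first.
        return (not obs["field_investigation"], gap_days)

    return min(observations, key=rank)


# Made-up example values, for illustration only.
formation = date(2008, 5, 12)
observations = [
    {"source": "news report", "field_investigation": False,
     "reported": date(2008, 5, 13), "dam_height_m": 60.0},
    {"source": "field survey", "field_investigation": True,
     "reported": date(2008, 6, 1), "dam_height_m": 82.0},
    {"source": "later review", "field_investigation": True,
     "reported": date(2012, 3, 1), "dam_height_m": 75.0},
]
print(pick_preferred(observations, formation)["source"])
```

Ranking by a tuple keeps the two criteria explicit and makes it easy to change their order if a different priority is preferred.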

Previous LDam databases contain geomorphological parameters similar to those in landslide records, as well as parameters related to the LDam body, river valley and fluvial channel, impounded lakes, and general information on LDam events; these were helpful in building our database and locating original sources of information (Costa and Schuster 1991; Fan et al. 2012, 2020; Tacconi Stefanelli et al. 2016, 2018; Zhang et al. 2016; Zheng et al. 2021). The records in the RAGLAD dataset contain spatial and temporal information from dam formation to dam breach, dam materials, geomorphological characteristics and dimension data of the upstream catchments at the point of blockage, landslides, and impounded lakes, and the hydrographic characteristics of subsequent flood events and their consequences, including casualties and economic losses, as shown in Table 1. For a better understanding of the relationships between landslide types and the geomorphological characteristics of LDam formation, we applied the landslide definitions established by Hungr et al. (2014), including fall, topple, slide, spread, and flow, to recategorize the original landslide types from the LDam records. Some descriptive and supplementary information on the LDam events is also included to allow easier updating of the spatial location and geomorphological parameters in the future, if necessary.

Table 1 The information field of LDam records (for more details, check with Supplemental Materials)

On the global scale, 84% of valley blockages resulting from LDams were reported as being caused by rainfall and earthquakes (Schuster and Costa 1986). A similar result was found by Zheng et al. (2021), with 50.4% of cases induced by earthquakes and 39.3% by rainfall. For landslides with large volumes, the result is slightly different: 44.4% of landslides were caused by rainfall and only 20.5% by earthquakes (Fan et al. 2020). At the global scale, localized effects that are also important cannot be considered. However, on a local scale, the fluvial erosion of channel beds and banks and erosion processes that change the condition of hillslopes are known to influence hillslope stability (Golly et al. 2017). Based on RAGLAD, the triggering processes of landslides that have formed dams can be summarized into three major categories: (1) meteorological triggering processes, including meteorological events such as storms, snowmelt, and intense precipitation, and the alternation of the seasons, when temperature and humidity change sufficiently; (2) geological preconditions and triggering processes, including a geological background that is prone to LDam occurrence (such as the layering and lithological characteristics of the bedrock on the hillslopes and long-term tectonic movement), infiltration towards potential or current landslide bodies that could alter the shear strength of surface materials and induce landslides, and geological hazards such as earthquakes, volcanic eruptions, and landslide reactivation; and (3) geomorphological triggering processes, such as changes in fluvial geomorphology caused by fluvial erosion at the base of a hillslope, alluviation, or flood hazards.

Geolocating LDam records

The precise and accurate spatial location of LDam records is crucial for further geospatial or geomorphological analysis and for linking parameters from other data sources (Fan et al. 2020). However, most of the spatial information in current global datasets was recorded as approximate location descriptions, without precise spatial coordinates. Hence, these datasets cannot be used to link records with other geolocated data. There are three key challenges in recording the precise spatial coordinates of records: (1) a LDam can form and fail in a very short time, so its precise location can escape being recorded; (2) older LDam events were not recorded with very precise locations because the locations were derived from geomorphological or sedimentary relics, which may not be clearly visible in the modern terrain (Tacconi Stefanelli et al. 2016); and (3) the source data are recorded in different formats and inconsistent units, with vague and patchy spatial information, which impedes the assignment of precise coordinates and can make the spatial information imprecise.

In this study, the records were geolocated individually by georeferencing, projection transformation, or interpretation of the geomorphological information and location description provided in the original data sources. To improve the spatial precision of the records, we visually scanned the target area in Google Earth to find the geomorphological signatures of the landslides that had created the dams. These signatures include landslide scarps, sharp color differences on the ground caused by changes in surface vegetation or the loss of soil cover, hazard mitigation infrastructure, and existing LDam bodies and impounded lakes. We then matched the approximate location with the descriptions and images reported on social media or in local reports to pin down the precise spatial location of each LDam record (for more details, see Supplementary Materials Appendix B).

Appending the data from the global fluvial dataset to landslide dam records

Even though the number of LDam records has increased over the past decades, some of the data, especially geomorphological data such as river width and drainage area at the point of the LDam, are not valid in every information field for each record. The LDam body and its deposits may be removed by erosion, in some cases within a few hours of formation, so it is sometimes impossible to record their dimensions through field investigations. Once most records are geolocated, linking accessible parameters from other data sources by spatial proximity becomes possible. This allows us to complete some records that lack valid data on fluvial system parameters, such as upstream drainage area and river width. In parallel, with the development of digital elevation models (DEMs) in recent decades, geomorphological parameters can be obtained easily. Therefore, linking geomorphological data from a validated global fluvial dataset can be a valuable approach.
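Linking by spatial proximity can be sketched as a nearest-neighbour search using great-circle distance. This is only an illustration under stated assumptions: the point-based representation of the fluvial data, the 1 km search radius, the dictionary keys, and the coordinates are all hypothetical, and the procedure actually used to sample the raster-based fluvial dataset is not described at this level of detail.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def link_nearest(dam, river_points, max_km=1.0):
    """Return the river point nearest to the dam within max_km, else None."""
    best, best_dist = None, max_km
    for point in river_points:
        dist = haversine_km(dam["lat"], dam["lon"], point["lat"], point["lon"])
        if dist <= best_dist:
            best, best_dist = point, dist
    return best


# Made-up coordinates and attributes, for illustration only.
dam = {"lat": 31.000, "lon": 103.500}
river_points = [
    {"lat": 31.001, "lon": 103.500, "drainage_area_km2": 120.0},
    {"lat": 31.500, "lon": 103.500, "drainage_area_km2": 950.0},
]
match = link_nearest(dam, river_points)
print(match["drainage_area_km2"] if match else "no river point within range")
```

The search radius matters: a dam geolocated to only about 1 km precision could be attached to the wrong channel in a dense network, which is one reason precise coordinates are needed before linking.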

Global fluvial datasets (GFDs) have made significant progress in recent decades. Lehner et al. (2006) released HydroSHEDS (Hydrological data and maps based on SHuttle Elevation Derivatives at multiple Scales), derived primarily from elevation data from the Shuttle Radar Topography Mission (SRTM) at spatial resolutions ranging from 3 arc-second (~ 90 m) to 5 arc-minute (~ 10 km). Yamazaki et al. (2014) developed a new global river width database to supplement fluvial data sources of global coverage with river widths. Allen and Pavelsky (2018) completed the first global compilation of river planform geometry based on Landsat images. Linke et al. (2019) published the HydroATLAS database, providing descriptive hydro-environmental information for worldwide watersheds and rivers at 15 arc-second (~ 500 m) resolution. In the same year, the first global dataset including mean annual flow, river width, slope, meander wavelength, sinuosity, and catchment area was created from river centrelines derived from Landsat images and the Shuttle Radar Topography Mission DEM (Frasson et al. 2019). Yamazaki et al. (2019) published MERIT Hydro, a new global flow direction map at a resolution of 3 arc-second (~ 90 m) derived from the latest elevation and waterbody data. Among all these GFDs, we selected the raster-based MERIT Hydro dataset for assembling drainage area and river width data because it reduces the vegetation biases in the elevations of satellite-derived DEMs, which can lead to an imprecise water body representation in forested areas, and includes the correction of many other DEM errors (Yamazaki et al. 2019). Because MERIT Hydro does not cover all the smaller catchments, the global empirical relationship between river width (\(\mathrm{W}\)) and drainage area (\(\mathrm{A}\)) from the study of Frasson et al. (2019) was also used to calculate river width for further comparison (Eq. 1). We compare the fluvial data from the original records with those derived from the GFDs by retaining the data from the original records and appending all the valid data from the GFDs, in order to evaluate the performance of the assembled data from these different sources.

$$\mathrm{W}=9.68{\mathrm{A}}^{0.32}$$
(1)
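As a quick illustration of Eq. 1, the sketch below evaluates the width for a few drainage areas. The units (drainage area in km2, width in m) are an assumption of this sketch, since they are not stated alongside the equation, and the function name is hypothetical.

```python
def river_width_from_drainage_area(area_km2):
    """Estimate river width W from drainage area A with Eq. 1.

    Assumed units: A in km2 and W in m (not stated in the text).
    """
    if area_km2 <= 0:
        raise ValueError("drainage area must be positive")
    return 9.68 * area_km2 ** 0.32


for area in (1, 20, 500):
    width = river_width_from_drainage_area(area)
    print(f"A = {area:>3} km2 -> W = {width:.1f} m")
```

Because the exponent is well below 1, width grows slowly with drainage area: a 500-fold increase in area gives only a roughly 7-fold increase in estimated width.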

Results

Observed trends in the landslide dam database

In total, 779 LDam records were compiled from 34 countries/regions. The records are clustered in mountainous areas around the world, especially the European Alps (Tacconi Stefanelli et al. 2016), Rocky Mountains (Costa and Schuster 1988; Clague and Evans 1994), Andes Mountains (Hermanns et al. 2011; Tacconi Stefanelli et al. 2018), Pamir Mountains (Storm 2010), Himalayas (Evans et al. 2011), the eastern edge of the Tibetan Plateau (Yin et al. 2009; Xu et al. 2009; Fan et al. 2012), and some mountainous areas on islands (Nash et al. 2008) (Fig. 1). In terms of the spatial information of the new database, 85% of the records (666 out of 779 records) contain location information with a precision of approximately 0.01 degree (~ 1 km), and 583 records (~ 75%) have a precision better than or equal to 0.001 degree (~ 100 m) after the geolocating process. The influence of geolocating on data completeness is shown in Table 2 (for all parameters, see the supplementary materials in Appendix D). Differences in data availability between countries, arising from funding, available expertise, and disaster management policies, strongly affect the spatial distribution of the data. For instance, most of the current LDams are recorded in developed countries, such as the USA, Canada, Italy, and Japan, and in some developing countries with many researchers working on this topic (e.g., China and Argentina).
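The correspondence between coordinate precision in degrees and ground distance can be checked with a short calculation. This sketch assumes a spherical Earth with the mean radius; the function name is hypothetical. East-west distances shrink with the cosine of latitude, so the quoted ~ 1 km and ~ 100 m values are upper bounds away from the equator.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius, spherical approximation


def precision_to_metres(precision_deg, latitude_deg=0.0):
    """Return (north-south, east-west) ground distance in metres."""
    arc_m = math.radians(precision_deg) * EARTH_RADIUS_M
    return arc_m, arc_m * math.cos(math.radians(latitude_deg))


for precision in (0.01, 0.001):
    north_south, east_west = precision_to_metres(precision, latitude_deg=45.0)
    print(f"{precision} deg -> {north_south:.0f} m N-S, "
          f"{east_west:.0f} m E-W at 45 deg latitude")
```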

Fig. 1

Spatial distribution of LDam records with precise geospatial locations in RAGLAD (a. Peru; b. Italy and southern Europe; c. Himalaya areas; d. Japan; e. New Zealand)

Table 2 Data entry completeness of specific critical parameters. Bold values are improvements after geolocating and adding Global Fluvial Datasets

The recorded dates of LDam formation show a clear increasing trend in the number of records over the past 1000 years, with the highest number in the last 20 years (Fig. 2). This increase is probably due to the greater amount of landslide research and the growing interdisciplinary interest in multi-hazard research, particularly in mountainous areas, rather than an actual increase in events. LDams that occurred in the past may have been recorded only occasionally, which leads to the under-representation of LDam numbers in the past (Tacconi Stefanelli et al. 2016). Extreme hazard events such as the 1783 Calabria earthquake in Italy, the 1889 Totsugawa flood in Japan, the 2008 Sichuan earthquake in China, and Typhoon Morakot in 2009 have contributed to the notable peaks in the LDam formation time trends.

Fig. 2

Formation time trend of LDam events; the inset shows the last 70 years in more detail

Landslide dam triggering processes

The database records show that the LDams were triggered by multiple factors. Among the 506 LDams with recorded categories of landslide triggering processes (Fig. 3), 314 were induced by extreme geological events, 151 were triggered by weather events, and 65 were directly caused by changes in the fluvial landform system. Note that a single LDam record may have more than one triggering factor, so these counts sum to more than 506. The results indicate that geological and meteorological triggering processes dominate the formation of LDams around the world: 62% of LDams were induced by geological triggers, 30% by meteorological triggers, and 13% by geomorphological triggers. This finding is similar to that of Schuster and Costa (1986), who found that more than 84% of global natural dams were triggered by factors related to earthquakes and precipitation, and that of Zheng et al. (2021), with 50.4% of landslides triggered by earthquakes and 39.3% induced by rainfall. A slight difference can be found for long run-out landslides according to Fan et al. (2020), who found that 20.5% of landslides were triggered by earthquakes and 44.4% were induced by rainfall. However, it is not clear whether the distribution of LDam triggering processes reflects the actual hazard occurrence or whether some of these triggers simply attract more research attention. Additionally, these triggering processes can place a strong bias on the dataset by producing a large number of landslides and landslide dams at once.
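The percentages above can be reproduced directly from the reported counts. The short check below uses only the numbers given in the text and shows why the shares add up to more than 100%: the category counts overlap because one record can carry several triggers.

```python
records_with_trigger = 506
counts = {"geological": 314, "meteorological": 151, "geomorphological": 65}

# Share of records carrying each trigger category, rounded to whole percent.
shares = {name: round(100 * n / records_with_trigger)
          for name, n in counts.items()}

# Category assignments exceed the record count because categories overlap.
total_assignments = sum(counts.values())
extra_assignments = total_assignments - records_with_trigger

print(shares)             # {'geological': 62, 'meteorological': 30, 'geomorphological': 13}
print(total_assignments)  # 530
print(extra_assignments)  # 24
```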

Fig. 3

Venn diagram of landslide triggering process categories (A, alternation of season; I, instant weather event; E, extreme geological hazard; G, geological background; the number following each category is the number of records with that landslide triggering process category)

Geomorphological characteristics of landslide dam records

As observational data of LDam formation processes in action are rarely available, summarizing the geomorphological characteristics of the upstream catchments and the landslides from the records can be helpful when considering the geomorphological conditions prone to LDam formation. We consider here four geomorphological characteristics of LDams in the RAGLAD database: drainage area (at the point of the LDam), river width (at the point of the LDam), the ratio of landslide drop height to landslide runout distance (H/L ratio), and landslide volume (Fig. 4). Together these describe the geomorphological conditions of the blocked river channel and of the landslide from the adjacent hillslopes that blocked it. This allows us to study two specific questions: (1) where LDams occur along the course of a river, and (2) what kind of landslides are likely to dam a river.

Fig. 4

Data distribution of geomorphological parameters of LDam records: (a) drainage area; (b) river width; (c) landslide H/L ratio; (d) landslide volume

Based on the geomorphological data of the blocked river channels, LDams occurred more frequently in the upstream areas of river systems, where the slopes tend to be steeper and therefore more likely to fail. The drainage area at the point of valley blockage shows an exponential distribution, with most records having a drainage area of less than 500 km2 (Fig. 4-a). Strahler stream order is used in river morphology as a measure of stream network connectivity. The most upstream reaches of a river are classed as order 1; when two streams of order 1 join, the downstream reach becomes order 2, and this continues downstream with ever increasing stream orders (Geological Survey 1965). The stream orders of the records range from 1st to 5th order. This large number of low stream orders, as well as the smaller catchment areas, confirms that LDams usually occur in upper river catchment areas. The distribution of river width suggests that LDams are more likely to occur in smaller rivers, with widths in the range of 0 to 300 m, as expected, which is similar to the result shown for the drainage area distribution (Fig. 4-b).
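The Strahler ordering rule described above can be sketched as a short recursive function. The network representation (a dictionary mapping each reach to the reaches that flow into it) and the reach names are hypothetical; the rule itself is the standard one, in which the order increases only where two tributaries of equal highest order meet.

```python
def strahler_order(reach, upstream):
    """Return the Strahler order of a reach.

    upstream maps a reach id to the list of reach ids flowing into it
    (a hypothetical representation chosen for this sketch).
    """
    tributaries = upstream.get(reach, [])
    if not tributaries:
        return 1  # headwater reach
    orders = sorted((strahler_order(t, upstream) for t in tributaries),
                    reverse=True)
    # The order increases only when the two highest tributary orders are equal.
    if len(orders) >= 2 and orders[0] == orders[1]:
        return orders[0] + 1
    return orders[0]


# Made-up network: a and b join to form c; c and headwater d join to form e.
network = {"c": ["a", "b"], "e": ["c", "d"]}
print(strahler_order("c", network))  # 2
print(strahler_order("e", network))  # 2
```

In this example the outlet reach e stays at order 2 because an order-2 stream joined by an order-1 stream does not increase in order.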

The distributions of the landslide dimension parameters indicate that a larger volume and longer run-out distance of a landslide can contribute to conditions prone to LDam formation. However, this depends on the specific landslide types and LDam locations. The H/L ratio is a popular parameter for measuring the mobility of a landslide, even though it is still under discussion within the long-runout landslide literature (Iverson 1997; Legros 2002; Shanmugam and Wang 2015), and its mechanical meaningfulness as a description of friction has been refuted several times (Hsü 1975; Davies 1982; Dufresne and Geertsema 2020). The landslide H/L ratio shows a Weibull distribution, with most H/L ratios concentrated in the range from 0.12 to 0.6, which indicates the contribution of relatively long run-out landslides to LDam formation. Typical landslides have an H/L ratio smaller than 0.5, with some well-studied examples reaching values of 0.1 to 0.2 (Iverson 2015); this work on debris flows demonstrates that the run-out distance can be considerable. For example, the debris flows, debris avalanches, and rock slides in west central British Columbia have H/L values between 0.1 and 0.5 (Geertsema et al. 2009). Scheidegger (1973) described a general trend of decreasing H/L ratio with increasing volume and suggested that some obstructed mudslides and earth flows will have a lower H/L ratio than other landslides. The fact that the majority of landslide movement types that formed LDams are slide (> 28%), complex (> 24%), and flow (19%) can affect this distribution because the H/L ratio is highly influenced by landslide type.

The distribution of the volume of landslides that block the valley is exponential, with 68% of the data concentrated at volumes of less than 10 million cubic meters. Among all the data, it is interesting to note that in about 17% of LDam records (132 out of 779 records), landslides with a small volume (< 1 million m3, a volume threshold for long run-out rock avalanches/slides defined and applied by Glastonbury and Fell 2008; Evans et al. 2011; Davies and McSaveney 2012; Robinson et al. 2015; Chunyuk et al. 2020; Fan et al. 2020) also play an important role in contributing to the total number of records.

The data distribution of landslide geomorphological characteristics could be the result of various reasons:

  1) the proportion of larger landslides is relatively small in LDam studies. However, compared with those in general landslide studies, the landslides causing LDams with a volume larger than 1 million cubic meters account for a larger proportion of the total (75%). In the study of Guzzetti et al. (2009), the landslides with a large volume took up approximately 29% of 667 landslides in Umbria, central Italy. In the national scale landslide dataset of Slovenia, the large volume landslides account for approximately 0.4% of the total (Komac and Hribernik 2015);

  2) the result is affected by both the river width and the H/L ratio distributions, because blocking a wider river requires a larger volume of landslide material, which may require a longer landslide runout distance and hence a lower H/L ratio. However, in previous studies, the increase in L/H (the inverse of the H/L ratio) with increasing volume of landslides was not observed for both small and large landslides (Roback et al. 2018), and Okura et al. (2003) reported that there is no correlation between the volume and H/L ratio for shallow landslides under the volume of \(10^{3}\)–\(10^{4}\) m3;

  3) the result could be affected by landslide types and the actual volume of the LDam blocking the river. For some landslide types, the volume that actually blocks the river is much smaller than the total landslide volume (~ 10% for the cases covered in Miller et al. 2018).

Empirical relationships between parameters

The relationship between landslide volume and landslide area, based on the LDam records collected in this research, is shown in Fig. 5. As landslide type is a principal factor for determining the relationships between landslide runout and volume (Legros 2002), we have measured the relationships grouped by different landslide movement types. As shown in Eq. 2, the relationship between landslide area and volume is presented as a scaling relationship. Previous studies (Guzzetti et al. 2009; Larsen et al. 2010; Fan et al. 2012, 2014) have shown its applicability to a broad range of landslide types.

$$\mathrm{V}=\mathrm{\alpha }{\mathrm{A}}^{\Upsilon }$$
(2)

where \(\mathrm{V}\) is the landslide volume (m3), \(\mathrm{\alpha }\) is the intercept, \(\mathrm{A}\) is the landslide area (m2), and \(\Upsilon\) is the scaling exponent. A previous study by Larsen et al. (2010) established a global prediction equation for the power-law relationship between landslide volume and landslide area, based on more than 4000 landslides in both soil and rock collected globally. The value of \(\Upsilon\) derived for landslides in soil is 1.1–1.3. For landslides that occurred in rock masses, the range of values is 1.3–1.6. This relationship has been directly applied to both landslide and LDam studies on the regional scale for calculating landslide volumes (Fan et al. 2014; Tacconi Stefanelli et al. 2018). The smaller scaling exponent of landslides that dam rivers, with a range from 0.66 to 0.97, indicates that these landslides typically have smaller volumes compared with landslides more generally (i.e., including those that do not cause LDams). This could be because landslides that dam rivers have shallower erosion depths or occur on steeper slopes compared with general landslides. However, as the scaling exponent varied with slope materials in the global V-A scaling study (Larsen et al. 2010), it is reasonable to expect that most of the coefficients of determination (r2) in these relationships are relatively low (< 0.5) because of the vast combinations of landslide mass materials, patchy data sources, and different triggering mechanisms of landslides.
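One common way to estimate the intercept and scaling exponent of Eq. 2 is an ordinary least-squares fit in log-log space. The sketch below shows that approach as an illustration only: the fitting method behind the reported exponents is not stated in the text, the function name is hypothetical, and the data are synthetic values generated from a known power law rather than RAGLAD records.

```python
import math


def fit_power_law(areas_m2, volumes_m3):
    """Fit V = alpha * A**exponent by least squares on log10 values.

    Returns (alpha, exponent, r2), where r2 is the coefficient of
    determination of the log-log linear fit.
    """
    xs = [math.log10(a) for a in areas_m2]
    ys = [math.log10(v) for v in volumes_m3]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = sxy / sxx
    alpha = 10 ** (mean_y - exponent * mean_x)
    r2 = sxy ** 2 / (sxx * syy)
    return alpha, exponent, r2


# Synthetic data from V = 2 * A**0.8, used only to check the fit recovers it.
areas = [1e3, 1e4, 1e5, 1e6]
volumes = [2 * a ** 0.8 for a in areas]
alpha, exponent, r2 = fit_power_law(areas, volumes)
print(f"alpha = {alpha:.2f}, exponent = {exponent:.2f}, r2 = {r2:.2f}")
```

With real, noisy records the r2 of such a fit would be much lower than in this synthetic check, which is consistent with the low coefficients of determination discussed above.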

Fig. 5

Correlation between landslide volume and landslide area for 260 records with valid data, categorized by landslide types. a Slide. b Flow. c Fall. d Complex. e Long run-out debris flow (Legros 2002)

To determine whether a landslide can actually form a LDam, it is vital to know the empirical relationship between landslide volume and river width from current records. Empirical relationships among geomorphological parameters, especially those between landslide volume and area, have been applied in establishing geomorphological indexes for evaluating LDam formation probability and stability (Tacconi Stefanelli et al. 2018). The correlation between landslide volume and river width revealed similar relationships across the categorized landslides that formed LDams, but the relationships are not clear, with a lot of noise in the data (Fig. 6). LDam formation indexes from previous research, such as the Morphological Obstruction Index (MOI), also use the relationship between valley width and landslide volume (Tacconi Stefanelli et al. 2016). However, we have to point out that there is a difference between valley width and river width. River (channel) width is based on the mean annual flow (Frasson et al. 2019), while valley width is derived from the morphology of the whole valley rather than just the river, so valley width is always wider than river channel width. One of the reasons for the noise may be the uncertainties in the original records. Nonetheless, although a clear correlation between landslide volume and river width cannot be established, the data still reveal a potential threshold for LDam formation (Fig. 7). Only one outlier from the dataset in Italy (Tacconi Stefanelli et al. 2016) is above this threshold line, and it dammed a valley of 400 m width with 9000 m3 of debris. No further information was found regarding this outlier. The threshold can be applied to explore the LDam formation prone areas where landslides can generate a sufficient volume of mass to block the river, i.e., the minimum landslide volume that forms a LDam can be calculated from a given river width.

Fig. 6

Correlation between landslide volume and river width (as defined in Table 1) for 250 records with valid data, categorized by landslide types. a Slide. b Flow. c Fall. d Complex

Fig. 7

Potential threshold for LDam formation based on the relationship between landslide volume and river width
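How a threshold of this kind might be applied can be sketched in a few lines. The sketch below assumes that the threshold is a straight line in log-log space, i.e. a power law W = a * V^b between the maximum blockable river width W and the landslide volume V. Both this functional form and the coefficients `A_PLACEHOLDER` and `B_PLACEHOLDER` are illustrative assumptions; the equation of the threshold line is not given here, so the numbers must be replaced with fitted values before any real use.

```python
# Hypothetical placeholder coefficients: NOT taken from the paper.
# They only make the sketch runnable and must be replaced by fitted values.
A_PLACEHOLDER = 1.0  # width (m) associated with a volume of 1 m^3
B_PLACEHOLDER = 0.5  # slope of the threshold line in log-log space


def max_blockable_width(volume_m3, a=A_PLACEHOLDER, b=B_PLACEHOLDER):
    """Widest river (m) a landslide of this volume could block,
    assuming a power-law threshold W = a * V**b."""
    if volume_m3 <= 0:
        raise ValueError("volume must be positive")
    return a * volume_m3 ** b


def min_damming_volume(width_m, a=A_PLACEHOLDER, b=B_PLACEHOLDER):
    """Invert the assumed threshold: smallest landslide volume (m^3)
    that could block a river of the given width (m)."""
    if width_m <= 0:
        raise ValueError("width must be positive")
    return (width_m / a) ** (1.0 / b)


def is_within_threshold(volume_m3, width_m, a=A_PLACEHOLDER, b=B_PLACEHOLDER):
    """True if the (volume, width) pair lies on or below the threshold line."""
    return width_m <= max_blockable_width(volume_m3, a, b)


if __name__ == "__main__":
    # Minimum volume needed to block a 100 m wide river under the placeholders.
    print(min_damming_volume(100.0))
```

Expressing the threshold as an invertible function allows the same fitted line to answer both questions: which rivers a given landslide could block, and how much material is needed to block a given river.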

Global fluvial datasets parameter results

When the records are combined with the global fluvial datasets (GFDs), the drainage area data show three similar Weibull distributions, derived from (1) the original LDam records only; (2) GFD data only; and (3) the RAGLAD dataset combined with the GFD data (Fig. 8). The drainage area data collected from the GFDs contain more values for areas of less than 20 km2, which are lacking in the original LDam records. Zooming into these smaller upstream catchments, we can also see that the data gathered from the GFDs appear only as integer values (owing to the GFD format), which could lead to larger differences in data values and distribution for narrower rivers (drainage area < 20 km2). In contrast to the very similar distributions for drainage area, the distributions of river width from the various data sources are more diverse (Fig. 9). The differences occur mainly in rivers with a width of less than 50 m, whose width data were mainly collected from the GFDs. One of the most significant reasons for this difference between drainage area and river width is therefore the amount of valid data for the two parameters: 627 records contain valid data on drainage area, whereas only 303 records have a valid river width value from the original records.

Fig. 8

Data distribution of drainage area from RAGLAD, GFD, and RAGLAD with the supplement of the GFD (bin interval of 10); the inset shows the data distribution in the first bar in more detail (bin interval of 1)

Fig. 9

Data distribution of river width from RAGLAD, GFD, and RAGLAD with the supplement of GFD and data derived from empirical relationship to drainage area; inset shows the data distribution of river width from 0 to 400 m
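To make the comparison of Weibull-type distributions more concrete, the sketch below fits a two-parameter Weibull distribution to a list of positive values, such as drainage areas. The fitting method (median-rank regression with Bernard's approximation) is an assumption chosen for simplicity, since the fitting procedure is not described here; the function name `fit_weibull` is likewise illustrative.

```python
import math


def fit_weibull(samples):
    """Fit a two-parameter Weibull distribution by median-rank regression.

    The CDF F(x) = 1 - exp(-(x / scale)**shape) is linearized as
        ln(-ln(1 - F)) = shape * ln(x) - shape * ln(scale)
    and a straight line is fitted by ordinary least squares.
    Returns (shape, scale).
    """
    xs = sorted(samples)
    n = len(xs)
    if n < 2 or xs[0] <= 0:
        raise ValueError("need at least two strictly positive samples")
    # Bernard's approximation of the median rank of the i-th ordered sample.
    ranks = [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]
    u = [math.log(x) for x in xs]
    v = [math.log(-math.log(1.0 - f)) for f in ranks]
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    sxx = sum((a - mean_u) ** 2 for a in u)
    if sxx == 0:
        raise ValueError("samples must not all be identical")
    sxy = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    shape = sxy / sxx
    intercept = mean_v - shape * mean_u
    scale = math.exp(-intercept / shape)
    return shape, scale
```

Fitting the three data sources separately (original records, GFD only, and the combined set) and comparing the resulting shape and scale parameters would give a compact numerical counterpart to the visual comparison in the figures.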

Both the river width data from the GFDs and the values derived from the empirical relationship between river width and drainage area fill a significant gap in the current LDam datasets, particularly for river widths of less than 100 m (Fig. 9). The data distribution after combination remains very similar to the original, with the differences concentrated at the smaller river scales. The data gap between LDam records and GFDs is concentrated in small rivers and catchments, which also highlights the need to improve GFD performance in smaller catchments. We also compared the data extracted from the different data sources for each record (Fig. 10). For those LDam records with a valid river width value, the GFD data gathered from MERIT Hydro perform better than the river width estimated from drainage area using the empirical relationship between drainage area and river width (Frasson et al. 2019).

Fig. 10

River width data comparison among LDam records, GFD, and data derived from the global W-A empirical relationship (Eq. 1 of Frasson et al. 2019)
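A comparison of this kind can be sketched as follows. The example assumes that the W-A relationship is a power law W = coeff * A^exponent; the coefficient and exponent used in the demonstration are hypothetical placeholders, not the values of Frasson et al. (2019), and all data values are invented for illustration. The mean absolute relative error is one possible performance measure and is an assumption of this sketch.

```python
def width_from_area(area_km2, coeff, exponent):
    """Estimate river width (m) from drainage area (km^2),
    assuming a power law W = coeff * A**exponent."""
    if area_km2 <= 0:
        raise ValueError("drainage area must be positive")
    return coeff * area_km2 ** exponent


def mean_abs_relative_error(recorded, estimated):
    """Mean of |x - x0| / x over paired values.

    Pairs with a missing value (None) or a non-positive recorded value
    are skipped, so each source is scored only where it has data.
    """
    errors = [abs(x - x0) / x
              for x, x0 in zip(recorded, estimated)
              if x is not None and x0 is not None and x > 0]
    if not errors:
        raise ValueError("no valid pairs to compare")
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Invented example values, NOT taken from RAGLAD or any GFD.
    recorded_width = [40.0, 100.0, 20.0]   # widths in the LDam records (m)
    gfd_width = [30.0, 110.0, None]        # GFD widths; None = no channel found
    area = [100.0, 400.0, 25.0]            # drainage areas (km^2)
    COEFF, EXPONENT = 2.0, 0.5             # hypothetical placeholder coefficients
    wa_width = [width_from_area(a, COEFF, EXPONENT) for a in area]
    print("GFD error:", mean_abs_relative_error(recorded_width, gfd_width))
    print("W-A error:", mean_abs_relative_error(recorded_width, wa_width))
```

Because the two sources do not cover the same records, a stricter comparison would restrict both scores to the records for which every source has a value.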

Discussion

We have presented the data distributions and parameter correlations obtained from our newly established RAGLAD dataset. Gaps in the river morphology parameters have been filled, where possible, with values derived from global fluvial datasets. There is obvious noise in the data when geomorphological parameters are correlated on the basis of the original LDam records. The results also show that supplementing the records with GFD data performs better than calculating a parameter value from established, generalized global empirical relationships. However, the data combination can alter the data distribution when the valid data in the original records are limited (fewer than 50% of records with valid data). Our results indicate that, although GFDs are a better supplementary source for current records than empirical relationships with other geomorphological parameters, some uncertainties affect the data combination and can reduce the accuracy of the resulting data.

A possible reason for inaccuracies is recording bias and uncertainty in the original LDam records, including inaccurate and imprecise spatial information or geomorphological data introduced during the recording process. Depending on the quality of data sampling and the age of the event, spatial inaccuracy and imprecision of the records can become one of the most important sources of uncertainty when dimension data are taken from other data sources, because inaccurate data can then be linked to the record (Tacconi Stefanelli et al. 2016). Some of the geomorphological parameters in the records, such as landslide volume and landslide area, were estimated from empirical relationships or obtained from remote sensing data, because the data could not be obtained before the LDam disappeared (Costa and Schuster 1988; Fan et al. 2020). The records in RAGLAD cover various landslide types and sizes, which can also lead to uncertainty about the LDam formation mechanism if it results from more complex combined processes. Since many records in RAGLAD are clustered in smaller catchments, and smaller catchments have less persistent river flow fluctuations (Hirpa et al. 2010), the temporal gap may produce a larger difference between the data obtained from the literature and the data from actual event investigations. Inconsistent terminology across the literature on LDams and landslides can also cause confusion during data acquisition. For instance, the landslide volume may refer either to the volume of the whole landslide or to the landslide volume excluding the LDam body (Korup 2004), and the length and width of a LDam body may be used interchangeably in different studies (Costa and Schuster 1988). Additionally, data collected from various sources or recorded in different data formats, inconsistent units, and spatial references can increase the uncertainties of data collation.

As GFDs contain hydro-morphological data with global coverage, they can provide a reliable source for supplementing the original records. However, another possible cause of inaccurate data entries is the GFDs themselves. Although GFDs have been developed over the past decade with a more precise representation of river networks, some limitations remain: (1) GFDs generally apply single flow direction methods, such as the D8 algorithm, to generate the flow direction map, and thus contain no channel bifurcations, because flows in the upstream areas are concentrated into a single distinct flow direction (Tarboton 1997; Seibert et al. 2007); (2) most of the geomorphological characteristics or relationships between fluvial parameters in GFDs are recorded as mean annual values, and a temporal gap may exist between the time of LDam formation and the time of data collection, resulting in differences between GFD data and actual LDam events; (3) small rivers are still poorly represented because of the limited horizontal spatial resolution (pixel size of the raster grids) of the DEMs from which these GFDs are derived (Yamazaki et al. 2014). These limitations can lead to underestimation or overestimation of the related fluvial parameters. For example, Fig. 11a shows a case in which the GFD data did not contain the river channels where some records were located. Figure 11b shows the landslides in Peilong valley, which recurred in the same place several times in the 1980s (Li et al. 2020) and actually blocked a nearby tributary rather than the mainstream represented in the GFD data; with automated geolocation methods, the fluvial parameters could therefore be inaccurately linked to the mainstream instead of the tributary.
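The first limitation can be illustrated with a minimal sketch of the D8 idea: every cell of a DEM drains to the single neighbour with the steepest downward slope, so a cell can never split its flow between two channels. The grid values below are invented, and the function is a simplified illustration (no pit filling or flat resolution), not the processing chain used to build any particular GFD.

```python
import math

# Offsets of the eight neighbours as (row step, column step).
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1),
              (0, -1),           (0, 1),
              (1, -1),  (1, 0),  (1, 1)]


def d8_flow_direction(dem, cell_size=1.0):
    """Return, for each cell, the (row, col) of the one neighbour it drains to.

    The receiver is the neighbour with the steepest positive slope, where
    diagonal neighbours are sqrt(2) cell sizes away. Cells with no lower
    neighbour (pits or flats) get None.
    """
    rows, cols = len(dem), len(dem[0])
    receivers = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            best_slope = 0.0
            for dr, dc in NEIGHBOURS:
                nr, nc = r + dr, c + dc
                if not (0 <= nr < rows and 0 <= nc < cols):
                    continue  # neighbour lies outside the grid
                distance = cell_size * math.hypot(dr, dc)
                slope = (dem[r][c] - dem[nr][nc]) / distance
                if slope > best_slope:
                    best_slope = slope
                    receivers[r][c] = (nr, nc)
    return receivers


if __name__ == "__main__":
    # Invented 3 x 3 elevation grid with a valley running down the middle column.
    dem = [[9, 8, 9],
           [8, 5, 8],
           [9, 3, 9]]
    for row in d8_flow_direction(dem):
        print(row)
```

Since each cell stores at most one receiver, the derived network is a tree: channels can merge downstream but cannot bifurcate, which is the behaviour described for single flow direction methods.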

Fig. 11

Examples of uncertainty due to the limitations of GFDs, namely missing representation of features and inaccurate data assembly due to imprecise location: (a) missing fluvial channel representation in small catchments; (b) inaccurate data assembly at channel junctions due to the imprecision of spatial information
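The junction problem can be reproduced with a minimal nearest-point linking sketch. It assumes that each LDam record and each GFD river point carries a latitude and longitude, and that automated geolocation simply picks the closest river point by great-circle (haversine) distance; the dictionary keys, coordinates, and widths are hypothetical and do not describe the actual RAGLAD workflow or any real location.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def link_to_nearest(record, river_points):
    """Return (nearest river point, distance in km) for one LDam record."""
    if not river_points:
        raise ValueError("no river points to link to")

    def distance_to(point):
        return haversine_km(record["lat"], record["lon"],
                            point["lat"], point["lon"])

    nearest = min(river_points, key=distance_to)
    return nearest, distance_to(nearest)


if __name__ == "__main__":
    # Invented coordinates: the dam sits on a tributary, but its recorded
    # position is slightly closer to a mainstream point.
    record = {"lat": 30.0, "lon": 95.0}
    river_points = [
        {"name": "mainstream", "lat": 30.0, "lon": 95.001, "width_m": 120.0},
        {"name": "tributary", "lat": 30.0, "lon": 95.01, "width_m": 15.0},
    ]
    point, dist = link_to_nearest(record, river_points)
    print(point["name"], round(dist, 3))
```

In this invented case the record inherits the mainstream width even though the dam blocked the tributary, which is why records near channel junctions benefit from manual checking.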

The uncertainty in assembling data from LDam records with GFDs can be summarized in two aspects: (1) the limitations of GFDs; (2) uncertainty caused by geolocation error in the records. To assess which aspect contributes more to the data assembly uncertainties, we use the relative error to measure the difference between the two datasets. The relative error (\(\delta x\)) is defined in Eq. 3:

$$\delta x=\frac{x-{x}_{0}}{x}$$
(3)

where \(x\) is the data value from the original LDam records and \({x}_{0}\) is the estimated value (data from GFDs). The relative errors of drainage area and river width do not increase proportionally with the distance from the LDam record to the nearest GFD data point (Fig. 12). The noise observed in the plot suggests that the spatial precision of the records may not be the major cause of the data difference.

Fig. 12

Correlation between the relative errors of drainage area and river width, and the distance from LDam records to the nearest GFD point
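The relative error of Eq. 3 and its relationship with linking distance can be computed with a short sketch such as the one below. The use of the Pearson coefficient as the measure of correlation is an assumption of this sketch, and the sample values are invented for illustration.

```python
import math


def relative_error(x, x0):
    """Relative error as in Eq. 3: (x - x0) / x, where x is the recorded
    value and x0 is the estimate (for example, the GFD value)."""
    if x == 0:
        raise ValueError("recorded value must be non-zero")
    return (x - x0) / x


def pearson_r(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("sequences must not be constant")
    return cov / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    # Invented values: recorded widths, GFD widths, and linking distances (km).
    recorded = [40.0, 100.0, 20.0, 60.0]
    gfd = [30.0, 110.0, 35.0, 58.0]
    distance_km = [0.2, 1.5, 0.4, 2.0]
    errors = [abs(relative_error(x, x0)) for x, x0 in zip(recorded, gfd)]
    print(pearson_r(distance_km, errors))
```

A coefficient close to zero would be consistent with the interpretation that distance to the nearest GFD point does not drive the relative error, whereas a value near one would point to geolocation precision as the dominant cause.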

Conclusions

Linking supplementary data to geolocated LDam datasets has been highly recommended in previous LDam research (e.g., Tacconi Stefanelli et al. 2016, 2018; Fan et al. 2020). We have created a geolocated landslide dam inventory and linked the records with other available fluvial data sources to supplement the parameters in the data. Our new global LDam dataset, RAGLAD, contains 779 records. The records contain spatial coordinates, time information, dam materials, geomorphological characteristics of catchments, dimensions of landslides and impounded lakes, and hydrographic characteristics of subsequent flood events and their consequent damage. The data were compiled from literature mainly in three languages. The geospatial accuracy of the records in RAGLAD was enhanced to reduce the uncertainty when linking with global fluvial datasets. RAGLAD can be useful for researchers and global communities to further explore and discuss LDam formation and risks, and the data are openly available through the authors.

For the first time, we supplement LDam records with auxiliary data from GFDs. The results show that GFDs are an acceptable supplementary data source for presenting a more comprehensive distribution of geomorphological parameters. By exploring the limitations, we found that the difference between actual data and combined data is caused by the poor representation of fluvial channels in small catchments in the GFDs and by uncertainties in the original records. The data combination results can be expected to improve as global datasets improve in the future.

Analysis of the resulting database reveals that LDam records are clustered in mountainous areas, similar to the distribution of landslides in general (Nadim et al. 2006; Petley 2012; Froude and Petley 2018). The geomorphological parameter relationships show that the scaling exponents relating landslide volume to landslide area compare well with those identified in previous landslide research. We have also identified a potential threshold based on the relationship between landslide volume and river width, which could be useful for exploratory estimation of LDam formation risk.

This study is the first attempt to publish a global geolocated LDam dataset and link it with global fluvial datasets to fill data gaps in fluvial parameters. Our research also recommends treating the LDam as an independent hazard type, as the landslides that formed LDams have different geomorphological characteristics, such as H/L ratio and landslide volume, from landslides in general. More effort is required to study the specific triggers of LDam formation and the geomorphological settings that are susceptible to it. Additionally, further analysis, such as the evaluation of landslide dam formation susceptibility, can be carried out based on this dataset.