1 Introduction

Landslides occur all around the world and cause thousands of deaths and billions of dollars in infrastructural damage every year [1]. However, landslide events are often under-reported and insufficiently documented because they are complex natural phenomena governed by various intrinsic and external conditioning and triggering factors, such as earthquakes and tropical storms, which are usually more conspicuous and hence more widely reported [2]. Due to this oversight and the lack of global data inventories for studying landslides, Froude and Petley assert that any attempt to quantify global landslide hazards and the associated impacts is destined to be an underestimation [3].

Existing landslide detection and mapping solutions typically rely on data from ground sensors or satellites. While sensor-based approaches can achieve high accuracy at sub-catchment levels by monitoring land characteristics such as rainfall, altitude, soil type, and slope [4, 5], their global-scale deployment is impractical. Satellite-based approaches can provide more scalable solutions by analyzing Synthetic Aperture Radar (SAR) or optical imagery [6, 7]. However, their deployment can still prove costly and time-consuming. Furthermore, satellite data are susceptible to noise such as clouds.

Using Volunteered Geographical Information (VGI) as an alternative approach, NASA launched a website in 2018 to allow citizens to report regional landslides they see in person or online [8]. Subsequent studies developed other means, such as mobile apps, to collect citizen-provided data [9, 10]. However, these studies assume active participation of volunteers to collect landslide data and still require time-consuming work by specialists who engage directly with the volunteers and interpret the received data [11].

To alleviate the need for opt-in participation and manual processing, we develop a state-of-the-art AI model that can automatically detect landslides from social media images in real time. To achieve this goal, we first create a large image dataset comprising 11,737 images from various data sources annotated by domain experts following a data-centric AI approach described by Whang et al. [12]. We then use this dataset in a comprehensive set of experiments that search for the optimal landslide model configuration (as in [13, 14]). This exploration reveals interesting insights about the model training process. The optimal landslide model achieves an accuracy of 90.6% on the validation set, 87.0% on the held-out test set, and a striking 97.7% when applied to the real-time Twitter image stream in the wild.

Based on this model, we envision a system that can harvest global landslide data and facilitate further research for building global landslide susceptibility maps as suggested in [15, 16].

We make the following contributions:

  • We collected the largest dataset of ground-level landslide images to date.

  • We followed a data-centric AI approach to iteratively improve the quality of the dataset.

  • We conducted the most comprehensive experiments to date for training deep learning models for landslide recognition.

  • We built a prototype system and deployed our landslide detection model in the real-world to assess its performance in the wild.

  • The prototype system offers global scalability by leveraging social media data as a form of passive (i.e., opportunistic) crowdsourcing.

The rest of the paper is organized as follows. Section 2 reviews the relevant literature, Sect. 3 introduces the dataset, Sect. 4 describes the model training experiments, Sect. 5 summarizes the experimental results and findings, Sect. 6 presents the real-world deployment, Sect. 7 provides a discussion on existing limitations and future work, and finally, Sect. 8 concludes the paper.

2 Related work

The literature on landslide detection and mapping approaches mainly uses four types of data sources: (i) physical sensors, (ii) remote sensing, (iii) volunteers, and (iv) social networks. Sensor-based approaches rely on land characteristics such as rainfall, altitude, soil type, and slope to detect landslides and develop models to predict future events [4, 5]. While these approaches can be highly accurate at sub-catchment levels, their large-scale deployment is extremely costly.

Earth observation data obtained using high-resolution satellite imagery have been widely used for landslide detection, mapping, and monitoring [6, 7]. Remote sensing techniques use either Synthetic Aperture Radar (SAR) or optical imagery to identify landslides following various approaches, from image classification [17, 18] and segmentation [19, 20] to object detection [21, 22]. While remote sensing through satellites can be useful to monitor landslides globally, its deployment can prove costly and time-consuming. Moreover, satellite data are susceptible to noise such as clouds.

A few studies demonstrate the use of Volunteered Geographical Information (VGI) as an alternative method to detect landslides [9, 23, 24, 25]. These studies assume active participation of volunteers to collect landslide data, where the volunteers opt in to use a mobile app to provide information such as photos, time of occurrence, damage description, and other observations about a landslide event. In order to validate landslide photos collected by the volunteers, Can et al. present an image classification model based on Convolutional Neural Networks (CNN) trained on a relatively small in-house dataset [24]. In contrast, our work aims to capitalize on massive social media data without any active participation requirement and with better scalability. In addition, we construct a much larger dataset to train deep learning models and perform more extensive experimental evaluations.

Social media data have been used in many humanitarian contexts ranging from general social analytics [26] and geospatial sentiment analysis [27] to incident detection [28] and rapid damage assessment [29], including multimodal approaches [30]. However, their use for landslide detection has not been explored extensively. To the best of our knowledge, no prior work has explored the use of social media imagery to detect landslides. The most relevant studies by Musaev et al. combine social media text data and physical sensors to detect landslides [31, 32]. Specifically, they use textual messages collected through a set of landslide-related keywords on Twitter, Instagram, and YouTube in combination with sensor data about seismic activity and rainfall to train a machine learning classifier that can identify landslide incidents. In this study, we focus on analyzing social media images, which can provide more detailed information about the impact of the landslide event. To that end, our work is orthogonal to prior art.

Finally, this paper is different from and complementary to our previous papers [15, 16] in the following ways. In [15], we present a narrative from a practitioner perspective that predominantly highlights existing limitations and challenges in landslide research and proposes a high-level methodology including data collection, processing, and annotation for an AI-based solution without going into technical details of the machine learning aspects of the problem. In [16], we focus on the system engineering aspects where we present building blocks of an online system that can ingest social media data, eliminate duplicate and irrelevant content as well as identify and geolocate landslide reports. We also provide proper latency and throughput benchmark results for each system component. The landslide detection model is covered very briefly in this context. In this paper, on the other hand, we elaborate on all the technical details about the machine learning model development aspects of the problem through an extensive experimentation in search for the optimal model selection and training configuration. To ensure the paper is self-contained, we recapitulate the most relevant parts of our prior works here very briefly.

Fig. 1 Example images from the dataset

3 Dataset

To train models that can detect landslides in images, we curated a large image dataset from multiple sources with diverse characteristics. We collected some images from the Web using Google Image search with keywords such as landslide, landslip, earth slip, mudslide, rockslide, and rock fall, and some images from Twitter using similar landslide-related hashtags. We obtained additional images captured by landslide specialists during field trips. The images obtained from social media or the Web are usually noisy and can include duplicates. Similarly, the images captured during field trips are not always useful for model training.

Therefore, the collected data were manually labeled by three landslide experts, who are also co-authors of this study, following a data-centric AI approach [12]. This approach focuses on the data pipeline, which typically involves (i) curating a dataset for labeling based on model performance after every iterative cycle to address the model’s specific weaknesses and (ii) significantly increasing performance with a relatively small amount of training data, as elaborated in [15]. Since the AI task at hand is “given an image, recognize landslides” (i.e., no other external information or expert knowledge is available to the AI model), the experts were instructed to keep this computer-vision perspective in mind and label only the most evident cases as “landslide” images (i.e., the images where the landslide is the main theme exhibiting substantial visual cues for the model to learn from). On the other hand, since our ultimate goal is to develop a system that will continuously monitor the noisy social media streams to detect landslide events in real time, we retained in the final dataset negative (i.e., not-landslide) images that illustrate completely irrelevant cases (e.g., cartoons, advertisements, selfies) as well as difficult scenarios such as post-disaster images from earthquakes and floods, in addition to other natural scenes without landslides.

The complete dataset creation process includes several rounds of model training, error analysis, expert discussions, and label updates. The final dataset contains 11,737 images. Some example images are shown in Fig. 1. The distribution of images across data sources is summarized in Table 1 and the data splits are presented in Table 2. As suggested by Table 2, only about 23% of the images are categorized as “landslide.” Our dataset is currently the largest dataset for landslide recognition from ground-level images.

To assess the quality of the final labels, we measured the inter-annotator agreement using two statistical measures: Fleiss’ Kappa [33] and percentage agreement (observer agreement). Despite the inherent difficulty of the task, the experts achieved an overall Fleiss’ Kappa of 0.58, which lies at the upper end of the range commonly interpreted as moderate agreement and just below the range for substantial agreement. They also achieved a percentage agreement of 76%, which is only slightly below the 80% mark set as a rule of thumb by Bayerl and Paul [34].
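To make the two agreement measures concrete, the following is a minimal sketch of how Fleiss' Kappa and a mean pairwise percentage agreement could be computed for three raters and two classes. The function name, the toy ratings, and the choice of mean pairwise agreement as the definition of percentage agreement are assumptions made for illustration; they are not taken from the annotation protocol itself.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Return (kappa, observed_agreement) for equally sized rating lists.

    ratings: one list of labels per image, with one label per rater.
    categories: all labels that raters may assign.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # Per-image counts of how many raters chose each category.
    counts = [Counter(item) for item in ratings]
    # Observed agreement per image: fraction of rater pairs that agree.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items
    # Chance agreement derived from the overall category proportions.
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p ** 2 for p in proportions)
    kappa = (p_observed - p_expected) / (1 - p_expected)
    return kappa, p_observed


# Hypothetical toy ratings: four images, three raters each.
CATEGORIES = ["landslide", "not-landslide"]
toy_ratings = [
    ["landslide", "landslide", "landslide"],
    ["not-landslide", "not-landslide", "not-landslide"],
    ["landslide", "landslide", "not-landslide"],
    ["not-landslide", "not-landslide", "not-landslide"],
]
kappa, agreement = fleiss_kappa(toy_ratings, CATEGORIES)
print(f"Fleiss' kappa = {kappa:.3f}, pairwise agreement = {agreement:.3f}")
```

Kappa discounts the agreement expected by chance, which is why it can be noticeably lower than the raw percentage agreement on an imbalanced label distribution such as this one.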

Table 1 Distribution of images across data sources
Table 2 Data splits (70:10:20)

4 Landslide model

Many computer vision tasks have greatly benefited from the recent advances in deep learning. The features learned in deep convolutional neural networks (CNNs) are proven to be transferable and quite effective when used in other visual recognition tasks [35,36,37], particularly when training samples are limited. Considering we also have limited training examples for data-hungry deep CNNs, we follow a transfer learning approach to adapt the features and parameters of the network from the broad domain (i.e., large-scale image classification) to the specific one (i.e., landslide classification). However, it is often overlooked how complex the transfer learning setup can become with all different possible configurations and hyperparameters to tune for optimal performance. To this end, [13, 14] present exemplary studies on empirical analysis of the impact of different training strategies on the performance of ResNet architecture where they explore training recipes with different loss functions, data augmentation, regularization, and optimization techniques, among others. Inspired by these studies, we conduct extensive experiments where we train several different deep CNN architectures using different optimizers, learning rates, weight decays, and class balancing strategies.

CNN Architecture The CNN architecture (arch) plays a significant role in the performance of the resulting model depending on the available data size and problem characteristics. Therefore, we explored a representative sample of well-known CNN architectures including VGG16 [38], ResNet18, ResNet50, ResNet101 [39], DenseNet [40], InceptionNet [41], and EfficientNet [42], among others.

Optimizer An optimizer (opt) is an algorithm or method that changes the attributes of a neural network (e.g., weights and learning rate) in order to reduce the optimization loss and to increase the desired performance metric (e.g., accuracy). In this study, we experimented with the most popular optimizers, i.e., Stochastic Gradient Descent (SGD) and Adam [43].

Learning rate Learning rate (lr) controls how quickly the model is adapted to the problem. A learning rate that is too large can cause the model to converge too quickly to a sub-optimal solution, whereas a learning rate that is too small can cause the training process to get stuck. Since learning rate is one of the most important hyperparameters and setting it correctly is critical for real-world applications, we performed a grid search over a large range of values (i.e., \(\{10^{-2},10^{-3},10^{-4},10^{-5},10^{-6}\}\)).

Weight decay Weight decay (wd) controls the regularization of the model weights, which in turn, helps to avoid overfitting of a deep neural network on the training data and improve the performance of the model on the unseen data (i.e., better generalization ability). In light of this, we experimented with a large range of weight decay values (i.e., \(\{10^{-2},10^{-3},10^{-4},10^{-5}\}\)).

Class balancing An imbalanced dataset can bias the prediction model toward the dominant class (i.e., not-landslide) and lead to poor performance on the minority class (i.e., landslide), which is not ideal for our application. The approaches to tackle this problem range from generating synthetic data to using specialized algorithms and loss functions. Here, we explored one of the basic approaches, i.e., data resampling, where we oversampled images from the landslide class (i.e., sampling with replacement) to create a balanced training set.
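The resampling idea can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: the helper name `oversample_minority`, the fixed seed, and the toy file names are hypothetical, and the 23/77 split merely mirrors the approximate class ratio of the dataset. It draws minority-class samples with replacement until both classes have the same number of training examples.

```python
import random


def oversample_minority(samples, labels, minority_label, seed=0):
    """Return a shuffled list of (sample, label) pairs with balanced classes."""
    rng = random.Random(seed)
    minority = [s for s, y in zip(samples, labels) if y == minority_label]
    n_majority = len(samples) - len(minority)
    if not minority:
        raise ValueError("no samples found for the minority class")
    # Number of extra minority samples needed to match the majority class.
    n_extra = max(n_majority - len(minority), 0)
    # random.choices samples with replacement, so images may repeat.
    extra = rng.choices(minority, k=n_extra)
    balanced = list(zip(samples, labels))
    balanced += [(s, minority_label) for s in extra]
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy training set with roughly 23% landslide images.
toy_samples = [f"img_{i:03d}.jpg" for i in range(100)]
toy_labels = ["landslide"] * 23 + ["not-landslide"] * 77
balanced_set = oversample_minority(toy_samples, toy_labels, "landslide")
print(len(balanced_set))
```

Only the training split would be resampled in this way; the validation and test splits keep their natural class distribution so that the reported scores reflect the imbalance seen in practice.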

Other training details We ran all our experiments on Nvidia Tesla P100 GPUs with 16GB memory using the PyTorch library. We adjusted the batch size according to each CNN architecture in order to maximize GPU memory utilization. We used a fixed step size of 50 epochs in the learning rate scheduler of the SGD optimizer and a fixed patience of 50 epochs in the ‘ReduceLROnPlateau’ scheduler of the Adam optimizer, both with a factor of 0.1. All of the models were initialized using the weights pretrained on ImageNet [44] and trained for a total of 200 epochs. In total, we trained 560 CNN models in our quest for the best model configuration.
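The figure of 560 models is consistent with a full grid over the hyperparameters described above. The sketch below enumerates such a grid under the assumption that it is the Cartesian product of the seven named architectures, two optimizers, five learning rates, four weight decays, and two class balancing settings. It also includes a small helper, `step_lr`, that mirrors the step schedule described for SGD (multiply by 0.1 every 50 epochs). All identifiers are illustrative, and the plateau-based schedule used with Adam is not modeled because it depends on the validation metric.

```python
from itertools import product

# Assumed search space, taken from the values listed in this section.
ARCHS = ["VGG16", "ResNet18", "ResNet50", "ResNet101",
         "DenseNet", "InceptionNet", "EfficientNet"]
OPTIMIZERS = ["SGD", "Adam"]
LEARNING_RATES = [1e-2, 1e-3, 1e-4, 1e-5, 1e-6]
WEIGHT_DECAYS = [1e-2, 1e-3, 1e-4, 1e-5]
CLASS_BALANCING = [False, True]

# One dictionary per training run in the full grid.
configs = [
    {"arch": arch, "opt": opt, "lr": lr, "wd": wd, "balanced": balanced}
    for arch, opt, lr, wd, balanced in product(
        ARCHS, OPTIMIZERS, LEARNING_RATES, WEIGHT_DECAYS, CLASS_BALANCING
    )
]
print(len(configs))  # 7 * 2 * 5 * 4 * 2 = 560


def step_lr(base_lr, epoch, step_size=50, gamma=0.1):
    """Learning rate at a given epoch under a fixed step-decay schedule."""
    return base_lr * gamma ** (epoch // step_size)


# With 200 epochs, the rate is reduced three times (epochs 50, 100, 150).
schedule = [step_lr(1e-2, epoch) for epoch in (0, 50, 100, 150, 199)]
```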

Table 3 Top-10 configurations based on MCC on the validation set

5 Results

Due to limited space, Table 3 presents results only for the 10 top-performing model configurations on the validation set, ranked by the Matthews Correlation Coefficient (MCC), which is regarded as a balanced measure for imbalanced classification problems [45] and is defined by Eq. 1.

$${\text{MCC}} = \frac{{\text{TP}} \times {\text{TN}} - {\text{FP}} \times {\text{FN}}}{\sqrt{({\text{TP}} + {\text{FP}})({\text{TP}} + {\text{FN}})({\text{TN}} + {\text{FP}})({\text{TN}} + {\text{FN}})}},$$
(1)

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives. Besides MCC, we also compute common performance metrics such as Accuracy, Precision, Recall, and F1-score as defined by Eqs. 2–5, respectively.

$$\begin{aligned} \textrm{Accuracy} = \frac{\text {TP}+\text {TN}}{\text {TP}+\text {TN}+\text {FP}+\text {FN}}, \end{aligned}$$
(2)
$$\begin{aligned} \textrm{Precision} = \frac{\text {TP}}{\text {TP}+\text {FP}}, \end{aligned}$$
(3)
$$\begin{aligned} \textrm{Recall} = \frac{\text {TP}}{\text {TP}+\text {FN}}, \end{aligned}$$
(4)
$$\begin{aligned} \textrm{F1} = \frac{2\text {TP}}{2\text {TP}+\text {FP}+\text {FN}}. \end{aligned}$$
(5)
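As a worked illustration of Eqs. 1 to 5, the sketch below computes all five metrics from the four confusion matrix counts. The FP and FN values are the validation-set error counts quoted later in this section; the TP and TN values are assumptions chosen to be consistent with the reported validation scores, since the confusion matrix itself is not reproduced here. The function name is likewise illustrative.

```python
from math import sqrt


def classification_metrics(tp, tn, fp, fn):
    """Compute MCC, accuracy, precision, recall, and F1 from raw counts."""
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        # Eq. 1; defined as 0 here when any marginal count is zero.
        "mcc": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
        # Eq. 2
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        # Eq. 3
        "precision": tp / (tp + fp),
        # Eq. 4
        "recall": tp / (tp + fn),
        # Eq. 5
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# fp and fn follow the validation counts in the text; tp and tn are assumed.
scores = classification_metrics(tp=206, tn=858, fp=45, fn=65)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy stays high because true negatives dominate, while MCC and F1 respond more strongly to the errors on the minority landslide class. This is the reason MCC is used for ranking the configurations.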

The top-performing model configuration (i.e., arch: ResNet50, opt: Adam, lr: \(10^{-4}\), wd: \(10^{-3}\), no class balancing) achieves MCC=0.730, F1=0.789, and Accuracy=0.906, all deemed plausible by the specialists. Nevertheless, we investigate the full table of results and identify the following insights:

  • When everything but the optimizer is kept fixed, the models trained with the Adam optimizer outperform the models trained with the SGD optimizer (179 vs. 100). This is consistent with the general sentiment that the adaptive and stable nature of the Adam optimizer requires less effort to achieve convergence and attain superior training outcomes than the SGD optimizer.

  • Despite the fact that the top-performing model is trained without a class balancing strategy, the overall trend indicates that, with everything else kept the same, the models trained with class balancing yield better performance than those trained without class balancing (173 vs. 103). This is in line with the general understanding that class balancing can prevent the models from becoming biased toward the majority class and hence produce more accurate models.

  • The ResNet50 architecture tops the rankings among all CNN architectures by achieving the best average ranking as well as the highest mean MCC according to Table 4. Among the ResNet architectures, given that the training dataset is relatively small, ResNet18 appears to offer inadequate capacity for the problem at hand, whereas ResNet101 offers potentially more-than-enough capacity, which increases the risk of overfitting and hurts performance. However, the overall differences between architectures do not seem significant, except for InceptionNet, which yields markedly poorer performance than the others. This is potentially because the InceptionNet architecture generally requires more data to overcome possible overfitting, as well as more computational resources.

  • The impact of the learning rate on model performance shows opposite trends for different optimizers. As per Table 5, smaller learning rates (e.g., \(\{10^{-6},10^{-5},10^{-4}\}\)) seem to work better with the Adam optimizer whereas larger learning rates (e.g., \(\{10^{-2},10^{-3}\}\)) seem to work better with the SGD optimizer. This is because when the SGD optimizer is initialized with a very small learning rate, the training progress becomes very slow and tends to stagnate at a sub-optimal local minimum due to the scheduled learning rate updates at regular intervals. In contrast, the Adam optimizer typically operates better with a smaller learning rate since it ensures a more stable adaptation during training.

  • As expected, the value of the weight decay also impacts the overall performance significantly (in particular, for the Adam optimizer). A large weight decay (e.g., \(10^{-2}\)) hurts the overall performance which tends to improve as the weight decay takes on smaller values (see Table 6). This implies that larger weight decay values cause excessive regularization of the weights, which in turn, reduces the model’s ability to learn properly.
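The paired comparisons reported in the list above (e.g., Adam vs. SGD with everything else kept fixed) can be sketched as follows. This is an illustrative reconstruction rather than the analysis code behind the reported counts: the function name, the record layout, the toy MCC values, and the decision to count ties for neither side are all assumptions.

```python
from collections import defaultdict


def paired_wins(results, factor):
    """Count wins per value of `factor` among runs that differ only in it.

    results: list of dicts holding configuration keys plus an 'mcc' score.
    """
    groups = defaultdict(dict)
    for run in results:
        # Runs sharing every other setting fall into the same group.
        key = tuple(sorted(
            (k, v) for k, v in run.items() if k not in (factor, "mcc")
        ))
        groups[key][run[factor]] = run["mcc"]

    wins = defaultdict(int)
    for scores in groups.values():
        if len(scores) != 2:
            continue  # skip incomplete pairs
        (first, first_mcc), (second, second_mcc) = scores.items()
        if first_mcc > second_mcc:
            wins[first] += 1
        elif second_mcc > first_mcc:
            wins[second] += 1
        # Ties are counted for neither side (an assumption).
    return dict(wins)


# Hypothetical toy results; the MCC values are made up for illustration.
toy_results = [
    {"arch": "ResNet50", "opt": "Adam", "lr": 1e-4, "mcc": 0.70},
    {"arch": "ResNet50", "opt": "SGD", "lr": 1e-4, "mcc": 0.60},
    {"arch": "ResNet18", "opt": "Adam", "lr": 1e-2, "mcc": 0.40},
    {"arch": "ResNet18", "opt": "SGD", "lr": 1e-2, "mcc": 0.55},
    {"arch": "VGG16", "opt": "Adam", "lr": 1e-4, "mcc": 0.65},
    {"arch": "VGG16", "opt": "SGD", "lr": 1e-4, "mcc": 0.65},
]
optimizer_wins = paired_wins(toy_results, "opt")
print(optimizer_wins)
```

The same helper could be applied to the class balancing flag by passing the corresponding key instead of `"opt"`.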

Table 4 Performance comparison of CNN architectures
Table 5 Effect of the learning rate on overall performance
Table 6 Effect of the weight decay on overall performance

To illustrate the effectiveness of the transfer learning approach, we created t-SNE [46] visualizations of the feature embeddings before and after the training of the best-performing model. As shown in Fig. 2, the original ResNet50 model pretrained on ImageNet cannot distinguish landslide from not-landslide images in either the training set (Fig. 2a) or the validation set (Fig. 2b). However, after finetuning the model on the target landslide dataset, the resulting feature embeddings show almost perfect separation of the classes in the training set (Fig. 2c) and reasonably good separation in the validation set (Fig. 2d).

Fig. 2 Feature embeddings before/after model finetuning

When applied to the held-out test set, the best-performing model achieves MCC=0.619, F1=0.701, and Accuracy=0.870, as opposed to MCC=0.730, F1=0.789, and Accuracy=0.906 on the validation set (Table 7). Although the difference in accuracy is relatively small, the differences in MCC and F1 are considerably larger due to significant drops in the precision and recall of the model on the test set. This can be explained by the disproportionate increase in false positive (128 vs. 45) and false negative (178 vs. 65) predictions on the test set (Table 8): both error counts grow by a factor of roughly 2.7 to 2.8, whereas the test set is only twice the size of the validation set. This is potentially a result of model overfitting to the validation set.

Table 7 Performance comparison of the best model on the validation and test sets
Table 8 Confusion matrices for the validation and test sets

To gain a better understanding of the inner workings of the model, we investigated class activation maps [47], which highlight the discriminative image regions that the CNN model attends to when deciding whether an image belongs to the landslide or not-landslide class. Figure 3 shows example visualizations for all four cases, i.e., true positives, true negatives, false positives, and false negatives. The visualizations for the true positive predictions indicate that the model successfully localizes the landslide regions (e.g., rockfalls, earth slip, etc.) in the images. Similarly, for the true negative predictions, the model focuses on areas that do not show any landslide cues, successfully avoiding tricky conditions such as muddy roads, wet surfaces, and natural rocky areas on a beach. However, in both false positive and false negative predictions, we observe that the errors occur mainly because the model fails to localize its attention on a particular region in the image, or is tricked by image regions that are reminiscent of landslide scenes. This analysis suggests that there is room for improvement: we can train more robust models by enriching the training set with additional hard negative and hard positive images. For instance, we can add more images of forest areas without any landslides to reduce false positives and more images of small-scale landslides to reduce false negatives.

Fig. 3 Class activation maps of the model predictions on the test set

6 Real-world deployment

We have developed a proof-of-concept system as presented in [16]. In a nutshell, the system (i) collects live tweets from the Twitter Streaming API that match landslide-related keywords in multiple languages, (ii) extracts image URLs from the tweets (if any) and downloads the images, (iii) runs the downloaded images through filtering models to eliminate duplicate and irrelevant content, (iv) runs the remaining images through the landslide model to tag each image as landslide or not-landslide, and finally, (v) displays the results on a dashboard for specialists’ examination. The system has collected almost 4.5 million images since its deployment in February 2020. However, only about 30,000 images have been labeled as landslide, which corresponds to less than 1% of the total volume. This indicates the difficulty of the task even though a carefully curated set of landslide-related keywords has been used to collect data from Twitter. To validate the performance of the landslide model in the wild, the specialists reviewed a random subset of the collected images (N=3,600) and assigned ground truth labels. We then re-computed performance scores for the real-world evaluation of the model (Table 9). The model achieves performance comparable to that in our experiments and, more importantly, generalizes well to a challenging real-world scenario.
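Steps (ii) to (iv) of the pipeline can be summarized with the following sketch. It is a deliberately simplified illustration, not the deployed system: the function names, the tweet dictionary layout, and the callables passed in are hypothetical, and exact-hash deduplication stands in for the learned filtering models, which are described in [16] rather than here.

```python
import hashlib


def extract_image_urls(tweet):
    """Return the image URLs attached to a tweet, if any (assumed layout)."""
    return tweet.get("image_urls", [])


def process_stream(tweets, download, is_relevant, is_landslide):
    """Tag each unique, relevant image as landslide or not-landslide."""
    seen_digests = set()
    tagged = []
    for tweet in tweets:
        for url in extract_image_urls(tweet):
            image = download(url)
            # Exact-duplicate check; a stand-in for the filtering models.
            digest = hashlib.sha256(image).hexdigest()
            if digest in seen_digests:
                continue
            seen_digests.add(digest)
            if not is_relevant(image):
                continue
            label = "landslide" if is_landslide(image) else "not-landslide"
            tagged.append({"tweet_id": tweet["id"], "url": url, "label": label})
    return tagged


# Hypothetical in-memory stand-ins for the downloader and the two models.
fake_store = {
    "u1": b"mudslide-photo",
    "u2": b"ad-banner",
    "u3": b"mudslide-photo",  # duplicate of u1
    "u4": b"beach-rocks",
}
toy_tweets = [
    {"id": 1, "image_urls": ["u1", "u2"]},
    {"id": 2, "image_urls": ["u3"]},
    {"id": 3},  # tweet without images
    {"id": 4, "image_urls": ["u4"]},
]
tagged_images = process_stream(
    toy_tweets,
    download=fake_store.__getitem__,
    is_relevant=lambda image: not image.startswith(b"ad"),
    is_landslide=lambda image: b"slide" in image,
)
print(tagged_images)
```

Placing the cheap duplicate and relevance checks before the landslide model keeps the more expensive classifier from running on images that would be discarded anyway.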

Table 9 Evaluation of the real-world performance

7 Discussion

Our experimental results and analytical findings suggest that CNN-based image classification models, when tuned optimally, can be useful for the challenging task of recognizing landslides from images. More importantly, instead of depending on citizen science projects (i.e., active crowdsourcing), we can scale up the solution much more efficiently by relying on passive crowdsourcing and leveraging the information shared on online social media platforms. This ability paves the way for an AI-based automated system that can monitor landslide events around the world and eventually reduce human effort and operational cost. Hence, we believe the contributions of the current study will advance the state of the art in global landslide data and research. However, we also acknowledge that the current study has some limitations. Below we elaborate on the implications of our experimental findings, existing limitations, and our future work in more detail.

On the technical side, it is important to note that our comprehensive experimentation focused exclusively on a selection of CNN architectures. However, transformer-based models, e.g., Vision Transformer (ViT) [48], have recently become more popular and have been shown to outperform their CNN counterparts in various computer vision tasks. Therefore, transformer-based image classification models may lead to better landslide detection performance. In addition, we did not thoroughly explore the effect of stronger data augmentation (e.g., RandAugment [49] and CutMix [50]) and regularization (e.g., label smoothing [41] and dropout [51]) in our current setup in order to keep the computational workload at a manageable level. Hence, it might be possible to improve the model performance further via stronger data augmentation and regularization techniques as well. We suggest running extended experiments to evaluate state-of-the-art vision transformer models as future work. Another potential extension of our work is multimodal modeling of social media text and images together for landslide detection, as suggested in [52].

On the application side, despite the fact that social media platforms provide quick access to situational information during time-critical events, we note that a large portion of this data contains irrelevant and redundant information. Therefore, tasking a single model (i.e., the landslide model) to sift through all the noise in the social media data alone might not be a plausible system realization. Instead, it is advisable to support the landslide model with other image classification models for filtering out duplicate and irrelevant content, as implemented in [16]. Similarly, the current study does not evaluate the authenticity and veracity of the landslide images collected from social media. We believe this requires further investigation through other automatic or manual processes. It is important to reiterate that this work is not intended to be used in isolation during a disaster scenario. In addition to the inherent noise within the data content itself, there are inaccuracies that could, in the worst case, hinder rescue operations if not combined with other data sources.

8 Conclusion

In this study, we developed a model that can automatically detect landslides in social media image streams. For this purpose, we first created a large image collection from multiple sources with different characteristics to ensure data diversity. Then, the collected images were assessed by three experts to attain high-quality labels through an iterative process of data re-labeling and model retraining as per data-centric AI principles. The collected dataset is currently the largest dataset for landslide recognition from ground-level images. At the heart of this study lay an extensive search for the optimal landslide model configuration with various CNN architectures, network optimizers, learning rates, weight decays, and class balancing strategies. We provided several insights about the impact of each optimization dimension on the overall performance. These insights validated, through controlled experiments in one place, common practices and expectations shared by the community. The best-performing model achieved plausible performance not only under an experimental setup but also in the wild during a real-world deployment. This underlines the feasibility of our ultimate goal, namely building a system that leverages social media data as a form of passive (i.e., opportunistic) crowdsourcing to detect landslide reports in real time and at scale. We believe such a system can contribute to the harvesting of global landslide data and facilitate further landslide research. More importantly, it can support global landslide susceptibility maps to provide situational awareness and improve emergency response and decision making.