Introduction

Landslides are a widespread natural process which, despite their local dimension, can cause both casualties and significant socio-economic losses (Diana et al. 2021). Mapping landslides in space and time has therefore long been of interest in the geosciences, and many data sources and methods have been developed for this purpose. Field surveys remain the basis of landslide mapping, although they depend heavily on detailed topographic maps, aerial images, and specific field instruments (geological compass, GPS, etc.) (Qiu et al. 2019; Rodriguez et al. 2021).

The interpretation of aerial and satellite images, combined with the manual digitization of points and polygons marking landslide locations, is another widely applied mapping method (Plank et al. 2016; Li et al. 2022; Li et al. 2023). A particular approach is the use of synthetic aperture radar (SAR) satellite imagery and the interferograms derived from it to generate deformation maps that can be associated with landslide occurrences (Miele et al. 2022; Wang et al. 2023). Despite these technological developments and the availability of SAR and optical imagery, mapping landslides under dense vegetation cover remained a major issue until the last 20 years or so, when the development of Light Detection and Ranging (LiDAR) made the mapping of landslides on densely vegetated slopes possible and more accurate (Jaboyedoff et al. 2012; Görüm 2019; Casagli et al. 2023).

Recently, the gap between the medium-resolution imagery provided by satellites and the high-resolution imagery offered by aerial surveys has been filled by the technological advancement of unmanned aerial vehicles (UAVs) (Niethammer et al. 2012; Ilinca et al. 2022; Șandric et al. 2023). Nowadays, daily high-resolution satellite imagery can even be used to map landslides in near-real time (Bhuyan et al. 2023).

Among all of these techniques, manual interpretation and mapping, machine learning algorithms, and, very recently, deep learning algorithms are noteworthy. Although developed relatively recently, deep learning algorithms have seen accelerated development and use in the geosciences. Most studies carried out so far have employed convolutional neural networks to identify landslides and build landslide inventories (Bhuyan et al. 2023; Ghorbanzadeh et al. 2019; Catani 2021). These studies focus on image classification and less on object detection or instance segmentation. Among recent studies, Catani (2021) stands out for identifying landslides in crowdsourced photos using image classification algorithms.

In general, detailed geomorphological mapping of landslides was primarily employed for geotechnical works and landslide stabilization, using total stations and GNSS devices. This process was time-consuming and almost impossible to implement in remote areas or on very steep slopes. Nowadays, high-resolution orthoimagery and DSMs/DTMs can be easily obtained with small UAVs over small areas (usually less than 1 km²). Thus, a significant number of landslides have been analyzed through UAV flights in different geological and geographical settings around the world (Niethammer et al. 2012; Stumpf et al. 2013; Lindner et al. 2016; Al-Rawabdeh et al. 2016; Qi et al. 2021; Cheng et al. 2021; Ilinca et al. 2022; Șandric et al. 2023).

Of all the previous works, less attention was given to mapping landslide features such as cracks, toes, bodies, scarps, and other morphological elements. Features like landslide cracks are essential for understanding landslide evolution across space and time because they reduce slope stability (Zhang et al. 2012; Krzeminska et al. 2012; Gao et al. 2015; Rogers and Selby 1980) and favour water infiltration (Damiano et al. 2017; Krzeminska et al. 2012; Khan et al. 2021). At the same time, they are associated with landslide longitudinal stretching (Baum et al. 1998) and landslide kinematics (Yang and Mei 2021; Ilinca et al. 2022). Their detailed mapping can therefore provide helpful information about landslide kinematics and help predict the short-term evolution of a landslide. In this respect, several papers have approached the mapping of landslide cracks using artificial intelligence and high-resolution imagery: Stumpf et al. (2013) used machine learning algorithms and edge-detection convolutions applied to point-cloud datasets to map the dynamics of fissures and fractures within a landslide body; Yang and Mei (2021) used the Single Shot Detector model to automatically identify landslide cracks; Pham et al. (2023) detected ground crack propagation by applying pixel classification and instance segmentation to camera views.

However, to our knowledge, no study has applied deep learning semantic segmentation to identify and delineate landslide cracks. Thus, two semantic image segmentation algorithms (U-Net and DeepLab) were implemented and tested in the current paper for mapping landslide cracks on high-resolution imagery collected with UAVs. To test the scalability and portability of the trained models, a study area in Romania was selected for which very high spatial resolution UAV flights were available. A detailed comparison and discussion of each model’s performance are presented with regard to the environmental conditions and the size and spectral response of the landslide cracks.

Materials and methods

Study area

On 3 May 2021, at around 19:00 h, the onset of the Livadea landslide was marked by a loud noise produced by a large mass of rock and colluvium detaching from the headscarp. The landslide moved down the slope along a small pre-existing valley (channelized landslide), leaving an imprint of the sliding direction. According to local residents, the sliding movement was gradual, at speeds of only a few metres per day, and continued until the morning of 7 May 2021. The location of the Livadea landslide is characterized by clear landforms (old hummocks) left behind by previous landslides and by numerous tilted trees (Ilinca et al. 2022).

The crack pattern of the Livadea landslide is not uniform. The cracks exhibit radial, transversal, and perpendicular orientations with respect to the landslide bodies (Fig. 1). Notably, the visibility of the crack alignments is influenced by the terrain’s reflectance and colours, with darker hues observed in areas of high soil moisture compared with regions exposed to intense solar radiation and lower soil moisture. Another crucial aspect of the Livadea landslide’s cracks is the intermingling of low vegetation with forest vegetation, including fallen or partially upright trees, bushes, and grass.

Fig. 1 Location of the study area in Romania: a overview of the Livadea landslide’s crack pattern; b scarp; c body; d toe

The Livadea landslide’s cracks exhibit a distinct pattern across its scarp, body, and toe (Fig. 1). The scarp, the steep surface at the head of the landslide, features deep, vertical cracks that reflect the intense shearing forces that generated the landslide. The body, the main portion of the landslide, exhibits a network of cracks with varying orientations, reflecting the complex deformation processes that have occurred within the landslide mass. The toe, the downslope margin of the landslide, is characterized by a more diffuse pattern of cracks, indicating the gradual dissipation of residual stresses near the landslide’s terminus.

This distinct crack pattern provides valuable insights into the evolution and development of the Livadea landslide. The deep, vertical cracks at the scarp suggest that the landslide was rapidly and violently displaced. The complex network of cracks in the body indicates that the landslide has undergone extensive deformation (Ilinca et al. 2022).

Data collection

The landslide cracks were manually digitized on high-resolution orthoimages (3 cm ground sampling distance, GSD) obtained from UAV flights. The Livadea landslide was surveyed with a DJI Phantom 4 RTK UAV equipped with a 20 MP RGB camera. In total, 421 cracks were digitized across the entire landslide. Because crack size and density differ across the landslide body, tile size tuning was necessary to estimate the optimal sizes objectively. Tests were performed with tile sizes from 64 to 512 pixels in steps of 64 pixels, yielding eight datasets in total. Because the surroundings of the cracks differ, from grass to bushes, trees, or bare soil, tests were performed with tiles containing masked and non-masked cracks (Fig. 2).
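As an illustration of this tiling step, the following minimal sketch assumes the orthomosaic and the rasterized crack polygons are already available as NumPy arrays; the array names, the non-overlapping grid, and the keep-only-tiles-with-cracks filter are assumptions, not details taken from the actual processing chain.

```python
import numpy as np

def export_tiles(ortho, crack_mask, tile_size, min_crack_pixels=1):
    """Split an orthoimage and its rasterized crack mask into square tiles,
    keeping only tiles that intersect at least one digitized crack."""
    tiles = []
    h, w = crack_mask.shape
    for row in range(0, h - tile_size + 1, tile_size):
        for col in range(0, w - tile_size + 1, tile_size):
            m = crack_mask[row:row + tile_size, col:col + tile_size]
            if m.sum() >= min_crack_pixels:
                tiles.append((ortho[row:row + tile_size, col:col + tile_size], m))
    return tiles

# Toy data standing in for the real inputs, which would be read from the
# 3 cm GSD orthomosaic and the rasterized crack polygons (e.g. with rasterio).
rng = np.random.default_rng(0)
ortho = rng.integers(0, 256, size=(1024, 1024, 3), dtype=np.uint8)
crack_mask = (rng.random((1024, 1024)) > 0.999).astype(np.uint8)

# One dataset per tested tile size: 64, 128, ..., 512 pixels (eight in total).
datasets = {s: export_tiles(ortho, crack_mask, s) for s in range(64, 513, 64)}
```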

Fig. 2 Samples exported with different tile sizes (64, 256, and 512 pixels), with masked and non-masked cracks; the left-hand column shows the non-masked cracks and the right-hand column the masked cracks

Training a model on masked samples offers several advantages. First, it enhances generalization to non-masked images by encouraging the model to focus on the features essential for detection, which proves especially valuable in applications where masks may be imperfect, such as in real-world surveillance. Second, it improves the model’s robustness to changes in object appearance, as the model learns to recognize intrinsic features rather than relying solely on the masks; this adaptability is beneficial where objects may be obscured or occluded. Moreover, training on masked samples is more efficient, since the masks provide the model with additional information about the target objects, which accelerates learning and improves performance. Finally, masked samples reduce the volume of data required for training, as the masks convey the essential information, which is particularly advantageous when data are limited.
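The paper does not detail how the masked samples of Fig. 2 were produced. One plausible construction, assuming “masked” means that pixels outside the digitized crack raster are blanked out, is sketched below; this is an interpretation, not the authors’ documented procedure.

```python
import numpy as np

def make_masked_sample(tile, mask):
    """Blank out everything outside the digitized crack mask, so the network
    is trained on the crack pixels and their immediate outline only.

    tile : (t, t, 3) uint8 RGB chip
    mask : (t, t) binary crack raster for the same chip
    """
    return tile * mask[..., None].astype(tile.dtype)
```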

Training and inferring

Two deep learning convolutional neural networks (CNNs) were implemented and compared in the current case study: U-Net and DeepLab. The U-Net architecture (Ronneberger et al. 2015), with its distinctive U-shape, comprises an encoder and a decoder and is well suited to pixel-wise tasks like image segmentation. The encoder reduces spatial dimensions through convolutional layers with max-pooling, capturing features at different scales. The bridge connects the encoder and the decoder, preserving spatial context. The decoder upsamples the feature maps and incorporates encoder information through skip connections, recovering fine details and object boundaries lost during downsampling. The output layer produces a segmentation mask that assigns a class label to each pixel. DeepLab (Chen et al. 2018), a state-of-the-art model from Google Research, addresses semantic image segmentation by introducing atrous convolutional layers with varied dilation rates to capture multi-scale features, improving segmentation accuracy without increasing the number of parameters or the model complexity.

Both U-Net and DeepLab employ ResNet backbones (He et al. 2015): ResNet-50 for U-Net and ResNet-101 for DeepLab. ResNet addresses the vanishing gradient problem in deep networks through residual blocks, which learn the residual between input and output and aid gradient flow via skip connections. ResNet’s extreme depth, made possible by these skip connections, improves training efficacy but demands substantial GPU RAM. Batch normalization, standard in ResNet, stabilizes and accelerates training by normalizing layer activations.
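The paper does not state its implementation framework. The sketch below, using PyTorch with the third-party segmentation_models_pytorch package for the ResNet-50 U-Net and torchvision’s DeepLabv3 (ResNet-101 backbone) as a stand-in for the DeepLab variant used here, is one plausible setup rather than the authors’ actual code.

```python
# Frameworks are an assumption: the paper does not name its implementation.
import torch
import segmentation_models_pytorch as smp  # third-party encoder-decoder models
from torchvision.models.segmentation import deeplabv3_resnet101

# U-Net with a ResNet-50 encoder, one output channel (crack vs. background).
unet = smp.Unet(encoder_name="resnet50", encoder_weights="imagenet",
                in_channels=3, classes=1)

# torchvision's DeepLabv3 with a ResNet-101 backbone; two classes.
deeplab = deeplabv3_resnet101(weights=None, num_classes=2)

x = torch.randn(4, 3, 256, 256)       # a batch of four 256-pixel RGB tiles
unet_logits = unet(x)                 # -> (4, 1, 256, 256)
deeplab_logits = deeplab(x)["out"]    # -> (4, 2, 256, 256)
```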

Validation metrics

Validation metrics are essential tools for evaluating the performance of deep learning models and ensuring that they generalize well to unseen data (Șandric et al. 2022). They are typically used in conjunction with training and testing data to assess the model’s ability to make accurate predictions on new data. The choice of validation metric depends on the specific task and the nature of the data. For classification tasks, accuracy is often the most straightforward metric, but precision, recall, and the F1-score can provide more nuanced insights into a model’s performance. Precision is calculated by dividing the number of correctly detected landslide cracks (true positives) by the sum of true positives and objects falsely detected as landslide cracks (false positives). Recall is calculated by dividing the number of true positives by the sum of true positives and the cracks the model failed to detect (false negatives). The F1-score measures the model’s overall accuracy and is computed as twice the product of precision and recall divided by their sum (Pleșoianu et al. 2020; Pedregosa et al. 2012). Because it is a harmonic mean, the F1-score gives more weight to the lower of the precision and recall scores, which makes it a more robust metric for evaluating the overall performance of a detection model.
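In standard notation, with TP the number of true positives (correctly detected cracks), FP the false positives, and FN the false negatives (missed cracks), these definitions read:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \, \frac{\mathrm{Precision} \cdot \mathrm{Recall}}
               {\mathrm{Precision} + \mathrm{Recall}}
```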

Results and discussions

Landslide crack detection was performed with the U-Net and DeepLab deep learning models, trained on masked and non-masked samples at different tile sizes. The results are evaluated and discussed based on the precision, recall, and F1-score validation metrics (Table 1).

Table 1 Validation metrics for U-Net and DeepLab in detecting landslide cracks

U-Net has the lowest precision, recall, and F1-score for the tile size of 64 pixels. The validation metrics gradually increase to over 0.9 as the tile size grows. The same evolution is observed with the masked training samples, the highest precision of 0.9301 being reached for a tile size of 512 pixels.

The DeepLab validation metrics follow a trend similar to U-Net’s, with the lowest precision, recall, and F1-score for the smallest tile size of 64 pixels. As the tile size increases, the validation metrics also increase, reaching very similar values of approximately 0.9 and above for tile sizes larger than 256 pixels.
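As an illustration only (the paper does not publish its evaluation code), per-pixel versions of these metrics can be computed from a predicted and a reference binary crack mask as follows; the function name and the zero-division convention are assumptions:

```python
import numpy as np

def crack_metrics(pred, truth):
    """Pixel-wise precision, recall, and F1 for binary crack masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)    # crack pixels correctly detected
    fp = np.sum(pred & ~truth)   # background predicted as crack
    fn = np.sum(~pred & truth)   # crack pixels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```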

The detection and mapping of landslide cracks presented in Figs. 4 and 5 clearly support the metrics in Table 1 and Fig. 3: cracks are detected better as the tile size increases.

Fig. 3 Influence of tile size on the landslide crack detection metrics for U-Net and DeepLab

One of the most striking observations from the results is the significant impact of tile size on the segmentation performance of both U-Net and DeepLab (Figs. 4 and 5). As the tile size decreases from 512 to 64 pixels, there is a general trend of decreasing precision, recall, and F1-score. This trend suggests that larger tile sizes tend to yield better results, as they incorporate more of the natural pattern of the landslide cracks, a fact that has also been reported for detecting landslide bodies (Bhuyan et al. 2023; Ghorbanzadeh et al. 2019; Meena et al. 2023). Larger tiles also provide the models with more contextual information, allowing them to make more accurate predictions. In contrast, smaller tiles provide limited context, making it difficult for the models to capture the complete structure of objects or regions in the images.
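To produce maps such as those in Figs. 4 and 5, a trained model is typically applied window by window across the orthomosaic. The sketch below illustrates such sliding-window inference under stated assumptions: a single-channel-output model like the U-Net sketched earlier and a simple 0–1 normalization, neither of which is specified in the paper.

```python
import numpy as np
import torch

@torch.no_grad()
def predict_map(model, ortho, tile_size, device="cpu"):
    """Classify each tile of an (H, W, 3) uint8 orthomosaic and mosaic the
    binary predictions back into a full-extent (H, W) crack map."""
    model.eval().to(device)
    h, w, _ = ortho.shape
    out = np.zeros((h, w), dtype=np.uint8)
    for row in range(0, h - tile_size + 1, tile_size):
        for col in range(0, w - tile_size + 1, tile_size):
            tile = ortho[row:row + tile_size, col:col + tile_size]
            # (t, t, 3) uint8 -> (1, 3, t, t) float in [0, 1]
            x = torch.from_numpy(tile).float().permute(2, 0, 1)[None] / 255.0
            logits = model(x.to(device))              # (1, 1, t, t)
            pred = (logits.sigmoid() > 0.5).squeeze().cpu().numpy()
            out[row:row + tile_size, col:col + tile_size] = pred
    return out
```

Larger windows hand the network more surrounding context per prediction, which is consistent with the tile-size trend in Table 1; overlapping windows with averaged predictions are a common refinement.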

Fig. 4 Landslide crack detection using U-Net and various tile sizes. From top to bottom: a the location of the landslide body and of each example below; b, c, and d the locations of the examples with landslide cracks; e, f, and g detection using a tile size of 64 pixels; h, i, and j detection using a tile size of 256 pixels; k, m, and n detection using a tile size of 512 pixels

Fig. 5 Landslide crack detection using DeepLab and various tile sizes. From top to bottom: a the location of the landslide body and of each example below; b, c, and d the locations of the examples with landslide cracks; e, f, and g detection using a tile size of 64 pixels; h, i, and j detection using a tile size of 256 pixels; k, m, and n detection using a tile size of 512 pixels

The presence of masks in the segmentation task introduces additional constraints and information for the models to consider, and it improves the precision and recall values. U-Net demonstrates a higher overall performance, achieving better precision, recall, and F1-score than DeepLab. U-Net’s strong precision indicates its ability to accurately identify and segment the regions of interest, while its recall remains competitive. DeepLab, on the other hand, excels in recall, implying that it captures a larger portion of the true positive regions, although at the cost of precision. The F1-scores of both models lie between their corresponding precision and recall values, reflecting the trade-off between the two.

U-Net continues to outperform DeepLab in terms of precision, maintaining its accuracy in identifying and segmenting regions of interest. DeepLab’s recall remains high, emphasizing its ability to capture a significant proportion of the true positive regions. Still, the difference between masked and non-masked samples is not that large, and both models can be used in future implementations. The F1-scores of both models in the “With Mask” scenario are generally higher than in the “Without Mask” scenario, suggesting that the models achieve a better balance between precision and recall when additional constraints, such as masks, are provided.

The choice between U-Net and DeepLab may depend on the specific requirements of the segmentation task. If precision is of paramount importance, U-Net may be the preferred choice. If capturing a high proportion of true positive regions is the primary goal, DeepLab’s higher recall may be advantageous.

Analyzing the minimum and maximum values within the results provides valuable insight into the extreme performance cases of each configuration. For U-Net, the minimum precision (0.7486) suggests that, in the worst case, it can produce a substantial number of false positive segmentations. Conversely, the maximum precision (0.9301) highlights its exceptional accuracy in identifying and segmenting regions of interest when conditions are favourable.

DeepLab’s lowest recall (0.5284) in the “Without Mask” scenario implies that it misses a significant portion of true positive regions during segmentation under specific conditions. However, its highest recall (0.9997) in the “With Mask” scenario indicates its ability to excel in capturing nearly all true positive regions when given the right constraints.

The minimum F1-score for U-Net (0.6951) indicates a scenario in which it struggles to balance accurate positive predictions with capturing a reasonable proportion of the true positive regions. In contrast, the maximum F1-score for U-Net (0.9621) demonstrates its ability to strike an excellent balance between precision and recall under favourable conditions (Yang and Mei 2021; Ghorbanzadeh et al. 2019; Catani 2021). Note that the F1-score always lies between the precision and recall values, closer to the lower of the two: as their harmonic mean, it gives more weight to the lower score, which makes it a more robust metric for evaluating the overall performance of a detection model.

Overall, as seen in Fig. 3, increasing the input tile size can improve the performance of a deep learning model for image segmentation (Yang and Mei 2021). However, there is a point beyond which further increases in tile size do not lead to significant improvements (Bhuyan et al. 2023; Catani 2021). Additionally, the F1-score is a more robust metric for evaluating the overall performance of a detection model (Pleșoianu et al. 2020).

Conclusions

In conclusion, the results of this comparative analysis between U-Net and DeepLab for image segmentation shed light on mapping landslide cracks. While U-Net generally outperforms DeepLab, the choice between these models should be based on the specific requirements of the segmentation task, the importance of precision versus recall, and the trade-offs between these metrics.

Understanding the influence of tile size is crucial, as larger tile sizes tend to yield better segmentation results due to the increased contextual information. Both methods perform poorly on the smallest tile size (64 × 64) because such small tiles do not provide enough context for the segmentation process. The performance of both methods is relatively stable for tile sizes between 128 × 128 and 512 × 512, suggesting that these tile sizes offer a good compromise between accuracy and efficiency. Additionally, the presence of masks can enhance overall performance by providing additional constraints and context for the models.

Further research may involve exploring how these models perform with different data types, dataset sizes, and real-world scenarios to tailor the choice to the specific needs of a project.