Machine learning and landslide studies: recent advances and applications

Upon the introduction of machine learning (ML) and its variants, in the form that we know today, to the landslide community, many studies have been carried out to explore the usefulness of ML in landslide research and to look at some classic landslide problems from an ML point of view. ML techniques, including deep learning methods, are becoming popular to model complex landslide problems and are starting to demonstrate promising predictive performance compared to conventional methods. Almost all the studies published in the literature in recent years belong to one of the following three broad categories: landslide detection and mapping, landslide spatial forecasting in the form of susceptibility mapping, and landslide temporal forecasting. In this paper, we present a brief overview of ML techniques, provide a general summary of the landslide studies conducted, in recent years, in the three above-mentioned categories, and make an attempt to critically evaluate the use of ML methods to model landslide processes. The paper also provides suggestions for future use of these powerful data-driven techniques in landslide studies.


Introduction
Landslides are the gravity driven motion of a mass of rock, soil and debris down a slope, and they can cause significant fatalities and economic losses. According to the World Bank, about 3.7 million square kilometers of inland area on earth is prone to landslide

Machine learning

Background
ML algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so. The term Machine Learning is attributed to Arthur Samuel, a pioneer in the field of computer gaming and artificial intelligence, who coined it in 1959. In his article (Samuel 1959), he said "Two machine-learning procedures were investigated using the game of checkers. The main idea was that a computer can be programmed so that it will learn to play a better game of checkers than can be played by the person who wrote the program. Furthermore, it can learn to do this in a remarkably short period of time … when given only the rules of the game…and a redundant and incomplete list of parameters which are thought to have something to do with the game, but whose correct signs and relative weights are unknown and unspecified." Mitchell (1997) provides a formal definition of ML as follows: "A computer program is said to learn from experience E with respect to some class of tasks T, and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." One of the presumed traits of ML in comparison with humans is its comparable or superior predictive and decision-making performance. What a typical "learning machine" does is finding a rule that, when applied to a collection of inputs, produces the desired outcome. This rule also generates the correct outcome for most other inputs (distinct from the training data) on the condition that those inputs come from the same or a similar statistical distribution as the one the training data was drawn from.
It can be argued that such a process is not necessarily learning (Burkov 2019) in the way humans learn, because if the inputs are slightly changed, the outcome can become completely wrong. For instance, if a machine learning algorithm is trained by "looking" at landslide images in vegetated areas, unless it is also trained to recognize landslides in bare lands, it may fail to identify such landslides. Therefore, it is reasonable to conclude that, as of today, machine learning cannot outperform humans in many fields. However, given the pace of ML/DL advances, it is probable that in the future it will revolutionize the processes of prediction and decision making, and also remarkably influence practice and research in landslide risk assessment and management.

Prediction versus explanation
There has been a long debate among statisticians about the scientific value of predictive models versus explanatory and descriptive models (e.g., Geisser 1975; Wallis 1980; Breiman 2001; Parzen 2001; Feelders 2002; Shmueli 2010). This debate has been further intensified by the emergence of ML techniques in the computer science community as powerful predictive methods compared to classical statistics-based methods. As a result, according to Breiman (2001), it can be argued that there are at least two cultures in data-driven analysis, namely data modeling and algorithmic modeling, with the former aiming to gain information from data in order to predict, and the latter treating the data mechanism as unknown and only aiming at maximizing the accuracy of the predictions. As inferred from the thorough discussion by Shmueli (2010) on the difference between explanation and prediction, data modeling, as an explanatory approach, aims to provide the truth, whereas algorithmic modeling, as a predictive approach, aims to provide the reality based on the available data. Most ML methods fall primarily on the side of algorithmic modeling, and therefore prediction. Given these distinctions, hereinafter, we adopt the following criteria to qualify landslide studies as ML-based studies, and to distinguish them from statistics-based studies.
• Accurate prediction is the main goal of the study. Therefore, we deliberately excluded explanatory models, such as linear statistics-based methods that are used for statistical inference.
• Data are divided into training and testing datasets, and evaluating the trained model on the testing dataset is the major method for assessing the performance of the model.

Conventional machine learning versus deep learning
In general, DL can be regarded as a subset of ML. The main distinction between DL and conventional ML algorithms falls in the modality of learning from data. Besides, DL is primarily (so far) based on artificial neural networks (ANN), whereas conventional ML methods include algorithms other than, and as well as, ANN. In conventional ML algorithms, labeled or unlabeled data come along with certain features or attributes. It might be necessary that the analyst reduces or increases these features, depending on data quantity and the utilized ML algorithm. Through the training process, a conventional ML algorithm learns to find patterns in the data based on its available features.
In DL, the input data (e.g., image, text, video or time series) is sent directly to artificial neural networks, where each network hierarchically learns specific features of the input data. The learned features are then used to find a pattern that associates the input data to a specific label, to a distinct category or to a decision. DL algorithms typically require more data for training than conventional ML algorithms, given their higher number of trainable parameters.

Learning methods
Supervised, unsupervised (semi-supervised can be seen as a mix of the two), and reinforcement learning are the major ML methods.
Supervised learning algorithms are used on data that consist of a set of inputs (predictors, independent variables or features) and their corresponding outputs (target variables or labels). Training on input variables and target variables, the machine learns how to map inputs to corresponding outputs. The training process continues until the model achieves a desired level of accuracy on the training data. The validity of the model is assessed by evaluating the model on unseen data (test set). Examples of supervised learning algorithms include decision trees and tree ensembles (e.g., Random Forest and boosting algorithms such as AdaBoost and XGBoost), support vector machines, and artificial neural networks, including multi-layer perceptron neural nets and supervised DL algorithms.
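As a minimal sketch of the hold-out workflow described above, the following Python snippet fits a deliberately simple one-feature threshold classifier on a training subset and scores it on unseen test data. The slope-angle feature, the 30-degree labelling rule and all function names are illustrative assumptions and are not taken from any of the reviewed studies.

```python
import random


def accuracy(predicted, observed):
    """Fraction of samples whose predicted label matches the observed one."""
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


def train_test_split(features, labels, test_fraction=0.3, seed=0):
    """Shuffle sample indices and hold out a fraction of them for testing."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [features[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


def fit_threshold(x_train, y_train):
    """Pick the threshold (among training values) with the best training accuracy."""
    best_threshold, best_score = None, -1.0
    for candidate in sorted(set(x_train)):
        score = accuracy([int(x >= candidate) for x in x_train], y_train)
        if score > best_score:
            best_threshold, best_score = candidate, score
    return best_threshold


def predict(threshold, features):
    """Label a sample as landslide (1) when its feature reaches the threshold."""
    return [int(x >= threshold) for x in features]


# Synthetic, hypothetical data: slope angle in degrees, label 1 = landslide.
slopes = list(range(60))
labels = [int(s >= 30) for s in slopes]

x_train, x_test, y_train, y_test = train_test_split(slopes, labels)
threshold = fit_threshold(x_train, y_train)
train_accuracy = accuracy(predict(threshold, x_train), y_train)
test_accuracy = accuracy(predict(threshold, x_test), y_test)
```

The test accuracy, not the training accuracy, is the figure that indicates how the model may behave on unseen data; the two can differ when the held-out samples fall near the learned threshold.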
In unsupervised learning, there is no target or label variable to predict and the goal is to "make sense" of the data. A conventional use of unsupervised learning is clustering populations of data in different groups for specific interventions. Examples of this type of unsupervised learning algorithms are hierarchical clustering, K-means, Density-Based Spatial Clustering of Applications with Noise (DBSCAN). DL algorithms can also be used for unsupervised learning. Generative DL algorithms such as autoencoders and generative adversarial networks (GANs) belong to this group.
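To illustrate clustering without labels, the sketch below implements a plain K-means loop (alternating assignment and centroid update) in Python. The two-value "pixel signatures" are made-up numbers, and starting the centroids at the first k points is a simplifying assumption; practical implementations use more careful initialisation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iterations=20):
    """Cluster points into k groups; returns (labels, centroids)."""
    # Simplifying assumption: the first k points serve as initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = []
    for _ in range(n_iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(
                    sum(values) / len(members) for values in zip(*members)
                )
    return labels, centroids


# Hypothetical two-band signatures for six pixels (values are invented).
pixels = [(0.10, 0.20), (0.90, 0.80), (0.15, 0.10),
          (0.85, 0.90), (0.20, 0.15), (0.80, 0.95)]
labels, centroids = kmeans(pixels, k=2)
```

The cluster indices carry no meaning by themselves; an analyst still has to decide which cluster, if any, corresponds to a class of interest.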
In reinforcement learning (RL), the machine is trained to make specific decisions. An agent is exposed to an environment where it takes actions to maximize, over multiple steps ahead, either the cumulative reward (called the return, usually in episodic problems) or the average reward (in continuing problems). The agent learns from past experience and tries to capture the best possible knowledge to make accurate decisions. Markov Decision Process and Deep Reinforcement Learning are some examples of RL methods.
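The two objectives named above can be written down in a few lines of Python. This is only a sketch of the quantities an agent tries to maximize, not of a learning algorithm; the discount factor `gamma` is an assumption introduced here (with `gamma=1.0` the return is the plain cumulative reward).

```python
def discounted_return(rewards, gamma=1.0):
    """Cumulative (optionally discounted) reward of one episode."""
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def average_reward(rewards):
    """Mean reward per step, the usual objective in continuing problems."""
    return sum(rewards) / len(rewards)


episode_rewards = [1.0, 1.0, 1.0]  # hypothetical rewards for three steps
undiscounted = discounted_return(episode_rewards)
discounted = discounted_return(episode_rewards, gamma=0.5)
```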

Background
Landslide detection or mapping refers to the delineation of landslide-affected areas, which include the source and the deposition zones of the moving soil or rock mass. Landslide detection is an important part of emergency response to extreme events, such as extreme rainfall and strong earthquakes, to identify hazardous areas affected by slope failure, where field surveys can be expensive, cumbersome, dangerous and involve access difficulties. Landslide detection is also useful for building landslide geomorphological inventories (historical, event-based, seasonal or multi-temporal), which help understanding the causal factors of past landslides (Guzzetti et al. 2012), and can help monitoring, predicting and mitigating future landslides. Before the widespread use of satellite remote sensing data, landslide detection was essentially done using visual inspection of aerial photographs or field surveys, a time consuming and expensive process. Mondini et al. (2011) estimated that the manual production of an event-based landslide inventory requires about 5 days per person per square kilometre, including interpretation of aerial photographs, field surveys, digitization of information and creating a geographical database. More information on landslide detection methods can be found in the review paper by Guzzetti et al. (2012).
In the past decade and by provision of high volumes of medium to high resolution satellite and airborne-based imagery, ML techniques have become attractive choices for landslide detection. The main goal in the application of ML algorithms for landslide detection is to enable the machine to detect landslide features, such as scarp and run-out track, in a similar way to humans finding these features in a set of images. This is possible primarily because a landslide makes a contrast, especially in vegetated areas, with the surrounding area by exposing fresh rock and soil, causing local change in brightness of the image. Therefore, what ML algorithms aim to achieve is a human-level capability in landslide feature detection. It is noted that landslide detection, using remote sensing data does not inevitably mean the use of ML methods, and indeed many remote sensing data are currently analysed and processed using manual and rule-based methods that require greater involvement of domain experts and setting area specific thresholds.
As shown in Fig. 1a, conventional ML methods (CML), in the form of pixel-based and object-based landslide detection, were more popular in the early landslide detection studies (e.g., Borghuis et al. 2007; Danneels et al. 2007; Chang et al. 2007, 2010; Gong et al. 2010; Martha et al. 2011; Van Den Eeckhaut et al. 2012), whereas in the past few years the interest has grown towards using DL methods (e.g., Ding et al. 2016; Chen et al. 2018a, b; Ghorbanzadeh et al. 2019a, b, c; Can et al. 2019; Bui et al. 2020a, b; Prakash et al. 2020), with some studies comparing different methods for the same test area. It should be noted that CML in landslide detection can be supervised, unsupervised or a combination of the two methods. DL methods in landslide detection (up to the time of this literature review) fall primarily under the category of supervised methods.
Landslide detection is typically done either using change detection between pre- and post-landslide images or solely using feature detection in post-landslide images. In both cases, as shown in Fig. 1b, ML algorithms have been applied considering CML and DL as supervised learning methods (e.g., Danneels et al. 2007; Chang et al. 2010; Chen et al. 2014; Mora et al. 2018; Chen et al. 2019; Prakash et al. 2020) and, to a lesser extent, unsupervised learning methods (e.g., Martha et al. 2011; Li et al. 2016; Keyport et al. 2018) or combined supervised and unsupervised learning (e.g., Borghuis et al. 2007; Fang et al. 2020). In Fig. 1, the number of "All" articles in each year is not necessarily the same as the summation of articles in each ML category, as some papers consider multiple ML methods.
Fig. 1 Trend in application of ML algorithms in landslide detection studies: (a) use of conventional ML in pixel-based methods (CML-PB), conventional ML in object-based methods (CML-OB) and DL methods; (b) use of supervised, unsupervised and combination of the two methods
The identified studies have been performed in various countries across the globe as shown in Fig. 2, with China and Hong Kong being the geographical areas with most case studies. In 3 cases (i.e., Global and Search Engine), the studies used landslide data across the globe or from landslide images collected using search engines to train landslide detection algorithms.
Given the spatial extent of landslides, remote sensing technologies, including Earth Observation satellites and airborne sensors mounted on aircraft and unmanned aerial vehicles (UAV), are widely used for landslide detection. These technologies result in various data sources which typically involve medium to high and very high resolution optical, multispectral, LiDAR (light detection and ranging) and radar data. These include airborne LiDAR DEM data (e.g. Van Den Eeckhaut et al. 2012; Pawluszek-Filipiak & Borkowski 2020; Prakash et al. 2020), UAV-based optical imagery (e.g. Lei et al. 2019a, b; Catani 2021), satellite-based Synthetic Aperture Radar (SAR) data (e.g. Kamiyama et al. 2018; Mabu et al. 2020), satellite-based medium resolution multi-spectral data (e.g. Prakash et al. 2020) and satellite-based high (e.g. Bacha et al. 2020; Tavakkoli-Piralilou et al. 2019) and very high resolution multi-spectral data (e.g. Cheng et al. 2013).

Methods
Landslide detection methods are a special application of the characterization of land cover, and its change, for which there has been increasing scientific and practical interest in the remote sensing community. These methods, in general, fall within two interrelated categories: pixel-based and object-based methods.
Pixel-based landslide mapping examines each pixel in the single-band or multi-band image and determines whether it belongs to a landslide or not. This is done first by treating all input features (e.g., morphological and spectral features) as raster layers (bands), which are co-registered and re-sampled to a chosen resolution. Then feature values are extracted at given pixels and further examined to decide whether or not they represent features of a landslide.
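The stacking step described above can be sketched as follows: co-registered raster layers of identical shape are turned into one feature vector per pixel, which is the table a pixel-based classifier would be trained on. The band names and values are hypothetical, and co-registration and re-sampling are assumed to have been done beforehand.

```python
def stack_bands(bands):
    """Turn {band_name: 2D grid} into (band_names, per-pixel feature vectors).

    Pixels are listed in row-major order; all grids must share one shape.
    """
    names = sorted(bands)
    shapes = {(len(grid), len(grid[0])) for grid in bands.values()}
    if len(shapes) != 1:
        raise ValueError("all bands must be co-registered to the same grid")
    n_rows, n_cols = shapes.pop()
    features = [
        [bands[name][r][c] for name in names]
        for r in range(n_rows)
        for c in range(n_cols)
    ]
    return names, features


# Hypothetical 2 x 2 rasters: slope in degrees and a vegetation index.
bands = {
    "slope": [[10, 35], [40, 5]],
    "ndvi": [[0.8, 0.2], [0.1, 0.9]],
}
band_names, pixel_features = stack_bands(bands)
```

Each row of `pixel_features` can then be paired with a landslide or non-landslide label for supervised training, or passed to a clustering algorithm when no labels exist.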
In object-based methods, which can be seen as a subset of pixel-based methods, the spatial connection of neighbouring pixels is used to identify objects, in single-band or multi-band images, and then the objects are examined to determine whether they are landslide or non-landslide segments. Figure 3 illustrates the landslide detection and mapping methods using CML and DL as well as rule-based and other data-driven approaches. In all these cases, change detection and feature detection methods can be used based on pre- and post-landslide images or only post-landslide images.
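As a rough illustration of change detection between pre- and post-landslide images, the Python sketch below computes a per-pixel brightness difference and flags strong increases as candidate pixels. The fixed threshold is a hypothetical, rule-based stand-in; in an ML workflow the difference image would instead be one of the input features passed to a trained classifier.

```python
def brightness_change(pre_image, post_image):
    """Per-pixel difference (post minus pre) of two co-registered grids."""
    return [
        [post - pre for pre, post in zip(pre_row, post_row)]
        for pre_row, post_row in zip(pre_image, post_image)
    ]


def candidate_mask(difference, threshold):
    """Mark pixels whose brightness increased by more than the threshold."""
    return [[int(value > threshold) for value in row] for row in difference]


# Hypothetical brightness values in [0, 1] before and after an event.
pre_image = [[0.20, 0.20], [0.30, 0.20]]
post_image = [[0.20, 0.70], [0.80, 0.25]]

difference = brightness_change(pre_image, post_image)
mask = candidate_mask(difference, threshold=0.3)
```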

Pixel-based methods using CML
Pixel-based landslide detection is performed with pixels as input. In a digital image, a pixel is the basic constituent element. In general, pixel-based methods often require extensive parametric tuning and precise geometrical correction or co-registration to be applicable to large areas (Sameen & Pradhan 2019).
Based on the published works in this area, pixel-based CML covers a range of studies that use both supervised and unsupervised methods. In pixel-based supervised classification, the pixels are labelled as landslide or non-landslide by the landslide experts, and then the labelled pixels along with the corresponding signatures from bands of input images (see the left-hand side of Fig. 3) are used to train ML algorithms. Besides direct change detection methods, some authors (e.g., Si et al. 2018) used susceptibility analysis as the basis for landslide change detection by using areas with high susceptibility as candidates for applying the derived change detection thresholds for identifying new landslides.
Unsupervised classification is typically used to cluster pixels in a dataset based on their similarity with other pixels, without any user-defined label. The main limitation of unsupervised classification is that the output needs to be interpreted and manually assigned a label. K-means clustering, Gaussian Mixture Models (GMM), Markov Random Field and hierarchical clustering are the unsupervised learning algorithms in landslide detection studies considered herein (Martha et al. 2011; Cheng et al. 2013; Li et al. 2016; Keyport et al. 2018).
Combining unsupervised and supervised learning methods, Borghuis et al. (2007) used a two-step approach for landslide detection in Taiwan following typhoons Mindulle and Aere in 2004. First, they used K-means clustering for deducing spectral signatures of pixels of optical satellite images (5-m resolution SPOT-5) and then used supervised classification with Maximum Likelihood Classifier (MLC) for classification of K-means labels. Furthermore, Borghuis et al. (2007) used MLC for classification of manually labelled pixels containing spectral features from optical images and associated DEM for landslide detection.
Because landslide detection is a classification problem, almost all supervised CML studies use metrics derived from the confusion matrix for evaluating the performance of the models on test sets. Kappa coefficient and Area Under the Receiver Operating Characteristic curve (AUC) have also been used for model performance evaluation. Table 1 summarises the main features of pixel-based CML studies for landslide detection (see Appendix 1 for meaning of acronyms). It should be noted that this list only considers studies that used CML as the primary approach for pixel-based landslide detection and does not include studies where CML pixel-based methods were used for comparison with other methods.
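For reference, the confusion-matrix metrics and the Kappa coefficient mentioned above can be computed as in the following sketch for a binary landslide versus non-landslide problem. The function names and the example counts are illustrative assumptions; the formulas are the standard definitions of accuracy, precision, recall, F1 score and Cohen's kappa.

```python
def confusion_counts(predicted, observed):
    """Count true/false positives and negatives for binary labels (1 = landslide)."""
    pairs = list(zip(predicted, observed))
    tp = sum(1 for p, o in pairs if p == 1 and o == 1)
    fp = sum(1 for p, o in pairs if p == 1 and o == 0)
    fn = sum(1 for p, o in pairs if p == 0 and o == 1)
    tn = sum(1 for p, o in pairs if p == 0 and o == 0)
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Standard metrics derived from a 2 x 2 confusion matrix."""
    n = tp + fp + fn + tn
    observed_agreement = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    chance_agreement = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return {
        "accuracy": observed_agreement,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "kappa": (observed_agreement - chance_agreement) / (1 - chance_agreement),
    }


# Hypothetical test-set counts.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```

Because landslide pixels are usually far rarer than non-landslide pixels, accuracy alone can look high for a model that misses most landslides, which is one reason recall and kappa are reported alongside it.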

Object-based methods using CML
Object-based methods using CML fall within the framework of Object-Based Image Analysis (OBIA) that includes two major steps: (1) image segmentation, and (2) classification of the emerged segments. ML methods can be applied in both steps. While OBIA offers extra features to distinguish landslides from other objects, it needs to optimize segmentation parameters (e.g. scale) (Myint et al. 2011) and thus the degree of automation is low compared to pixel-based methods (Sameen & Pradhan, 2019). CML methods combined with an OBIA framework have drawn the attention of the geo-informatics community for landslide detection. To this aim, both supervised and unsupervised CML have been used, and sometimes combined, under the OBIA framework.
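The two OBIA steps can be illustrated with a toy example. The sketch below uses 4-connected component labelling as a simple stand-in for the segmentation step (it is not the multi-resolution segmentation used in the reviewed studies) and then derives one per-object feature, the area in pixels, that a classifier could use in the second step. The input mask and function names are hypothetical.

```python
from collections import deque


def label_segments(mask):
    """Group 4-connected foreground pixels into numbered segments (1, 2, ...)."""
    n_rows, n_cols = len(mask), len(mask[0])
    labels = [[0] * n_cols for _ in range(n_rows)]
    n_segments = 0
    for r in range(n_rows):
        for c in range(n_cols):
            if mask[r][c] and labels[r][c] == 0:
                n_segments += 1
                labels[r][c] = n_segments
                queue = deque([(r, c)])
                # Breadth-first flood fill over the four direct neighbours.
                while queue:
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        inside = 0 <= ny < n_rows and 0 <= nx < n_cols
                        if inside and mask[ny][nx] and labels[ny][nx] == 0:
                            labels[ny][nx] = n_segments
                            queue.append((ny, nx))
    return labels, n_segments


def segment_areas(labels, n_segments):
    """Per-object feature: number of pixels in each segment."""
    areas = {segment: 0 for segment in range(1, n_segments + 1)}
    for row in labels:
        for value in row:
            if value:
                areas[value] += 1
    return areas


# Hypothetical binary mask of candidate pixels.
mask = [
    [1, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
segment_labels, n_segments = label_segments(mask)
areas = segment_areas(segment_labels, n_segments)
```

In a real OBIA workflow, each segment would carry many more features (spectral statistics, shape, texture, topography), and the segmentation parameters such as scale would need tuning, as noted above.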
Supervised CML methods have been frequently used in segment classification of OBIA for landslide detection. While the segmentation step in OBIA is typically performed using multi-resolution methods implemented in commercial geo-spatial analysis packages, some authors used unsupervised learning algorithms for segmentation of optical remote sensing images. For instance, in a two-step OBIA-based landslide detection in India using 5.8-m resolution multi-spectral satellite data, Martha et al. (2011) adopted K-means clustering for objective thresholding of multi-resolution image segmentation before running classification on the final segments.
The use of both supervised and unsupervised learning methods in OBIA was studied by Cheng et al. (2013), who suggested an object-based framework built on computer vision (Bag of Visual Words, BoVW) and text mining methods (probabilistic latent semantic analysis, pLSA) for detecting landslides. At the heart of these methods were K-means clustering in BoVW for clustering the pixels into visual words and kNN in pLSA. They trained and tested this approach for an area in China using 1-m resolution multi-spectral satellite data (Geoeye-1). Table 2 summarises the main features of object-based CML studies for landslide detection. Figure 4 illustrates the difference of pixel-based and object-based methods with regard to the application of CML algorithms.

DL methods
DL methods are mostly used in the context of computer vision in landslide detection studies. Unlike conventional ML methods, DL methods do not need extensive feature engineering in the preparation of the training dataset. However, in general, DL methods require more training data than conventional ML methods, given the higher number of model variables (thousands to millions) that need to be fit. This limitation is typically addressed by data augmentation methods that involve rotation and flipping of the original images. In our literature review, we identified one work that used DL in an application outside computer vision. In this work, Mezaal et al. (2017) combined fuzzy-based image segmentation (object-based) with Multi-Layer Perceptron (MLP) and Recurrent Neural Networks (RNN), a DL method, for landslide detection in Cameron Highlands in Malaysia. They used a LiDAR point cloud with a point density of 8 points/m² to derive a 0.5 m resolution DEM for acquiring topographic features. In computer vision applications, DL methods can be categorized into three groups, which in order of complexity are: (1) image classification, (2) object detection, (3) semantic segmentation. In image classification, the goal is to find the label of the image (e.g., landslide or non-landslide). In object detection, the aim is to identify and locate the objects that are present in an image, with the help of bounding boxes. Semantic segmentation goes a step further by delineating the exact boundary of the objects in the image. In semantic segmentation, each pixel in an image is assigned to a certain class, and hence this can be thought of as a classification problem per pixel. DL-based semantic segmentation in landslide detection is mainly about binary semantic segmentation of images at pixel level.
Based on the identified landslide detection studies, it can be inferred that landslide detection using deep learning has been performed primarily either as an image classification (whole image or image patches) or as semantic segmentation.
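The rotation-and-flipping augmentation mentioned earlier in this section can be sketched in plain Python as follows. The example generates the eight variants obtained from 90-degree rotations and horizontal flips of a small image stored as nested lists; the 2 x 2 "image" is a made-up example, and real pipelines would apply the same idea to multi-band arrays and their label masks.

```python
def rotate90(image):
    """Rotate a 2D grid by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def flip_horizontal(image):
    """Mirror a 2D grid left to right."""
    return [row[::-1] for row in image]


def augment(image):
    """Return the 8 rotated/flipped variants of an image (original included)."""
    variants = []
    current = [row[:] for row in image]
    for _ in range(4):
        variants.append(current)
        variants.append(flip_horizontal(current))
        current = rotate90(current)
    return variants


# Hypothetical 2 x 2 image patch.
patch = [[1, 2], [3, 4]]
augmented = augment(patch)
```

For semantic segmentation, the same transformation has to be applied to the image and to its label mask so that pixels and labels stay aligned.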

DL for image classification
Image classification in landslide detection is mainly limited to classifying images as landslides or non-landslides. In DL-based image classification, it is very customary to use well-known algorithms pre-trained on massive datasets, for classification of images that are not found in those datasets. For instance, Catani (2021) used four pre-trained top performer CNN algorithms using transfer learning to train a general-purpose landslide detection from UAV and ground-based RGB (Red-Green-Blue bands) photographs found through search engines. The four pre-trained algorithms were: GoogLeNet (Szegedy et al. 2015); GoogLeNet-Places365 (Zhou et al. 2018b), a modified version of GoogLeNet specifically oriented towards the classification of the scene rather than single objects; ResNet.101, a 101-layer CNN with improved training curve based on residual learning (He et al. 2015); and Inception.V3, for classification of multi-purpose images in near real time (Szegedy et al. 2016). Table 3 summarises the main features of DL-based image classification studies for landslide detection.

DL for patch-wise image classification
Patch-wise image classification involves splitting the original image into multiple square patches with a width much smaller than the original image width/height and then labelling these patches as landslide or non-landslide to be used for training a CNN model or a variant of it. Once the CNN model is trained on landslide and non-landslide patches, it can be run on patches of an image and each patch can be labelled. All recognized patches put together show the extent of the landslide. In one of the first applications of DL in landslide detection, Ding et al. (2016) used patch-wise image classification (patch size = 28 pixels) for landslides that occurred in 2015 in Shenzhen, China. In a more recent work, Ghorbanzadeh et al. (2019b) compared the machine learning methods ANN, SVM and RF (pixel-based) with different CNN-based patch-wise classifications for landslide detection in Rasuwa district in Nepal. For CNNs, they used multiple square window (patch) sizes of 12, 16, 22, 32 and 48 pixels in an image classification framework and found that, in general, smaller window sizes gave more accurate results. Their conclusion was that CNNs did not automatically outperform ANN, SVM and RF, and that the performance of CNNs strongly depended on their design, i.e., layer depth, input window sizes and training strategies. Table 4 summarises the main features of these studies.
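The patch-wise workflow described above (split, classify each patch, reassemble) can be sketched as follows. The brightness-based `toy_classifier` is a hypothetical stand-in for a trained CNN, and the 4 x 4 image with 2-pixel patches is invented for illustration; only the splitting and mosaicking logic is the point of the example.

```python
def split_into_patches(image, patch_size):
    """Cut a 2D grid into non-overlapping square patches keyed by (top, left)."""
    n_rows, n_cols = len(image), len(image[0])
    patches = {}
    for top in range(0, n_rows - patch_size + 1, patch_size):
        for left in range(0, n_cols - patch_size + 1, patch_size):
            patches[(top, left)] = [
                row[left:left + patch_size]
                for row in image[top:top + patch_size]
            ]
    return patches


def paint_patch_labels(patch_labels, shape, patch_size):
    """Write each patch label back onto every pixel the patch covers."""
    n_rows, n_cols = shape
    grid = [[0] * n_cols for _ in range(n_rows)]
    for (top, left), label in patch_labels.items():
        for r in range(top, top + patch_size):
            for c in range(left, left + patch_size):
                grid[r][c] = label
    return grid


def toy_classifier(patch):
    """Hypothetical stand-in for a trained CNN: flags bright patches."""
    values = [value for row in patch for value in row]
    return int(sum(values) / len(values) > 0.5)


# Hypothetical single-band image with one bright (exposed soil) corner.
image = [
    [0.1, 0.2, 0.9, 0.8],
    [0.2, 0.1, 0.7, 0.9],
    [0.1, 0.1, 0.2, 0.1],
    [0.2, 0.1, 0.1, 0.2],
]
patches = split_into_patches(image, patch_size=2)
patch_labels = {origin: toy_classifier(p) for origin, p in patches.items()}
landslide_map = paint_patch_labels(patch_labels, (4, 4), patch_size=2)
```

The resulting map is only as fine as the patch size, which is consistent with the sensitivity to window size reported in the studies above.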

DL for semantic segmentation
Semantic segmentation methods rely on pixel-wise classification. Semantic segmentation using innovative CNN architectures has gained momentum in the past few years. An example of DL methods for semantic segmentation is a fully convolutional network (FCN) (Long et al. 2015), which uses a convolutional neural network to transform image pixels to pixel categories. Unlike the convolutional neural networks previously introduced, FCN transforms the height and width of the intermediate layer feature map back to the size of the input image through the transposed convolution layer, so that the predictions have a one-to-one correspondence with the input image in the spatial dimensions (height and width). Given a position on the spatial dimension, the output of the channel dimension will be a category prediction of the pixel corresponding to the location. U-Net is another popular CNN architecture used for semantic segmentation since it requires fewer images for training compared to conventional CNN architecture with multiple consecutive layers. Konishi & Soga (2019) used U-Net for landslide detection from pre- and post-event SAR images of the 2018 Hokkaido Eastern Iburi earthquake in Japan. They used input images of 256 pixels by 256 pixels and showed that their approach reached a better performance compared to threshold-based SAR image analysis.
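How a transposed convolution restores the input size can be checked with the standard output-size formulas, sketched below for one spatial dimension. The kernel, stride and padding values are illustrative assumptions (they are not reported for the studies above), and dilation and output padding are assumed to be absent.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def transposed_conv_output_size(size, kernel, stride=1, padding=0):
    """Output length of a transposed convolution along one spatial dimension."""
    return (size - 1) * stride - 2 * padding + kernel


input_size = 256  # e.g. a 256-pixel-wide input tile
# A stride-2 convolution halves the feature map ...
downsampled = conv_output_size(input_size, kernel=3, stride=2, padding=1)
# ... and a stride-2 transposed convolution brings it back to the input size.
restored = transposed_conv_output_size(downsampled, kernel=4, stride=2, padding=1)
```

Matching the restored size to the input size is what gives each output position a one-to-one correspondence with an input pixel.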
To compare DL with conventional ML algorithms, Prakash et al. (2020) implemented deep learning semantic segmentation, OBIA and pixel-based algorithms for spatial mapping of hillslope landslides in the State of Oregon, USA. They used high resolution LiDAR-based DEM and Near-Infrared band of Sentinel-2 post-landslide data. The deep learning algorithm used was based on U-Net CNN with ResNet blocks, which was used for semantic segmentation and subsequent classification. Prakash et al. (2020) confirmed the observation by Ghorbanzadeh et al. (2019b) about different ML algorithms and showed that all the three methods were able to map the landslides in the testing area (with about 80% accuracy but lower recall scores), with the DL methods performing slightly better than the other two conventional methods. Other studies that used U-Net and ResNet architecture include Qi et al. (2020), and Liu et al. (2020a, b, c).
In regions that undergo land changes other than landslides, it is difficult to separate landslides from other land changes. To address this limitation inherent to conventional approaches for landslide detection, Fang et al. (2020) used a GAN-based Siamese framework (GSF) for landslide inventory mapping. The framework comprised two cascading modules, namely, a domain adaptation module based on conditional GANs and a landslide detection module based on a Siamese neural network. The domain adaptation module aims to make a cross-domain mapping between pre-landslide and post-landslide images with adversarial learning to generate a pre-landslide image as close as possible to the post-landslide image in terms of contextual image properties (lighting, atmospheric conditions, etc.) at the time of the post-landslide image. It was designed to retain only changes due to landslide activities in the generated image. The detection module aims to perform pixel-level landslide detection on the pairs of generated pre-landslide image and original post-landslide image with a Siamese neural network model. The Siamese network is used to generate an output image that reflects how similar the pair of input images is, thus identifying and detecting landslide regions. Table 5 summarises the main features of semantic segmentation studies for landslide detection.

Spatial forecasting
Literature studies addressing landslide spatial forecasting estimate where future landslides are likely to occur in a target region, without considering when or how frequently they will occur. In other words, data-driven methods including ML algorithms are frequently used to compute landslide susceptibility, i.e. the "likelihood of a landslide occurring in a given area" (Brabb, 1984), relying on two standard key assumptions, common to statistically-based and ML approaches developed for landslide susceptibility analysis and zoning (Varnes, 1984; Reichenbach et al. 2018; Lombardo et al. 2020): (i) future landslides are more likely to occur under conditions that led to slope instabilities in the past; (ii) conditions that are directly or indirectly linked to slope failures can be collected and used to build predictive models of landslide spatial occurrence. Unlike statistical analyses, ML algorithms are able to learn the association between landslide occurrences and landslide conditioning factors without necessarily assuming a structural model in the data. The learning aspect of these methods is to develop sequences of commands or algorithms that search, in a process of iterative and gradual refinement, for associations in the data that basic descriptive statistics and the human eye may not readily detect as such (Korup and Stolle 2014).
A very recent overview of the most popular machine learning techniques available for landslide susceptibility studies is presented by Mergadi et al. (2020), who also state that "only a handful of researchers use machine learning techniques in landslide susceptibility mapping studies." Indeed, they identify ten authors who are responsible for approximately 47% of published landslide susceptibility studies adopting neural networks, 70% of studies adopting random forest (RF) algorithms, 83% of studies adopting decision tree (DT) algorithms, and 86% of studies adopting support vector machines (SVM) algorithms. This finding prompted us to develop the literature review for this section by mainly discussing the most recent studies from these authors, within which a comparison among different ML techniques has been performed. Nevertheless, the authors are aware that other researchers have also been dealing with such issues, both in pioneering studies (Ermini et al. 2005) and in more recent times, for instance assessing the importance of the adopted variables and the appearance of the prediction map for gaining insights into model behavior (Goetz et al. 2015), evaluating the effects of spatial autocorrelation on hyperparameter tuning and performance estimation (Schratz et al. 2019), mixing training and testing set resolutions (Duric et al. 2019), exploring innovative ways of combining the results of different models (Di Napoli et al. 2020), proposing and object-based method outperforming traditional cell-based methods (Wang et al. 2021a), or combining ML algorithms with active learning strategies (Wang and Brenning, 2021). The results from these contributions will be properly considered, for their respective relevance, in the discussion session. Table 6 shows, for each article referenced, the list of ML algorithms adopted, and the location and area of the case studies. Mergadi et al. 
(2020) undertook an extensive analysis and comparison among many different ML techniques using a case study from Algeria covering an area of 2760 km². They summarize and discuss the algorithms' accuracies, advantages and limitations using a range of evaluation criteria. As main conclusions, they highlight that tree-based ensemble algorithms achieve excellent results compared to other machine learning algorithms and that the RF algorithm offers robust performance for accurate landslide susceptibility mapping with only a small number of adjustments required before training the model. Huang et al. (2020) compared a heuristic model and two statistical models with 4 ML models (i.e. MLP-NN, BPNN, SVM, DT; see Appendix 1 for meaning of acronyms) using data from a study area of 1581 km² in China. They observed that ML models have higher landslide susceptibility prediction performance than general statistical and heuristic models. The main objective of the study by Bui et al. (2020a, b) was to introduce a deep learning neural network model (DLNN) in landslide susceptibility assessments and to compare its predictive performance with four other widely used ML models. The efficiencies of the models were estimated for a case study in Vietnam covering an area of 6850 km². Results showed that the proposed DLNN model had a higher performance than the four benchmark models. Pham and Prakash (2019), Pourghasemi and Rahmati (2018) and Youssef et al. (2016) compared the capabilities of different ML methods for landslide prone zones in India, China, Iran and Saudi Arabia, respectively, covering areas from 270 to 2400 km².
Table 6 Case studies, recently published by some of the most prolific authors adopting ML for landslide susceptibility analyses, comparing the performance of different ML algorithms (check Appendix 1 for full names)
All the studies reported in Table 6 perform the susceptibility analyses adopting a pixel-based computational approach that can be considered "typical" of analogous studies described in the extensive literature dealing with statistically-based landslide susceptibility modeling (Reichenbach et al. 2018). They indeed discretize the study area into a regular grid whose resolution depends on the scale of the information available, i.e. a raster file in GIS environment, and they use a landslide inventory to relate a set of input conditioning factors (i.e. thematic maps) to a quantitative indicator of the model outcome (i.e. the landslide susceptibility map) (Fig. 5). The jargon may be different, as input and output variables are often called features and target, respectively, in ML applications, yet the underlying principles of these data-driven landslide susceptibility analyses remain the same. Table 7 reports the main information of the landslide susceptibility computational models adopted in the different studies, and in particular: (i) the pixel resolution of the raster maps, (ii) the number of conditioning factors used as input maps, (iii) the number and typology of landslides, (iv) the number of landslide and non-landslide cells used in the ML algorithm, (v) the percentages of training and testing data used in the ML algorithm, and (vi) the number of classes adopted in the final susceptibility map. These studies adopt a medium pixel resolution ranging from 20 m × 20 m to 30 m × 30 m. The effect of the scale adopted to consider the landslide conditioning factors, and the DEM-derived topographic variables in particular, is obvious in terms of resolution of the information provided, yet an increase in DEM resolution does not necessarily produce a corresponding improvement in the output of the landslide susceptibility analysis (Guzzetti et al. 1999). Indeed, Chang et al. 
(2019) stated that fine DEMs account for topography variations at the micro-scale that are not very much related to mesoscale processes like landslides, and that a 30 m resolution DEM is a good option because the minimum landslide size mapped from the satellite images is 0.1 hectare (1 hectare = 100 m × 100 m). At the same time, however, one must not neglect that raster files derived by downscaling DEMs originally available as high-resolution maps from LiDAR or UAV surveys can certainly increase the accuracy of the susceptibility models. The number of conditioning factors employed in the analyses is always significant, ranging from 9 to 18 in the seven considered studies. As commonly done for all the pixel-based GIS models aimed at deriving landslide susceptibility maps, they include: (i) DEM-derived topographic factors, such as elevation, slope, aspect, curvatures; (ii) geomorphological factors, such as distance to rivers, drainage density, stream power and topographic wetness indexes; (iii) geological factors, such as lithology, depth to bedrock or stratigraphy, distances to faults and to other geological boundaries; (iv) land and vegetation factors, such as land use, NDVI, solar radiation; and (v) other factors, related to natural or anthropogenic features, such as average rainfall and distance to road networks. Conditioning factors should be selected according to the considered landslide typologies. Indeed, any well-defined landslide susceptibility study should clearly focus on homogeneous landslides for which an inventory is available and for which a set of thematic information can be related to the triggering mechanisms of the considered landslides.
The main focus of the seven analyses reported is translational and rotational slide-type phenomena developing, depending on the characteristics of the study area, within different materials, ranging from clayey-silty soils to coarse-grained soils, like debris and boulders, to rocks. All the analyses are performed considering a random portion of the landslides reported in the inventory available, ranging from 70 to 75%, to train the ML model and the remaining landslides to test the model. Likewise, all the ML analyses consider landslide occurrences within any given cell of the study area as a binary dependent variable comprising only landslide (L cells) or non-landslide (NL cells) values, and employ an equal number of L-cells and NL-cells to run the ML model, both in the training and testing phases. To this aim, NL-cells are always selected randomly among the many cells comprising the space of the study area that is free of landslides. Some of the studies consider only one single cell per landslide to determine the L-cells, while others consider all the cells that are included in the landslide shapes at the considered map resolution. The latter at times increases the number of L-cells used in the analyses by almost one order of magnitude compared to the number of inventoried landslides. A discussion on the influence of different sampling strategies for predicting landslide susceptibility is reported by Dou et al. (2020).
Finally, the landslide susceptibility maps are always drawn by grouping a computed landslide susceptibility index in a relatively small number of classes, ranging from a minimum of 3 to a maximum of 6 classes, and assigning to each class a susceptibility indicator such as, for instance, "very low", "low", "moderate", "high" and "very high" susceptibility when the number of classes is equal to 5. Different authors have been employing different operational procedures to move from the construction of the spatial database needed to feed the ML algorithm, to the generation of the landslide susceptibility map, and to the performance evaluation of the computational model. Three main common phases of analysis may be recognized in each procedure: (i) a "factor analysis" for the selection and computation of the input and output variables of the ML model; (ii) a "model building" phase that includes the ML algorithm selection, calibration and application, up to the production of the landslide susceptibility map; and (iii) a "testing and validation" analysis to evaluate the model performance. The three phases (Fig. 6) depend on each other and are done sequentially, but they often comprise sub-phases and loops, especially when the procedure proposes to compare more than one ML algorithm to define the final landslide susceptibility map for the study area.

Factor analysis
This phase is needed to analyze the thematic information available in the case study area (landslide conditioning factors and landslide inventory) and to prepare a dataset that can be used to build an ML model. Very often, the procedures adopted in this phase, to produce the optimal set of input variables (features) that can be related to the output variable (target), are based on well-known statistical methods. For instance, Mergadi et al. (2020) include two steps in their factor analysis: construction of a spatial database from the landslide inventory map and landslide conditioning factors; optimization of the landslide conditioning factors, by means of variance inflation factors and information gain analyses. Similar procedures are proposed in other studies, for instance Huang et al. (2020) define the input-output variables adopting a frequency ratio bivariate statistical analysis from a set of conditioning factors in relation to landslide occurrences, and Chen et al. (2020) use a suite of statistical methods (i.e. normalized frequency ratio, variance inflation factors, and the chi-squared statistic) in their conditioning factor analysis.
Concerning the construction of the landslide event map, to be used as the dependent variable of the analysis, a binary raster variable is used in all the studies, comprising an equal number of L cells, defined starting from the landslide occurrences in the study area, and NL cells, identified based on a random selection of the landslide-free space.
The landslide conditioning factors to be used as input variables of the ML models may be obtained starting from different data sources, such as available thematic maps, field investigations, reports, and remote sensing images. These factors are always processed using a GIS tool and converted to grid cell values, when they are not already provided in that format, with the desired analysis resolution. Data types may be discrete or continuous. Before using them as input variables of the ML analysis, extra data preprocessing may be needed, such as numeric encoding of categorical variables or, most typically, grouping of the values of each continuous numerical factor in a finite number of classes. About the latter, Huang et al. (2020) state that the division of continuous conditioning factors will be rough if the attribute intervals are too few, while the modeling processes will be complex if they are too many. There is no standard for determining the optimal number of classes to employ for computing the threshold values for the subdivision, yet analysts of ML studies typically adopt guidelines and suggestions commonly used in landslide susceptibility assessments (e.g., Guzzetti et al. 1999). In the seven studies considered herein, the number of classes adopted by the different authors for the continuous numerical variables needing reclassification varies between 3 and 9, and the adopted reclassification methods are: natural breaks, geometric intervals, frequency analyses, and heuristic assessments. The selection of these intervals, which requires significant subjective judgement, may be a key contributor to the model results.
It is worth highlighting that some procedures that adopt statistical analyses, before moving to the model building phase, aim at defining an optimal set of input variables for the training and validation datasets. Bivariate statistical methods, such as frequency or information gain ratios, are often used to evaluate the relevance of each conditioning factor on the results of the analysis, i.e. their predictive ability, and to assign weight coefficients to each class of each variable. The latter quantify, numerically, the probabilistic relation among the variable and the occurrence of landslides. The identified relevant conditioning factors are not necessarily independent from each other and, therefore, preliminary statistical analyses on these variables are also typically performed for multicollinearity (where two variables in a multiple regression model are highly linearly related) evaluation (Dormann et al. 2013), for instance by means of tolerances or variance inflation factor methods. Finally, the input variables are often scaled in the range 0-1.
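As an illustration of the bivariate weighting and scaling steps described above, the following minimal sketch computes a frequency ratio for each class of one conditioning factor and rescales a continuous factor to the range 0-1. The function names and the toy data (ten cells, two slope classes) are hypothetical and are not taken from any of the reviewed studies; the frequency ratio is taken here as the share of landslide cells in a class divided by the share of all cells in that class.

```python
from collections import Counter

def frequency_ratio(factor_classes, landslide_flags):
    """Frequency ratio per class: (share of landslide cells falling in the
    class) divided by (share of all cells falling in the class)."""
    total_cells = len(factor_classes)
    total_slides = sum(landslide_flags)
    cells_per_class = Counter(factor_classes)
    slides_per_class = Counter(
        c for c, y in zip(factor_classes, landslide_flags) if y == 1
    )
    ratios = {}
    for cls, n_cells in cells_per_class.items():
        slide_share = slides_per_class.get(cls, 0) / total_slides
        area_share = n_cells / total_cells
        ratios[cls] = slide_share / area_share
    return ratios

def min_max_scale(values):
    """Rescale a list of numbers linearly to the range 0-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical toy raster: 5 "gentle" cells and 5 "steep" cells,
# with 1 landslide cell on gentle slopes and 3 on steep slopes.
slope_class = ["gentle"] * 5 + ["steep"] * 5
landslide = [1, 0, 0, 0, 0, 1, 1, 1, 0, 0]

fr = frequency_ratio(slope_class, landslide)
scaled_elevation = min_max_scale([10.0, 20.0, 30.0])
```

A ratio above 1 indicates a class in which landslides are over-represented relative to its areal extent; in this toy case the steep class has a ratio of 1.5 and the gentle class 0.5.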

Model building
As a first step of this phase, training and testing datasets must be defined, respectively to build the ML model and to confirm its accuracy in the subsequent phase. To assess a model's predictive ability, after its definition and training, an independent dataset must be used for testing. Within standard multivariate statistical analyses, several procedures exist for testing landslide prediction models (Baeza and Corominas 2001): (i) selection of a random sample to build the model and use of the remaining population to verify it; (ii) derivation of models from different random sample sizes and checking whether the function coefficients change significantly; (iii) preparation of the model from a distribution of landslides, which occurred during a specific event, and checking it with landslides triggered by a subsequent event; and (iv) development of the model in a training area, and testing it in a target area with similar characteristics. The first-mentioned procedure is widely adopted by landslide susceptibility ML studies, which use the majority of the inventoried landslides in the study area, typically more than 70%, as a training dataset, and the remaining ones as a testing dataset, so that enough samples remain that were not used during training and can be used to test the accuracy of the model. Such a separation ratio has been theoretically justified (Gholamy et al. 2018). However, a higher percentage for testing can also be used if the amount of raw data is large enough. As already explained, to avoid creating imbalanced datasets between L and NL grid cells, often an equal number of NL locations is randomly sampled from the landslide-free space, both during the training and testing phases. This practice, however, may create other (unwanted) biases. The second-mentioned procedure is also at times adopted. 
Depending on the objective of the study and availability of data, resampling strategies can indeed be nested on top of each other (Molinaro et al. 2005). To this aim, the cross-validation (CV) resampling procedure, which is based on a single parameter k that refers to the number of groups that a given data sample is to be split into, has recently emerged as a popular method in landslide susceptibility ML models. It is indeed considered a trade-off solution between speed, accuracy and computational costs (Mergadi et al. 2020).
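The sampling and resampling steps described above can be sketched as follows. This is only an illustrative outline under stated assumptions: the cell identifiers, the 70% training fraction, the seed and the helper names `split_and_balance` and `k_fold_indices` are hypothetical, and simple random (non-spatial) partitioning is used even though, as noted by Schratz et al. (2019), spatial autocorrelation can affect performance estimates.

```python
import random

def split_and_balance(landslide_cells, free_cells, train_fraction=0.7, seed=0):
    """Split landslide (L) cells into training and testing parts and pair
    each part with an equal number of randomly drawn non-landslide (NL) cells."""
    rng = random.Random(seed)
    l_cells = list(landslide_cells)
    rng.shuffle(l_cells)
    n_train = round(train_fraction * len(l_cells))
    # Draw as many distinct NL cells as there are L cells.
    nl_cells = rng.sample(list(free_cells), len(l_cells))
    train = [(c, 1) for c in l_cells[:n_train]] + [(c, 0) for c in nl_cells[:n_train]]
    test = [(c, 1) for c in l_cells[n_train:]] + [(c, 0) for c in nl_cells[n_train:]]
    return train, test

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for k-fold cross-validation."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield training, validation

# Hypothetical cell identifiers: 50 landslide cells, 1000 landslide-free cells.
train, test = split_and_balance(range(50), range(1000, 2000))
folds = list(k_fold_indices(len(train), k=5))
```

With k = 5, each sample of the training set is used exactly once for validation and four times for fitting, which is the trade-off between cost and reliability that makes the procedure attractive.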
To date, there is no consensus on a specific "optimal" ML algorithm for predicting landslide susceptibility at territorial scale, also because the performance and the predictive ability of ML models rely not only on the fundamental quality of the algorithms but also on details of their tuning, as well as on the quality of the landslide inventory and conditioning factors employed within the study area. Therefore, most of the landslide susceptibility studies published in the literature use and compare the performance of multiple ML algorithms in the same study area, thus using the same target variable derived from a given landslide inventory, and a common set of features, derived from a suite of independent and relevant conditioning factors.
Fig. 7 Number of published journal articles dealing with ML studies for landslide susceptibility modeling (source: Scopus database, accessed 16/11/2020)
The number of scientific studies published in recent years on landslide susceptibility assessment adopting ML algorithms is very high, and it is growing extremely fast. A simple search performed in the Scopus database (on November 16, 2020), using the keywords "landslide" and "machine learning" and limited to journal articles, produced 286 entries, of which 186 (about 65%) deal with ML algorithms applied to landslide susceptibility modeling. The yearly distribution of these journal articles (Fig. 7) clearly shows that the topic has drawn growing attention in the past few years. The ML algorithms and procedures used in these studies are very heterogeneous, and tens of different algorithms are employed for the same purpose. The fact that no articles are shown before 2011 is most likely because the expression "machine learning" started to be widely used only a few years ago to collectively identify a set of computer-based algorithms employed to find a relationship between landslide susceptibility conditioning factors, i.e. a set of features, and the presence of landslides, i.e. a single variable expressed as a dichotomous output target. Indeed, in the previous decades, starting already in the mid-1970s (Neuland 1976; Carrara 1983), the same aim has been pursued by means of heuristic or statistical analyses, among which methods like logistic regression and artificial neural networks were also included. This is confirmed by Mergadi et al. (2020), who state that LR and ANN algorithms were the earliest ML methods applied to landslide susceptibility modeling, with a total article count of 1587 and 746 since 2000, respectively. The same authors also state that the most popular methods nowadays are SVM, DT and RF algorithms, with a total article count of 342, 247 and 179, respectively, since 2010.
Overall, the seven studies presented in Table 6 employ 23 different ML models to produce the landslide susceptibility maps. In the seven study areas, from a minimum of 3 to a maximum of 10 (Mergadi et al. 2020) algorithms were compared, and at times also compared with other heuristic and statistical models. The ML algorithms adopted in more than one of these studies are: SVM (5 times); RF (4 times); DT and NB (3 times); ANN, BRT, CART, GLM, and MLP-NN (2 times). Youssef et al. (2016) and Pourghasemi and Rahmati (2018) are among the first authors in the literature to present a comprehensive comparison of the performance of many different ML techniques for landslide susceptibility modeling, respectively, 4 and 10 in the two studies. Pham and Prakash (2019) compared a hybrid ensemble approach with three single prediction models. Huang et al. (2020) chose 5 ML algorithms to compare among the ones most widely used in landslide susceptibility studies. On the other hand, Chen et al. (2019) focused their comparison on NB and two other methods (KLR, RBFN) that have seldom been explored for landslide susceptibility modeling. Bui et al. (2020a, b) introduced a new deep learning neural network algorithm (DLNN) and compared its predictive performance with four other state-of-the-art ML models (RF, SVM, DT, MLP-NN). Mergadi et al. (2020) highlighted the importance of configuring and training the different ML algorithms one wants to compare for a given case study, using common hyper-parameter tuning strategies for the ML algorithms they compare.
The final step of the model building phase is the production of the landslide susceptibility maps, one for each algorithm adopted. As already mentioned, after a landslide susceptibility index is computed for each pixel of the study area, the final map is usually drawn considering a relatively small number of classes to which susceptibility indicators are attributed. The number of classes employed in the seven studies presented in Table 7 ranges from 3 (Pham and Prakash, 2019) to 6 (Bui et al. 2020a, b). Bui et al. (2020a, b) acknowledged that the most common classification scheme in landslide susceptibility assessments uses a five-level scale, including the "very low", "low", "moderate", "high" and "very high" susceptibility indicators (Fell et al. 2008). At the same time, however, they introduced an extra "no susceptibility" class in their study, given that a very large portion of the study area had an extremely low value of the computed landslide susceptibility index.
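The grouping of a continuous susceptibility index into a five-level scale can be sketched as below. The equal-interval break values (0.2, 0.4, 0.6, 0.8) and the assumption that the index lies between 0 and 1 are illustrative choices only; the reviewed studies adopt different break-selection methods, and the function name `classify_lsi` is hypothetical.

```python
from bisect import bisect_right

# Five-level scale quoted in the text (Fell et al. 2008).
LABELS = ["very low", "low", "moderate", "high", "very high"]

# Assumed equal-interval breaks for an index between 0 and 1.
BREAKS = [0.2, 0.4, 0.6, 0.8]

def classify_lsi(lsi, breaks=BREAKS, labels=LABELS):
    """Return the susceptibility label of one landslide susceptibility index.
    A value equal to a break is assigned to the upper class."""
    return labels[bisect_right(breaks, lsi)]

# Hypothetical index values for five pixels.
pixel_lsi = [0.05, 0.20, 0.50, 0.79, 0.95]
pixel_classes = [classify_lsi(v) for v in pixel_lsi]
```

Changing `BREAKS` (for example to quantile-based values) changes the areal extent of each class, which is one reason why the choice of classification scheme affects how the final map is read.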

Testing and validation
Performance assessment for landslide susceptibility computational models can be conducted at two different levels (Table 8): (1) evaluating the quality of the classification problem with the binary model outcome of presence or absence of landslides; and (2) assessing the final landslide susceptibility map, i.e. validating the area covered by each susceptibility class against the landslide density distribution of the adopted landslide inventory map. In relation to the first level of testing, the common performance metrics that are typically adopted in the literature include:
• Various metrics derived from a confusion matrix representation of the results (CM), including overall accuracy (Acc), specificity (Sp), sensitivity (Se), F-score (F) and others;
• The area under the ROC curve (AUC), computed as the integral over the graph that results from computing false positive rate and true positive rate for many different thresholds;
• Expressions quantifying the error of the analysis by means of an objective function (OF), like the mean absolute error (MAE) and the root mean square error (RMSE);
• The Cohen's kappa index (kappa), expressing the proportion of observed agreement beyond that expected by chance;
• Reliability diagrams (RD) and distributions of the computed landslide susceptibility indexes (LSI).
In addition to these metrics, when multiple ML algorithms are compared for a single study area, like for the case studies reported in Table 6, null-hypothesis testing (NH), such as the Wilcoxon signed-rank (WT) or the chi-square (χ²) tests, can also be conducted to assess the statistical significance of the differences between the model outcomes.
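A minimal sketch of the first-level metrics listed above is given here for a binary landslide/non-landslide outcome. The labels and scores are hypothetical toy values, the 0.5 decision threshold is an assumption, and the AUC is computed with the rank-based (Mann-Whitney) formulation, which is equivalent to integrating the ROC curve; degenerate cases such as empty classes are not handled.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 1 and 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(y_true, scores, threshold=0.5):
    """Accuracy, sensitivity, specificity, F-score and Cohen's kappa."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    # Agreement expected by chance, from the marginal totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "f_score": f_score, "kappa": kappa}

def auc(y_true, scores):
    """Probability that a random landslide cell scores higher than a random
    non-landslide cell (ties count one half); equals the area under the ROC."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical test set: 4 landslide and 4 non-landslide cells.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
metrics = classification_metrics(y_true, scores)
area = auc(y_true, scores)
```

Unlike the confusion-matrix metrics, the AUC does not depend on the chosen threshold, which is why the two groups of metrics are usually reported together.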
The second level of assessment for landslide susceptibility modeling is based on the assumption that a model is accurate when the landslide density ratio increases moving from low susceptibility classes to high susceptibility classes, and when the high susceptibility classes cover small extent areas (Pradhan and Lee 2010). To this aim, a necessary step is the reclassification of the landslide occurrence scores computed by the ML algorithms into a given number of classes expressing a susceptibility level, by means of an indicator, within the landslide susceptibility map. The areal extent of each susceptibility class can then be validated against the landslide density distribution from the landslide inventory map, by means of what is sometimes called a sufficiency analysis (SA). In addition to this qualitative evaluation of the output map, success and prediction rate curves (SPR) can also be drawn and the corresponding AUC computed.
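The check underlying this second level of assessment can be sketched as follows, assuming the area of each susceptibility class and the landslide area falling within it are already known. The numbers are invented for illustration, and the density ratio is taken here as the share of landslide area in a class divided by the share of total area covered by that class; the function names are hypothetical.

```python
CLASS_ORDER = ["very low", "low", "moderate", "high", "very high"]

def landslide_density_ratio(class_area, class_landslide_area):
    """Density ratio per class: share of landslide area in the class divided
    by the share of the study area covered by the class."""
    total_area = sum(class_area.values())
    total_landslide = sum(class_landslide_area.values())
    return {
        c: (class_landslide_area[c] / total_landslide) / (class_area[c] / total_area)
        for c in class_area
    }

def increases_with_susceptibility(ratios, order=CLASS_ORDER):
    """True if the density ratio grows strictly from the lowest to the highest class."""
    values = [ratios[c] for c in order]
    return all(a < b for a, b in zip(values, values[1:]))

# Hypothetical areas (e.g. in km²) per susceptibility class.
area = {"very low": 400, "low": 300, "moderate": 150, "high": 100, "very high": 50}
landslide_area = {"very low": 2, "low": 6, "moderate": 12, "high": 30, "very high": 50}

ratios = landslide_density_ratio(area, landslide_area)
map_is_consistent = increases_with_susceptibility(ratios)
```

In this toy case the "very high" class covers 5% of the area but contains half of the landslide area, which is the pattern the assumption above regards as an accurate map.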

Temporal forecasting
Temporal predictions of landslides, and more generally forecasts of the time evolution of key factors affecting the slope safety level, can be performed at global/regional scale or at slope-scale. The choice of the scale is usually linked to the choice of the monitored parameters, which is in turn related to the type of landslides. Typically, regional scale predictions are accomplished by using rainfall monitoring, geomorphological, and hydrometeorological approaches, while slope-scale predictions take advantage of a geotechnical engineering method relating displacement or other monitoring data to the time of failure (Intrieri et al. 2019). There is a relationship between monitoring parameters and types of landslide; for example, for shallow landslides that are triggered by extreme precipitation events, or by a combination of hydro-meteorological events, meteorological data dominates monitoring parameters. For slow moving deep-seated landslides, displacement monitoring can be a crucial input to assess slope behavior. New data assembling methods and Internet of Things (IoT) techniques have recently started to provide large datasets of monitoring data for landslide temporal forecasting using ML techniques. In this section, we review and discuss the main characteristics of available published studies (not very numerous up to the year 2020) that apply ML in landslide temporal forecasting.

Landslide displacement prediction at slope scale
Landslide displacement forecasting is considered an essential component for developing modern early warning systems. It can be used to set warning thresholds and to recognize when a landslide undergoes a sudden acceleration, which may lead to failure. Time series of real-time data collected from landslide monitoring systems, e.g. geophones, Interferometric Synthetic Aperture Radar (InSAR), and Global Navigation Satellite System (GNSS), along with triggering data, e.g. water level and precipitation, provide critical inputs to ML modelling in this domain. However, the prediction of landslide displacement that changes over time is very challenging and it is inevitably linked to complex deformation mechanisms in the slope. Applications of conventional ML methods, e.g. SVM and ANN, in landslide displacement forecasting are reported in Mayoraz et al. (1996), Mayoraz & Vulliet (2002), Ran et al. (2010), Zhu & Hu (2012), and Du et al. (2013). By using ANN, Mayoraz et al. (1996) and Mayoraz & Vulliet (2002) predicted the velocity changes in a sliding soil mass based on meteorological and physical data and different neural network configurations. It must be noted that the future landslide velocity, rather than landslide displacement, was the predicted parameter, and that the input parameters in the multilayer perceptron neural network (MLP-NN) included daily precipitation, evaporation and pore water pressure. They showed that it is possible to obtain a reasonably good short-term (up to a few days) prediction of landslide movements using a considerable number of continuous measurements. However, Mayoraz et al. (1996) concluded that the MLP model yielded less precise predictions on the test set than on the training set, which is a sign of overfitting. 
In recent years, advances in DL algorithms and hybrid algorithms that combine different ML techniques, mainly performed on active landslides in China on the slopes of the Three Gorges Reservoir Area (TGRA, see Table 9), have shown promising results in the modelling and prediction of landslide deformations. For time series problems, advanced neural networks are generally considered as the most promising solutions since well-designed network structures could help to handle sequence dependence in the time series data (van Natijne et al. 2020).
In general, landslide displacement predictions include the following steps: (i) decomposition of the accumulated displacement, (ii) selection of conditioning factors, (iii) establishment of predicting models, and (iv) evaluation of prediction results. Wang (2003) and Du et al. (2013) proposed that the accumulated displacement (D) time series could be decomposed into three components: a trend, a periodic, and a stochastic component, i.e.
$D = T + P + S$ (1)
The long-term displacement, controlled by "internal" geological conditions such as lithology, geological structure and progressive weathering, is typically assumed as the driver for the trend component (T). The short-term displacement, in this framework called the periodic component (P), is assumed to be influenced by "external" factors such as rainfall. The stochastic term (S) is the displacement response caused by a sudden change in the system, e.g., a rise or drop of the reservoir level (for TGRA) affecting the landslide hydraulic boundary conditions. In most studies on landslide displacement in TGRA, the periodic and stochastic terms were not separated, or the stochastic term was completely ignored. The periodic term of displacement was believed to be caused by periodic reservoir water level fluctuations and rainfall. ML algorithms have been applied, in the literature, to predict the periodic term in the displacement time series that expresses the relationship between landslide displacement and its conditioning factors, e.g. precipitation and/or dam reservoir level.
The most recent studies on this topic are summarized in Table 9. ML algorithms have proven to be quite successful for forecasting the periodic component of landslide displacements obtained after removing the trend term from the accumulated displacement. Various ML algorithms have been tested for the prediction of periodic landslide displacement. However, in most of these studies, only one landslide case was used to verify the applicability and superiority of their proposed algorithm, which therefore may not perform well on other landslides. In some of the studies, e.g. Ma et al. (2020), Xie et al. (2019) and Krkač et al. (2017), only one ML algorithm was used for the landslide displacement prediction.
Table 9 Recent case studies adopting ML for landslide displacement forecasting
Commonly used controlling factors in the studies in TGRA include antecedent rainfall and reservoir water level over time and evolution state (e.g. Du et al. 2013; Yang et al. 2019; Zhou et al. 2018a) measured over 1 to 3 months before the event date. Not all controlling factors that may be related to landslide deformation can be used as input variables for landslide displacement prediction in the ML models, because the ones having a low correlation with landslide deformation make the ML models complex and may reduce prediction accuracy. The controlling factors that have a strong correlation with the periodic displacement are typically selected by conducting correlation analyses, e.g. gray relational analysis (Deng 1989), and maximum information coefficient (Reshef et al. 2011). The Baishuihe landslide, on the shores of TGRA, offers some possibilities for comparison, as multiple methods have been tested on this landslide by various authors. The Baishuihe landslide is a retrogressive landslide, where deformations occurred first at the bottom of the slope and retrogressed upwards (Du et al. 2013). The landslide reactivates frequently and has had several intense deformation periods since 2003. As indicated in Table 10, DL (e.g. DBN, LSTM) or hybrid ML methods show excellent prediction performance. However, the influence of the reservoir water level on the landslide stability, which is common to TGRA and not often present elsewhere, cannot be neglected and conclusions are therefore not easily transferable to other landslides.
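The decomposition of accumulated displacement into a trend and a periodic term, with the stochastic term ignored as in most of the TGRA studies, can be sketched as below. The trailing moving average used to extract the trend, the 12-step window and the synthetic monthly series are assumptions made for illustration only; the reviewed studies do not all use this trend estimator, and the function names are hypothetical.

```python
import math

def moving_average_trend(displacement, window=12):
    """Trend estimate: mean of the current and up to (window - 1) preceding
    values (a trailing moving average; assumed estimator)."""
    trend = []
    for i in range(len(displacement)):
        start = max(0, i - window + 1)
        chunk = displacement[start:i + 1]
        trend.append(sum(chunk) / len(chunk))
    return trend

def decompose(displacement, window=12):
    """Split accumulated displacement D into trend and periodic terms,
    so that D = trend + periodic (stochastic term ignored)."""
    trend = moving_average_trend(displacement, window)
    periodic = [d - t for d, t in zip(displacement, trend)]
    return trend, periodic

# Synthetic monthly series (mm): steady creep plus a yearly cycle.
months = range(36)
accumulated = [2.0 * m + 5.0 * math.sin(2 * math.pi * m / 12) for m in months]

trend, periodic = decompose(accumulated, window=12)
```

In such a workflow the periodic series, rather than the raw displacement, would be the target that an ML model is trained to predict from lagged rainfall and reservoir-level inputs, and the forecast displacement is then obtained by adding the predicted periodic term back to the extrapolated trend.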

Rainfall-induced landslides
For rainfall-induced landslides, a threshold defines the rainfall conditions that, when reached or exceeded, are likely to trigger a landslide. During the last decades, landslide rainfall thresholds have been mainly determined empirically or by adopting statistical methods (Segoni et al. 2018). ML methods have recently been explored to this aim. As an example, the conventional ML algorithm SVM has been used to determine rainfall thresholds by various authors (Vallet et al. 2013; Rachel and Lakshmi, 2016; Omadlao et al. 2019). At a nationwide level in Japan, Osanai et al. (2010) developed a new early-warning system for debris flow and slope-failure disasters. They used the rainfall indices of 60-min cumulative rainfall and calculated a soil-water index to set up a critical line (CL) employing a Radial Basis Function Network (RBFN). Osanai et al. (2010) state that the result of the system operation in 2009 proved its effectiveness in predicting rainfall-induced landslides. As no other references were found in the literature, we do not know if the identified thresholds have been subsequently validated. ML methods have also been used to explore the relationship between amount of precipitation and groundwater level, a condition that is more closely linked to the pore pressure increase and shear strength reduction within the slope that leads to an instability, especially for deep-seated landslides. Yoon et al. (2011) developed two nonlinear time-series models using ANN and SVM techniques to predict groundwater level fluctuations based on data for the groundwater level, precipitation, and the tide level. Krkač et al. (2017) predicted the fluctuation of the groundwater level for the Kostanjek landslide using the RF method. Huang et al. (2017) proposed a PSO-SVM model based on chaos theory to predict the daily groundwater levels of the Huayuan landslide and the weekly and monthly groundwater levels in Baijiabao in the TGRA of China. Wei et al. 
(2019) studied two different ML methods, i.e., the genetic algorithm back-propagation neural network (GA-BPNN) method and the genetic algorithm SVM (GA-SVM) method, for predicting the ground water level fluctuation of the Duxiantou landslide located in Zhejiang Province, China.

Dynamic susceptibility mapping
Landslide susceptibility mapping using ML methods has been intensively investigated by different researchers, as already mentioned. However, such studies do not intend to predict the time of occurrence of the landslides. Recently, interest in dynamic susceptibility mapping, or spatio-temporal landslide probability assessment (e.g., Lombardo et al. 2020; Wang et al. 2022), has increased. Several works have explored approaches for spatiotemporal landslide forecasting using conventional ML methods, e.g. SVM (Farahmand & AghaKouchak 2013; Rachel and Lakshmi 2016; Omadlao et al. 2019), ANN, and Decision Tree (Kirschbaum et al. 2015; Kirschbaum and Stanley 2018).
A few recent studies utilizing (hybrid) ML algorithms for dynamic susceptibility mapping are summarized in Table 11, showing the ML algorithms adopted, and the location and time period of the case studies. Stanley et al. (2020) identified where and when landslides were most probable, across relatively large ecoregions over the years 1976-2016, using an XGBoost model. The XGBoost method proved effective for incorporating rainfall intensity, atmospheric rivers, antecedent soil moisture, and melting snow from land data assimilation systems into a unified indicator of rainfall-triggered landslide hazard. Lee et al. (2021) proposed an MLP-NN enhanced with a Gumbel distribution approach to assess the temporal probability of future landslide occurrence using the limited rainfall records and landslide inventory in a study area in Jinbu, Korea. MLP-NN was used in static landslide susceptibility analysis with the balanced pixel data. An ROC graph and the associated AUC were used to verify the accuracy of the susceptibility map by comparing actual and estimated results. Finally, the temporal probability of landslide occurrence, evaluated using the Gumbel model with a 72-h antecedent rainfall threshold, was combined with the spatial probability of landslides to determine landslide hazard. Utomo et al. (2019) proposed a hybrid model based on a physically-based stability method and ADASYN (Adaptive Synthetic Sampling)-BPNN (Backpropagation Neural Network) to design an accurate early warning system. The proposed method had higher accuracy than BPNN and ADASYN-BPNN without physically-based stability analyses, but required more computational time and resources. Lombardo et al. (2020) proposed a novel Bayesian modelling framework for the spatiotemporal prediction of landslides. 
The spatial predictive performance of Bayesian models was quantified using a tenfold cross-validation procedure, and the temporal predictive performance using a leave-one-out cross-validation procedure. Wang et al. (2022) established a space-time susceptibility model for hydromorphological (HMP) processes covering the Chinese territory from 1985 to 2015. The space-time model was built on the basis of a binomial Generalized Linear Model (GLM), producing the mean, maximum and 95% confidence interval of the spatio-temporal susceptibility distribution per catchment, per year.
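The idea of combining a Gumbel-based temporal probability with a spatial probability, as described above for Lee et al. (2021), can be sketched as follows. This is only an illustration under stated assumptions and not the procedure of that study: the Gumbel parameters are estimated by the method of moments, hazard is taken as the simple product of the two probabilities, and the rainfall record, the threshold and the susceptibility value are hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel_moments(annual_maxima):
    """Method-of-moments estimate of Gumbel location and scale."""
    mean = statistics.fmean(annual_maxima)
    std = statistics.stdev(annual_maxima)
    scale = std * math.sqrt(6.0) / math.pi
    loc = mean - EULER_GAMMA * scale
    return loc, scale


def annual_exceedance_probability(threshold, loc, scale):
    """Probability that the annual maximum exceeds the threshold (1 - Gumbel CDF)."""
    return 1.0 - math.exp(-math.exp(-(threshold - loc) / scale))


def hazard(spatial_probability, temporal_probability):
    """Assumed combination rule: product of spatial and temporal probability."""
    return spatial_probability * temporal_probability


# Hypothetical annual maxima of 72-h antecedent rainfall (mm)
annual_max_72h = [180.0, 220.0, 150.0, 310.0, 260.0, 200.0, 240.0, 170.0]
loc, scale = fit_gumbel_moments(annual_max_72h)

p_temporal = annual_exceedance_probability(250.0, loc, scale)  # hypothetical threshold
p_hazard = hazard(0.8, p_temporal)  # 0.8 = hypothetical susceptibility of one pixel
```

A longer rainfall record or a different fitting method (e.g. maximum likelihood) would change the estimated parameters, so the numbers are meaningful only as a demonstration of the workflow.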

Objective of landslide studies using ML
ML algorithms aim primarily at making accurate predictions, while explanation can be regarded as a secondary objective. Taking this into account, applications of ML methods in landslide studies should mainly focus on problems where the need for prediction prevails over explanation and understanding. This is the case when a sufficient quantity of data exists and time is the key deciding factor, e.g. time to occurrence of an event or time for developing and conducting a study. An example of the former is Landslide Early Warning Systems, where it is crucial to make a decision in a limited time based on streams of monitoring data. An example of the latter is landslide detection, in which collecting detailed field data requires many days and sufficient manpower (Mondini et al. 2011). When the objective of the landslide study is a deep understanding of processes, we do not see the usefulness of a direct application of ML. However, also in these cases, features detected by ML, for instance those related to the importance of conditioning factors in landslide spatial prediction studies, may help in understanding landslide processes. In terms of future scenarios, we argue that ML methods are useful when interpolation is the main purpose, meaning that the machine has already learned from a broad spectrum of data and the new occasion falls within the available data space (similar statistical distribution). If the new occasion falls outside the available data space, i.e., an extrapolation problem, ML methods may not perform well.
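A crude way to flag a possible extrapolation problem is to check whether a new sample lies within the per-feature range of the training data. The sketch below illustrates this; the feature names and values are hypothetical, and a range check is only a rough proxy, since a sample can lie inside every range and still come from a different statistical distribution.

```python
def feature_ranges(training_samples):
    """Per-feature (min, max) observed in the training data."""
    return [(min(col), max(col)) for col in zip(*training_samples)]


def features_outside_range(sample, ranges):
    """Indices of features whose value lies outside the training range."""
    return [
        i
        for i, (value, (lo, hi)) in enumerate(zip(sample, ranges))
        if not lo <= value <= hi
    ]


# Hypothetical training data: (slope angle in degrees, 24-h rainfall in mm)
training = [(12.0, 20.0), (25.0, 55.0), (31.0, 80.0), (18.0, 35.0)]
ranges = feature_ranges(training)

inside = features_outside_range((20.0, 60.0), ranges)    # within both ranges
outside = features_outside_range((22.0, 140.0), ranges)  # rainfall never seen before
```

A non-empty result suggests that predictions for that sample should be treated with caution.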

ML and DL algorithms
There is no consensus on an "optimal" ML/DL algorithm for landslide studies, even when looking at the results of the most recent comparative studies in landslide detection or spatial and temporal forecasting. Indeed, there is a growing tendency in the literature to propose the systematic use of an ensemble of algorithms for the same study area, not only native ensemble ML algorithms such as RF but rather various different ML algorithms, and then to choose the best-performing one. As indicated by Ghorbanzadeh et al. (2019b) and Prakash et al. (2020) in landslide detection studies, comparisons between conventional ML algorithms and DL methods reveal that algorithmic choice faces the so-called No Free Lunch theorem, which implies that there is no single "best" algorithm to look for because, on average, all algorithms will perform about the same (Wolpert 1996). The choice between adopting conventional ML or DL algorithms primarily depends on the type and quantity of available data. In general, DL algorithms are not expected to outperform conventional ML if the size of the training data is not very large. For instance, for landslide spatial prediction studies, the amount of past information on known locations of landslides is typically very low compared to the extent of the landslide susceptibility study areas. The number of features and attributes, and the preference regarding feature engineering, also affect this choice. We suggest that for structured data, conventional ML algorithms are to be preferred, whereas for unstructured data (e.g. text, video, imagery, etc.), where feature engineering can be a daunting task, DL algorithms can be more suitable.

Availability of ML/DL libraries
There is no consensus on what methods can properly be called ML algorithms, and some well-known inferential statistics methods, like various types of logistic regressions or discriminant analyses, are often referred to as ML algorithms. When ML/DL algorithms are used in applied science and engineering, including the landslide community, there is an overall tendency to use off-the-shelf algorithms that are already implemented in free libraries. Python libraries such as Scikit-learn for conventional ML and TensorFlow, Keras and PyTorch for DL algorithms are among them. A possible drawback is that such a tendency can lead, in the long run, to ML illiteracy of the landslide community because there is no effort in implementing and deeply understanding the algorithms, which can also result in misusing them or leaning towards trial-and-error. An example that supports this claim is related to the hyperparameters of ML algorithms. In most of the studies reviewed for this paper, authors either used the default values of hyperparameters or chose them through trial and error. It can also be seen that in many DL-based landslide studies, the architecture of the DL framework is not properly explained and no effort is made to deeply understand why certain architectures work better than others. Another possible drawback of leaning on these implementations is that researchers will have to wait quite some time before the emergence of new promising algorithms well suited for landslide studies.
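A systematic alternative to default values or trial and error is to score each candidate hyperparameter value by cross-validation and keep the best one. The following minimal sketch does this for the number of neighbours of a k-nearest-neighbour classifier, written from scratch so that every step is visible. The synthetic two-class data and all names are hypothetical; in practice, library routines for grid search would normally be used.

```python
import random


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    ranked = sorted(
        zip(train_X, train_y),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)),
    )
    votes = sum(label for _, label in ranked[:k])
    return 1 if 2 * votes > k else 0


def cv_accuracy(X, y, k, n_folds=5):
    """Accuracy pooled over n_folds interleaved folds."""
    correct = 0
    for fold in range(n_folds):
        held_out = set(range(fold, len(X), n_folds))
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        correct += sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in held_out)
    return correct / len(X)


def grid_search(X, y, candidate_ks, n_folds=5):
    """Score every candidate k by cross-validation and return the best one."""
    scores = {k: cv_accuracy(X, y, k, n_folds) for k in candidate_ks}
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Hypothetical data: 30 "non-landslide" and 30 "landslide" samples, two features each
rng = random.Random(0)
X = [[rng.gauss(c, 1.0), rng.gauss(c, 1.0)] for c in (0.0, 2.0) for _ in range(30)]
y = [0] * 30 + [1] * 30

best_k, scores = grid_search(X, y, [1, 3, 5, 7, 9])
```

Reporting the full `scores` table alongside the selected value makes the choice of hyperparameters reproducible.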

Data availability
Data-driven methods, such as ML algorithms, are not useful if the necessary data is not available. In fields such as landslide detection and landslide susceptibility mapping, where publicly available satellite images at various resolutions exist, data availability can be less problematic. However, in temporal forecasting employing monitoring data (e.g. ground-based sensors, InSAR data), good-quality data are not freely available, and this condition surely limits the application of ML algorithms. It may be expected, however, that in the near future this limitation may be overcome by the growing availability of remote sensing data and the growing competition within the remote sensing community. Datasets dedicated to ML landslide studies can thus produce a significant shift in the current way of forecasting landslide displacements. Examples of datasets already available in the ML domain can be found at: https://www.paperswithcode.com/datasets.

Code availability
The majority of the works reviewed in this study did not make the utilized computer scripts available. Within the fast-growing ML community (see https://paperswithcode.com/), availability of the script and the data used are important criteria for assessing the credibility of a study. It can be argued that such intellectual opacity in the landslide ML literature will hamper the utility of these studies because researchers, even assuming that they have access to the original data, in a majority of cases cannot reproduce them.

Pre-trained models
In the Computer Vision community, algorithms pre-trained on large datasets exist. When it comes to applying these algorithms to a similar problem (e.g. image classification), instead of training the original algorithm from scratch, ML engineers use these pre-trained algorithms to save time and to reduce the need for more data. Such ideas can be used in landslide detection and landslide susceptibility studies, also to reach higher accuracies over time.

Physically-based methods versus ML
Compared to ML-based models, physically-based models require less data for calibration, as they are fully or partly based on well-established laws of physics. The two classes of methods are typically seen as alternatives to each other, and data-driven models, including ML algorithms, are often called upon only when the use of physically-based models is deemed unfeasible or cost prohibitive. In fact, in landslide studies we may state that ML algorithms are currently being adopted as tools for all those data-driven analyses that, in the past, would have seen researchers use, for the same purposes, statistical techniques. However, physically-based methods can help ML in various ways: (1) make ML models more explainable, (2) decrease the volume of data that is needed to train ML algorithms, (3) produce synthetic data for data-scarce problems (e.g. Jamalinia et al. 2021). The integration of ML methods in physically-based models is also a path that is currently being explored by researchers in engineering and science. Examples of this approach can be found, for instance, in the computational fluid dynamics community, where ANN has been used for solving partial differential equations used to simulate fluid dynamics problems (e.g., Kutz 2017;Schenck & Fox 2018;Clark Di Leoni et al. 2020).
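Point (3) above, the use of a physically-based model to produce synthetic training data, can be illustrated with a minimal sketch. It assumes the classical infinite-slope model for a dry slope (no pore pressure) as the physically-based component, which is a simplification chosen here for brevity and not the model used in the cited work; the parameter ranges are hypothetical.

```python
import math
import random


def infinite_slope_fs(cohesion_kpa, friction_deg, slope_deg, depth_m, unit_weight=19.0):
    """Factor of safety of a dry infinite slope (assumed model, no pore pressure)."""
    beta = math.radians(slope_deg)
    phi = math.radians(friction_deg)
    driving = unit_weight * depth_m * math.sin(beta) * math.cos(beta)
    resisting = cohesion_kpa + unit_weight * depth_m * math.cos(beta) ** 2 * math.tan(phi)
    return resisting / driving


def synthetic_samples(n, seed=0):
    """Draw random slope parameters and label each sample as unstable (1) if FS < 1."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        features = (
            rng.uniform(0.0, 10.0),   # cohesion (kPa), hypothetical range
            rng.uniform(20.0, 40.0),  # friction angle (degrees)
            rng.uniform(10.0, 50.0),  # slope angle (degrees)
            rng.uniform(1.0, 5.0),    # depth of the slip surface (m)
        )
        label = 1 if infinite_slope_fs(*features) < 1.0 else 0
        samples.append((features, label))
    return samples


dataset = synthetic_samples(200)
```

The labelled samples could then be used to train any supervised ML classifier when field observations are scarce, with the caveat that the trained model inherits the assumptions of the physical model.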

Supervised, unsupervised and reinforcement learning
The majority of landslide/ML studies reviewed herein used supervised ML. Such widespread use of supervised learning is also common in other engineering and science fields. Unsupervised machine learning methods are not very popular, mainly because they do not suit labeled datasets. However, these methods can be helpful for finding anomalous data of geo-systems including natural and engineered slopes. This can be very useful, for instance, in early warning systems. Some advanced unsupervised learning methods, such as GANs and Autoencoders, have found applications in landslide detection (e.g. domain adaptation in Fang et al. 2020) and landslide susceptibility mapping. It can be foreseen that these advanced methods will receive more attention from landslide researchers in the future.
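As a very simple illustration of label-free anomaly detection on monitoring data, the sketch below flags readings that deviate strongly from the recent past using a rolling z-score. This is a basic statistical stand-in for the more advanced unsupervised methods mentioned above (it is not a GAN or an autoencoder), and the displacement-rate series, window length and threshold are hypothetical.

```python
import statistics


def rolling_zscore_anomalies(series, window=5, threshold=3.0, min_std=1e-9):
    """Indices of values lying more than `threshold` standard deviations
    away from the mean of the preceding `window` values."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mean = statistics.fmean(history)
        std = max(statistics.pstdev(history), min_std)  # avoid division by zero
        if abs(series[i] - mean) / std > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical daily displacement rates (mm/day) with one sudden acceleration
displacement_rates = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 4.0, 1.0, 1.02]
flagged = rolling_zscore_anomalies(displacement_rates)
```

No labels are needed: the detector only learns what "normal" looks like from the recent history of the series itself.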
Reinforcement learning (RL) is currently mostly used in research, but the approach already shows maturity in problem-solving for game-like scenarios. As suggested by Bergen et al. (2019), there have been efforts to use RL methods in earth sciences, particularly in earthquake and seismicity related studies (e.g. Delores et al. 2018). To our knowledge, and up to the year 2020, there are no published applications of RL to landslide studies. However, it is to be expected that, due to the necessity of rapid and data-driven decision making in issues related to landslide risk assessment and management, landslide studies will adopt RL techniques in the future.

Statistics versus ML
ML and statistically-based approaches for the detection and spatial prediction of landslides over large areas share many common characteristics. Therefore, it is not strange that most of the recent spatial forecasting studies adopting ML algorithms significantly "draw" from the experience accumulated in the past decades, since the seminal publication by Varnes (1984) on bivariate and multivariate statistical techniques and procedures for landslide susceptibility assessment and zoning. The main consequence is that almost the totality of the ML literature contributions on this topic (to the Authors' knowledge) employ a "standard" pixel-based computational approach to perform the susceptibility analyses. Therefore, even if the adopted jargon may be different, the essence of the ML analyses is the same as for any other data-driven approach for deriving a landslide susceptibility map in a GIS environment, starting from a set of input conditioning factors and a landslide event map. Statistical methods focus on inference, achieved through the creation and fitting of a problem-specific probability model, whereas machine learning methods concentrate on prediction by using general-purpose learning algorithms to find patterns in often rich and unwieldy data (Bzdok et al. 2018). From this perspective, machine learning methods are potentially more powerful in forecasting landslide patterns. Most of the issues highlighted to explain the performance of the models are commonly treated, outside the specific ML literature, whenever geospatial data-driven analyses are performed (e.g. Goetz et al. 2015; Reichenbach et al. 2018; Lombardo et al. 2020). Examples of such specificities are: resolution of information and mapping units (e.g. Calvello et al. 2013), preprocessing of conditioning factors (e.g. Guzzetti et al. 1999), low number of landslide cells in relation to non-landslide cells (e.g. Tanyu et al. 2021), influence of sampling strategy (e.g. Wang and Brenning 2021), validation practices (e.g. Steger et al. 2016), number of classes of input and output variables (e.g. Baeza et al. 2016). A discussion of these items, which are very relevant for the implementation and the applicability of data-driven techniques for landslide spatial forecasting in operational settings, goes beyond the scope of this paper.

Generalization and evaluation of the models
Model generalization is an important aspect of ML modeling that is neglected by many researchers in ML landslide studies. For instance, in landslide susceptibility mapping, most studies verify the superiority of their proposed method(s) by comparing a small number of ML algorithms in a common area. However, the proposed models are not repeatedly tested and may not outperform other methods in areas other than training areas. In fact, there is still a lack of benchmark case studies available for testing various ML methods.
In the landslide detection and spatial forecasting studies, the performance of a given ML algorithm is typically evaluated by checking the quality of the classification, with the binary model outcome being the presence or absence of landslides. For this purpose, the most used performance indicators are the area under the ROC curve and various metrics derived from the confusion matrix. Less common, but nevertheless used, are reliability diagrams, expressions quantifying the error of the analysis by means of an objective function, and null-hypothesis testing, which is more common in statistics-based studies. According to the recent literature on ML for landslide spatial forecasting, ML models typically have a higher landslide susceptibility prediction performance than statistical and heuristic models. This finding is not surprising, given that this is a pre-requisite, if not the main justification, for a scientific article to be published on this topic. However, the case studies often report (rather suspiciously) very high values of performance indicators for many algorithms in the same area; thus, the alleged "success" of an ML algorithm over the others is too often attributed to rather small differences in the values of its performance indicators. In the landslide temporal forecasting studies, the performance of a given ML algorithm is typically evaluated using quantitative performance metrics, e.g. MAPE, MAE, RMSE, MSE, and R.
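For reference, the indicators named above can be computed in a few lines. The sketch below is a plain-Python illustration using standard definitions: the area under the ROC curve is obtained from its rank interpretation (the probability that a randomly chosen landslide cell receives a higher score than a randomly chosen non-landslide cell), and MAPE is expressed in percent. The function names and the small example inputs are hypothetical.

```python
import math


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion_metrics(labels, predictions):
    """Counts of the confusion matrix and three metrics derived from it."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for l, p in pairs if l == 1 and p == 1)
    fp = sum(1 for l, p in pairs if l == 0 and p == 1)
    tn = sum(1 for l, p in pairs if l == 0 and p == 0)
    fn = sum(1 for l, p in pairs if l == 1 and p == 0)
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def regression_errors(observed, predicted):
    """MAE, MSE, RMSE and MAPE (percent; undefined if an observed value is zero)."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mse = sum(e * e for e in errors) / n
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAPE": 100.0 * sum(abs(e / o) for e, o in zip(errors, observed)) / n,
    }


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
cm = confusion_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
err = regression_errors([10.0, 20.0], [12.0, 18.0])
```

Because differences between algorithms are often small, reporting such indicators over repeated resampling, rather than from a single split, gives a better sense of whether a difference is meaningful.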

Relevance of expert opinion
ML landslide studies always need the analyst to decide much more than just the algorithm(s) to use in a given analysis. Indeed, the landslide studies discussed in all three main sections of this paper, i.e. detection and mapping, spatial forecasting and temporal forecasting, most often propose procedures that fulfill the objectives by means of a combination of methods, which include ML algorithms, and a set of heuristic expert choices. This is important to recognize when evaluating, or comparing, the performances of given ML algorithms, as they cannot be easily disentangled from the other elements comprised in the proposed procedures.
Expert knowledge plays a significant role in enhancing the performance of ML models. Feature selection heavily relies on expert knowledge in both spatial and temporal landslide predictions. Expert opinion is also reflected in algorithm selection and implementation.
Taking spatial prediction studies as an example, the recent trend is to adopt computational procedures that combine many algorithms and methods, including standard statistical analyses, to address the different phases of the landslide susceptibility analysis. For instance, in the initial factor analysis, bivariate statistical methods are used to evaluate the relevance of each conditioning factor and to assign weight coefficients to each class of each variable; cross-validation is used for checking whether weight coefficients change significantly upon resampling; and input variables are checked for multicollinearity.
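One simple way to screen input variables for multicollinearity is to compute pairwise Pearson correlation coefficients and flag strongly correlated pairs. The sketch below illustrates this; pairwise correlation is only one possible check, and the factor names, values and the 0.8 cut-off are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def collinear_pairs(factors, threshold=0.8):
    """Pairs of factors whose absolute correlation exceeds the threshold."""
    names = sorted(factors)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson(factors[a], factors[b])) > threshold
    ]


# Hypothetical conditioning factors sampled at five mapping units
factors = {
    "slope": [5.0, 10.0, 15.0, 20.0, 25.0],
    "relief": [12.0, 19.0, 31.0, 42.0, 49.0],
    "distance_to_river": [300.0, 50.0, 400.0, 120.0, 260.0],
}
flagged_pairs = collinear_pairs(factors)
```

For each flagged pair, the analyst would typically keep only one factor, or combine them, before training the susceptibility model.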

Conclusions and perspective
In this paper we provided a detailed overview of machine learning and of ML studies pertaining to landslide detection and mapping, spatial forecasting and temporal forecasting. In addition to the three sections of the paper explicitly devoted to these topics, the main general observations on different aspects of ML-based landslide studies were presented in the final Discussion section of this paper. Our review revealed that over the years the complexity of ML algorithms used in landslide studies has been matching the rapid development that is occurring in the AI/ML community. Nevertheless, it can be stated that ML still has a long path to follow in the landslide community.
Out of the three landslide subfields investigated herein, it seems that landslide detection studies are the ones which have benefited the most from progress in ML, whereas it appears that spatial and temporal forecasting have not yet gained a clear and distinct advantage from incorporating ML algorithms. This is mainly because landslide detection is essentially a Computer Vision (CV) problem, for which there is an active community within the AI community, where many developments are carried out. Those developments, as well as the fact that landslide detection does not require much physical understanding compared to other landslide research areas, encourage implementation of robust ML algorithms for more accurate landslide detection. It can also be expected that the application of DL to this aim will further increase in the coming years, thus replacing more traditional methods such as OBIA. Within DL methods, it is expected that more advanced CV algorithms will replace conventional ones. Methods such as Graph Neural Networks and various generative modeling methods such as GANs are foreseen to find more applications in landslide studies. In landslide spatial forecasting, the number of publications adopting ML algorithms in landslide susceptibility studies has been growing at a very fast pace in recent years, with a trend that resembles the growth shown, in the previous two decades, by multivariate statistical studies conducted with the same purposes. In this area, we expect that the current trend that focuses on the use of different ML algorithms and compares their performance within a common area will remain the main not-too-innovative procedural strategy explored by researchers, at least in the near future. 
Nevertheless, given the redundancy of these studies, we can surely hope that a new trend will emerge, possibly combining ML and process-based methods for a more robust and generalized assessment and understanding of landslide susceptibility at regional scale. Also in landslide temporal forecasting, it may be expected that procedures will be developed that combine ML algorithms with physically-based methods, such as computational geomechanics models. For the temporal prediction of slope failure processes, probabilistic ML/DL, such as Bayesian DL, may also possibly be an emerging trend.
In conclusion, we can confidently state that ML is a vibrant field with expanding interest and rapid advancement. We do not encourage landslide researchers to follow the same pace of ML progress in implementing ML algorithms for landslide studies, as it can jeopardize the deep understanding of both processes and ML methods. However, we do encourage the landslide community to closely observe ML upgrades and get inspiration for implementing innovative data-driven methods in landslide studies. It is surely a challenge to use ML algorithms appropriately to advance the field of landslide studies, yet the growing interest shown in recent years for such endeavors is promising. There is potential for a wider use in practice and consultancy in the future, but further research is surely needed to this aim.