1 Introduction

Spatial data mining reveals hidden, previously unknown, but potentially informative patterns in big and high-dimensional geoscience data. It takes advantage of the ever-growing availability and abundance of geographically referenced data (Sellars 2018). Many geomatic applications benefit from spatial data mining at several stages, including data collection, data storage, exploratory data analysis, data processing, prediction, and uncertainty quantification. For instance, spatial data mining techniques can be used to partition the study area to account for the differing behaviour of natural phenomena, which is useful for discovering earth processes and simplifying subsequent modelling steps (Rolnick et al. 2019). However, understanding the particularities of geosystems and geoscience data is critical for obtaining accurate and physically consistent inferences and predictions via data mining approaches (Reichstein et al. 2019; Talebi et al. 2020).

Geoscience processes vary significantly through time and space. Such heterogeneity and non-stationarity are related to the spatial and/or temporal variation of soil types, rock types, land uses, vegetation types, climatic conditions, and tectonic activities. Geographical observations that are located close to each other in space and time tend to share similar characteristics. This phenomenon is known as auto-correlation and provides additional information that can inform statistical models (Matheron 1962; Cliff and Ord 1973). Remotely sensed data are examples of earth observations that show spatial and/or temporal auto- and cross-correlations. Generally, pixels that are located close to each other in satellite images are more likely to have similar values than those that are spaced further apart (Woodcock et al. 1988). Traditional data mining algorithms treat observations as independent values (i.e., independent and identically distributed data) and exclude useful information from the analysis by disregarding space and time dependencies. Consequently, predictions and inferences from nonspatial learners can be misleading when applied to geoscience data (Reichstein et al. 2019; Bergen et al. 2019; Karpatne et al. 2019).

Spatial data mining techniques should be able to capture multivariate spatial and/or temporal patterns of different scales and types. Either current machine learning (ML) algorithms can be amended to be consistent with the nature of geoscience data or new algorithms need to be developed (Karpatne et al. 2019; Talebi et al. 2020). Among the statistical learners, tree-based techniques are very useful for geoscience data analysis and modelling due to their transparency, simplicity of implementation, ability to capture non-linear relationships, and ability to handle big and high-dimensional data, mixed data, and missing values (Kuhn and Johnson 2013). The tree-based learners can be improved further by considering the heterogeneity and spatial dependency of geoscience data.

Hengl et al. (2018) showed that adding Euclidean buffer distances to observations as covariates in a random forests (RF) model produces estimates as accurate and unbiased as those from kriging, the common geostatistical method. Liu et al. (2018) used a combination of RF and regression kriging to predict particulate matter concentrations from remotely sensed and ground observations. In their proposed methodology, a nonlinear trend between the dependent variable and covariates is predicted by RF. Subsequently, kriging is used to estimate the residuals of the predicted trend (Hengl et al. 2015). Meyer et al. (2019) argued that nonspatial validation procedures based on the random sampling of a dataset (e.g., k-fold cross-validation and bootstrapping) are inappropriate for learning from spatial data. Because of spatial correlation, nonspatial sampling results in over-optimistic predictive models that fit the input training data accurately but perform poorly in extrapolation (i.e., predicting patterns that have not been seen during the training process). Moreover, using geolocation variables such as latitude and longitude may generate artefacts in the final predicted maps. Talebi et al. (2019) introduced a new method for spatial uncertainty quantification of geoscience data based on a combination of geostatistical simulation and random forests predictive models; approaches to deal with compositional data were also discussed. Georganos et al. (2019) introduced a locally varying RF predictive model and concluded that the performance of the technique can be improved when an appropriate spatial scale is selected. Finally, Mitchell and Sheppard (2019), inspired by convolutional neural networks, demonstrated how the performance of the RF algorithm for recognising spatial patterns can be improved by introducing a local spatial bias during the learning process.

The previously mentioned approaches to improving the spatial awareness of tree-based learners are not suitable for recognising complex spatial patterns, objects, and structures of different scales, or their spatial distributions across the domain of study. Most of these techniques only use the information available at single points or rely on two-point geostatistics (e.g., variogram models and kriging techniques) to capture spatial dependency. Considering only two points or pixels is not sufficient for capturing complex spatial patterns. Multiple point geostatistics were developed to address this limitation by considering more than two points simultaneously (Mariethoz and Caers 2015). In addition, approaches for accurately handling multi-resolution spatial data and missing values must be improved further.

The objective of this study is to develop a spatial random forests (SRF) technique based on nonparametric higher-order spatial statistics for spatial data analysis and modelling. The proposed model can be applied to high-dimensional and nonlinear phenomena with small numbers of observations and multi-resolution predictors of mixed type (e.g., continuous and categorical). In the proposed technique, the dimension of the input data is increased to capture local information and to learn intrinsic heterogeneity, spatial dependencies, and complex spatial patterns. Compared with previous attempts to improve the spatial awareness of tree-based learners, the proposed technique can be considered a truly spatial predictive model. Approaches to handle big data, multi-resolution data, and missing values are discussed. Algorithms for supervised (i.e., regression and classification) and unsupervised (i.e., dimension reduction and clustering) learning are presented.

The following sections discuss the basics of the proposed methodology and provide the implementation details of the SRF model. To demonstrate the applicability of the proposed technique, extensive use of one synthetic and one real dataset is made. The objective of the synthetic case is to identify complex spatial patterns using unsupervised SRF, while the focus of the real case is to reproduce the interpreted geological map and to predict the geochemistry of the formations in the North West Minerals Province (NWMP) of Queensland, Australia. The paper is concluded with a short review of the pros and cons of the proposed technique and recommendations for future research.

2 Methodology

The methodology is based on the classical binary recursive partitioning tree (Breiman et al. 1984) and the classical random forests algorithm (Breiman 2001). In the following subsections, the classical decision tree and random forests algorithm are extended to account for the complex spatial dependencies of regionalised variables. The local spatial-spectral information is captured by extracting and vectorising multivariate spatial patterns. Such local information is used to learn intrinsic heterogeneity, spatial dependencies, and complex spatial patterns during the training process of regressors and classifiers. Approaches to account for multi-resolution data and missing values are discussed. Finally, a novel approach to generate synthetic multivariate spatial patterns is proposed for unsupervised modelling of spatial data.

2.1 Vectorised Spatial Patterns

Supervised and unsupervised learning models are trained using a set of \(N\) observations at locations \({u}_{i}\), \(i=\{1,\dots ,N\}\) (Fig. 1a). These observations can be multivariate (e.g., of dimension \(R\)) and mixed (e.g., continuous and categorical, Fig. 1b to d). It is assumed that the observations are located on regular grids. Non-gridded observations should first be migrated to the closest nodes of a regular grid of suitable resolution or be rasterised using geostatistical techniques. For each location \({u}_{i}\) and regionalised variable \({x}^{r}(u)\), \(r=\{1,\ldots,R\}\), a local spatial pattern is extracted and stored as a vector \({pat}_{r}\left({u}_{i}\right)=\left[{x}^{r}\left({u}_{i1}\right),\dots ,{x}^{r}\left({u}_{i{E}_{r}}\right)\right]\), which consists of the values of the \(r\)th variable at \({E}_{r}\) nodes around the location \({u}_{i}\). The parameter \({E}_{r}\) controls the order of spatial statistics (i.e., the number of nodes in the extracted pattern) for each regionalised variable. The univariate spatial patterns can have different shapes and scales. In this illustration, the spatial patterns have square shapes with different sizes and numbers of nodes (Fig. 1e). Square patterns are stored as \({pat}_{r}({u}_{i})\) vectors pixel by pixel, from left to right and row by row downward. The overall multivariate spatial pattern (Fig. 1e) for each location \({u}_{i}\) is stored as a long vector \(pat({u}_{i})=\left[{pat}_{1}\left({u}_{i}\right),\dots , {pat}_{R}({u}_{i})\right]\). The total number of predictors is \(P = \sum_{r=1}^{R}{E}_{r}\). To make the technique invariant to rotation, scaling, and distortion, transformed patterns should be learned during the training process. For instance, the rotated multivariate patterns, \(rot(pat\left({u}_{i}\right))\), can be added to the information about location \({u}_{i}\) to be used later during training. In the proposed method, only a simplified version of rotation invariance is used, rotating multivariate patterns through angles of 90 degrees, which quadruples the number of observations \(N\) (Fig. 1f). To train a supervised spatial model (i.e., classification or regression), the response values \(y\left({u}_{1}\right),\ldots ,y\left({u}_{N}\right)\) must be stored with the predictors. The final training dataset \({\mathcal{D}}=\left\{\left(pat\left({u}_{1}\right),y\left({u}_{1}\right)\right),\dots ,\left(pat\left({u}_{N}\right),y\left({u}_{N}\right)\right)\right\}\) is used to train the SRF model.

Fig. 1

Preprocessing the input spatial data, a prediction grid, b categorical variable, c high-resolution continuous variable, d low-resolution continuous variable, e extracted multivariate pattern for location \({u}_{i}\), f rotated patterns for invariant training, and g the proposed process for generating synthetic patterns for unsupervised learning
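To make this construction concrete, the following minimal sketch (Python with numpy, not the authors' implementation) extracts and vectorises square windows; for simplicity it assumes odd-sided windows centred on \({u}_{i}\), with off-grid nodes stored as NaN missing values, and all function names are illustrative.

```python
import numpy as np

def extract_pattern(grid, i, j, half):
    """Vectorise the (2*half+1)^2 square window centred on node (i, j),
    pixel by pixel from left to right and row by row downward; nodes
    outside the grid become NaN (treated as missing)."""
    n_rows, n_cols = grid.shape
    size = 2 * half + 1
    pat = np.full((size, size), np.nan)
    for di in range(-half, half + 1):
        for dj in range(-half, half + 1):
            r, c = i + di, j + dj
            if 0 <= r < n_rows and 0 <= c < n_cols:
                pat[di + half, dj + half] = grid[r, c]
    return pat.ravel()

def multivariate_pattern(grids, i, j, halves):
    """Concatenate the univariate patterns pat_r(u_i) into pat(u_i);
    each variable r may use its own window half-width (different E_r)."""
    return np.concatenate([extract_pattern(g, i, j, h)
                           for g, h in zip(grids, halves)])

def rotated_patterns(grids, i, j, halves):
    """Simplified rotation invariance: the four 90-degree rotations of
    the square windows, each re-vectorised, quadrupling the observations."""
    out = []
    for k in range(4):
        parts = []
        for g, h in zip(grids, halves):
            size = 2 * h + 1
            win = extract_pattern(g, i, j, h).reshape(size, size)
            parts.append(np.rot90(win, k).ravel())
        out.append(np.concatenate(parts))
    return out
```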

2.2 Spatial Decision Trees

A spatial tree partitions the predictor space via a sequence of binary splits on individual predictors \({x}^{p}(u)\), \(p = \{1,\ldots ,P \}\). The root node of the tree comprises all the vectorised multivariate patterns \(pat({u}_{i})\), \(i = \{1,\ldots ,N\}\). The final partitions of the predictor space are associated with the terminal nodes (i.e., the nodes that are not split). The internal nodes are split into two descendant nodes (i.e., right and left nodes) according to the value of one of the predictors. If a predictor is continuous, a split is determined by a cut-off value; multivariate patterns with the selected predictor smaller than the cut-off go to the left node, and the remainder go to the right node (Fig. 2). Similarly, for a categorical predictor with finite levels \({S}_{i}\), a binary split is obtained by defining a subset of levels \(S\subset {S}_{i}\).

Fig. 2

Spatial decision tree for learning multivariate spatial patterns

A splitting criterion must be defined to find the best predictor and the best split. For a continuous response variable with values \(y\left({u}_{1}\right),\dots , y\left({u}_{n}\right)\) at the node to be split, the mean of the squared residuals is typically used as the splitting criterion

$${\mathcal{Q}}=\frac{1}{n}\sum_{i=1}^{n}{\left(y\left({u}_{i}\right)-\overline{y }\right)}^{2},$$
(1)

where \(\overline{y} =\frac{1}{n}\sum_{i=1}^{n}y\left({u}_{i}\right)\) is the local mean of the response variable and \(n\) is the number of patterns at the node to be split. For a categorical response variable with \(K\) levels, a typical splitting criterion is the Gini impurity index

$${\mathcal{Q}}=1-\sum_{k=1}^{K}{\widehat{p}}_{k}^{2},$$
(2)

where \({\widehat{p}}_{k}=\frac{1}{n}\sum_{i=1}^{n}I\left(y\left({u}_{i}\right)=k\right)\) is the proportion of class \(k\) in the node to be split, and \(I\left(y\left({u}_{i}\right)=k\right)=1\) if \(y\left({u}_{i}\right)=k\) and 0 otherwise. The splitting criteria for the left and right descendant nodes, \({{\mathcal{Q}}}_{L}\) and \({{\mathcal{Q}}}_{R}\), and the associated sample sizes, \({n}_{L}\) and \({n}_{R}\), are calculated for each split candidate. The best split is the one that minimises \({{\mathcal{Q}}}_{split}={n}_{L}{{\mathcal{Q}}}_{L}+{n}_{R}{{\mathcal{Q}}}_{R}\). The splitting process continues until a stopping criterion is met; for instance, the recursive splitting stops when the sample size of the node falls below a predefined threshold. Predicted values of the response variable at the terminal nodes are given by \(\widehat{h}\left(x\left(u\right)\right)=\frac{1}{n}\sum_{i=1}^{n}y\left({u}_{i}\right)\) for regression and \(\widehat{h}\left(x\left(u\right)\right)={argmax}_{y\left(u\right)}\sum_{i=1}^{n}I\left(y\left({u}_{i}\right)=y\left(u\right)\right)\) for classification. The predictors \({x}^{p}\left(u\right)\), \(p=\left\{1,\ldots ,P\right\}\), of a new multivariate pattern \(pat\left(u\right)\) are used to determine a terminal node, and \(\widehat{h}\left(x\left(u\right)\right)\) is used as the prediction.
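The split search itself can be summarised in a few lines. The sketch below, an illustrative fragment rather than the paper's code, scans the unique values of a single continuous predictor and scores each candidate cut-off with the criteria of Eqs. (1) and (2).

```python
import numpy as np

def mse_criterion(y):                       # Eq. (1), for regression
    return float(np.mean((y - y.mean()) ** 2)) if len(y) else 0.0

def gini_criterion(y):                      # Eq. (2), for classification
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return float(1.0 - np.sum(p ** 2))

def best_split(x, y, criterion):
    """Return (cut-off, score) minimising Q_split = n_L*Q_L + n_R*Q_R
    for one continuous predictor x over the node's responses y."""
    best_cut, best_score = None, np.inf
    for cut in np.unique(x)[:-1]:           # candidate cut-offs
        left, right = y[x <= cut], y[x > cut]
        score = len(left) * criterion(left) + len(right) * criterion(right)
        if score < best_score:
            best_cut, best_score = cut, score
    return best_cut, best_score
```

For a categorical predictor, the same scoring would run over candidate level subsets \(S\subset {S}_{i}\) instead of cut-off values.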

2.3 Spatial Random Forests

The SRF model uses the proposed spatial trees as the base learners. Using the input data \({\mathcal{D}} = \left\{\left(pat\left({u}_{1}\right),y\left({u}_{1}\right)\right),\dots ,\left(pat\left({u}_{N}\right),y\left({u}_{N}\right)\right)\right\}\), the base learners \({\widehat{h}}_{j}\left(x\left(u\right),{\theta }_{j},{{\mathcal{D}}}_{j}\right)\), \(j=1,\dots ,J\), are trained to learn the input multivariate spatial patterns. The random component \({\theta }_{j}\) injects randomness into the base learners by fitting each spatial tree to an independent bootstrap sample \({{\mathcal{D}}}_{j}\) of \({\mathcal{D}}\) and finding the best split over \(m\) randomly selected predictors. Sampling the predictors for node splitting is particularly relevant in the case of high-dimensional data, where the data contain a very large number of predictors relative to a small number of observations. The SRF prediction for a new multivariate pattern \(pat\left(u\right)\) is given by \({\widehat{f}}(x(u))=\frac{1}{J}\sum_{j=1}^{J}{\widehat{h}}_{j}(x(u))\) for regression and \(\widehat{f}\left(x\left(u\right)\right)={argmax}_{y\left(u\right)}\sum_{j=1}^{J}I\left({\widehat{h}}_{j}\left(x\left(u\right)\right)=y\left(u\right)\right)\) for classification. In the case of classification, instead of the abstract class prediction based on majority voting for a \(pat\left(u\right)\), one can look at the fraction \({p}_{k}\left(u\right)\) of the total number of trees that vote for each class \(k\), \(k=\left\{1,\ldots ,K\right\}\). These predicted fractions can be used to calculate the local entropy (Shannon 1948; Goovaerts 1997) as a measure of local uncertainty

$$H\left(u\right)=-\frac{1}{\ln\left(K\right)}\sum_{k=1}^{K}{p}_{k}\left(u\right)\ln\left({p}_{k}\left(u\right)\right),$$
(3)

where \(H\left(u\right)\in \left[{0,1}\right]\). Locations where the entropy is close to 1 have high uncertainty, and the entropy is greatest when all fractions \({p}_{k}\left(u\right)\) are equal.
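Equation (3) translates directly into code; the short sketch below assumes the vote fractions \({p}_{k}\left(u\right)\) have already been collected from the forest and takes \(0\ln 0 = 0\) by convention.

```python
import numpy as np

def standardised_entropy(votes):
    """votes: the K fractions p_k(u) of trees voting for each class."""
    p = np.asarray(votes, dtype=float)
    p = p[p > 0]                                # 0*ln(0) terms contribute 0
    return float(-np.sum(p * np.log(p)) / np.log(len(votes)))

print(standardised_entropy([0.25, 0.25, 0.25, 0.25]))  # 1.0: maximal uncertainty
print(standardised_entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0: unanimous vote
```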

For each bootstrap sample \({{\mathcal{D}}}_{j}\) and tree prediction \({\widehat{h}}_{j}\left(x\left(u\right)\right)\), some patterns \(pat\left({u}_{i}\right)\) do not make it into the sample. These remaining patterns are called out-of-bag (OOB) and can be used to estimate the generalisation error of the SRF model and to measure predictor importance. For each pattern \(pat\left({u}_{i}\right)\), \(i=1,\dots ,N\), a prediction of the response variable is obtained using only the \({J}_{i}\) spatial trees for which \(pat\left({u}_{i}\right)\) is OOB, indexed by \({d}_{i}=\left\{j : \left(pat\left({u}_{i}\right),y\left({u}_{i}\right)\right)\notin {{\mathcal{D}}}_{j}\right\}\) with \({J}_{i}=\left|{d}_{i}\right|\). The OOB predictions are given by \({\widehat{f}}_{OOB}\left(x\left(u\right)\right)=\frac{1}{{J}_{i}}\sum_{j\in {d}_{i}}{\widehat{h}}_{j}\left(x\left(u\right)\right)\) for regression and \({\widehat{f}}_{OOB}\left(x\left(u\right)\right)={{argmax}}_{y\left(u\right)}\sum_{j\in {d}_{i}}I\left({\widehat{h}}_{j}\left(x\left(u\right)\right)=y\left(u\right)\right)\) for classification. The generalisation error for a regression model is typically estimated via the out-of-bag mean squared error

$${MSE}_{OOB}=\frac{1}{N}\sum_{i=1}^{N}{\left(y\left({u}_{i}\right)-{\widehat{f}}_{OOB}\left(x\left({u}_{i}\right)\right)\right)}^{2}.$$
(4)

Similarly, the generalisation error for a classification model is estimated using the OOB error rate

$${E}_{OOB}=\frac{1}{N}\sum_{i=1}^{N}I\left(y\left({u}_{i}\right)\ne {\widehat{f}}_{OOB}\left(x\left({u}_{i}\right)\right)\right).$$
(5)
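A compact sketch of the OOB estimate in Eq. (4) follows, assuming two hypothetical inputs: a boolean in-bag indicator matrix and a matrix of per-tree predictions. Each pattern is assumed to be OOB in at least one tree.

```python
import numpy as np

def oob_mse(y, tree_preds, in_bag):
    """y: (N,) responses; tree_preds: (N, J) per-tree predictions;
    in_bag: (N, J) boolean, True if pattern i was drawn into D_j."""
    oob = ~in_bag
    n_oob = oob.sum(axis=1)                        # J_i for each pattern
    f_oob = (tree_preds * oob).sum(axis=1) / n_oob # OOB-averaged prediction
    return float(np.mean((y - f_oob) ** 2))        # Eq. (4)
```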

The strategies used for tuning the hyperparameters of a classical RF model (Probst et al. 2019) can also be implemented for the proposed SRF model. However, due to the underlying structure of the SRF model that increases the dimension of the input data to capture spatial patterns, automated hyperparameter optimisation is computationally demanding.

Measuring the importance of predictors is useful for variable selection and for interpreting the SRF model. For the input patterns \(pat\left({u}_{i}\right)\), \(i=1,\dots ,N\), the OOB predictions \({\widehat{f}}_{OOB}\left(x\left(u\right)\right)\) are obtained. The values of a predictor of interest \({x}^{p}\left(u\right)\) are then randomly permuted in the OOB patterns, and the OOB predictions are recalculated. The difference between the error rates (classification) or mean squared errors (regression) of the predictions obtained from the original OOB patterns and those obtained using the permuted data gives a measure of importance for the predictor \({x}^{p}\left(u\right)\). The same procedure is used to measure the importance of the other predictors. In other words, the predictor importance estimates the increase in error rate or mean squared error of a predictive model on a test set when the values of a predictor of interest are randomly permuted. The spatial approach to measuring predictor importance not only ranks the predictors but also measures the zone of influence for each regionalised variable \({x}^{r}(u)\), \(r=\left\{1,\dots ,R\right\}\). The zone of influence for a regionalised variable \({x}^{r}(u)\) shows the importance of the local information at the \({E}_{r}\) nodes surrounding a central node \({u}_{i}\). The zone of influence can be used to assess the anisotropic behaviour and scale of the spatial patterns and structures. The resulting information about the significance of the \({E}_{r}\) surrounding nodes can be used to define the structure or geometry of data events in multiple point geostatistics and also to assess training images (Mariethoz and Caers 2015).
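A hedged sketch of this permutation procedure is given below; `forest_oob_error` is a hypothetical callable (not defined in the paper) that returns the OOB error rate or MSE of the fitted SRF on the supplied patterns, and `window_sizes` lists the side length of each variable's square window.

```python
import numpy as np

def permutation_importance(X_oob, y_oob, forest_oob_error, seed=0):
    """Permutation importance over the OOB patterns (rows of X_oob)."""
    rng = np.random.default_rng(seed)
    base = forest_oob_error(X_oob, y_oob)             # error on intact data
    imp = np.empty(X_oob.shape[1])
    for p in range(X_oob.shape[1]):
        X_perm = X_oob.copy()
        X_perm[:, p] = rng.permutation(X_perm[:, p])  # permute predictor p only
        imp[p] = forest_oob_error(X_perm, y_oob) - base
    return imp

def zone_of_influence(imp, window_sizes):
    """Reshape the length-P importance vector into one square map per
    regionalised variable; window_sizes[r] is the side length of window r."""
    zones, start = [], 0
    for s in window_sizes:
        zones.append(imp[start:start + s * s].reshape(s, s))
        start += s * s
    return zones
```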

Missing data occur frequently and require appropriate handling. For instance, the spatial patterns at the margins of a study area may contain many missing values. In the proposed algorithm, missing data imputation is conducted temporarily at each splitting node. To assess the splitting criterion for a predictor \({x}^{p}\left(u\right)\), missing values are imputed by sampling from the known in-bag values at the node to be split. The imputed values are only used to assign the patterns with missing values to the left or right descendant nodes, and are removed from the descendant nodes after splitting (Ishwaran et al. 2008).
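The node-level imputation can be sketched as follows, assuming at least one observed in-bag value of the predictor exists at the node; the imputed values decide only the left/right routing and are then discarded.

```python
import numpy as np

def route_with_imputation(x_node, cut, seed=0):
    """Assign node patterns to the left (True) or right (False) child for
    a continuous predictor, imputing NaNs only for the routing decision."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x_node, dtype=float).copy()
    missing = np.isnan(x)
    observed = x[~missing]                  # assumed non-empty at the node
    x[missing] = rng.choice(observed, size=missing.sum(), replace=True)
    return x <= cut                         # imputed values are then discarded
```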

2.4 Unsupervised Learning

Geoscience data typically hold informative statistical and spatial patterns. Consequently, such statistical and spatial patterns should be distinguishable from a randomly generated version of themselves. A synthetic set of multivariate patterns is generated by introducing two steps of randomisation. First, for each regionalised variable \({x}^{r}(u)\), \(r=\left\{1,\dots ,R\right\}\), the \(N\) univariate patterns \({pat}_{r}\left({u}_{i}\right)\) are randomly sampled (without replacement). Then, the \({E}_{r}\) nodes of each sampled pattern \({pat}_{r}\left({u}_{i}\right)\) are permuted (Fig. 1(g)). The random sampling of the univariate patterns removes the spatial cross-correlations while the permutation of the \({E}_{r}\) locations removes the remaining auto-correlations in the synthetic patterns. The synthetic pattern for each location \({u}_{i}\) is stored as a long vector \({pat}^{*}\left({u}_{i}\right)=\left[{pat}_{1}^{*}\left({u}_{i}\right),\dots , {pat}_{R}^{*}({u}_{i})\right]\). The final dataset for unsupervised learning is constructed by concatenating the synthetic vectorised multivariate patterns below the original ones and consists of \(2\times N\) rows of vectorised multivariate patterns and \(P\) columns of predictors. A binary SRF classifier (introduced in Sect. 2.3) is trained to discriminate the original multivariate patterns from the synthetic ones. An OOB error rate lower than 50\% indicates the presence of coherent statistical and spatial patterns in the input spatial data and the capability of the SRF model to capture those patterns accurately.
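The two-step randomisation can be sketched as below, assuming the univariate patterns of each variable are stored as an \(N\times {E}_{r}\) array; all names are illustrative.

```python
import numpy as np

def synthetic_patterns(pats_per_var, seed=0):
    """pats_per_var: list of R arrays of shape (N, E_r), one per variable.
    Returns the synthetic pat*(u_i) matrix of shape (N, P)."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for pats in pats_per_var:
        shuffled = pats[rng.permutation(len(pats))].copy()  # step 1: resample patterns
        for row in shuffled:                                # step 2: permute the E_r nodes
            rng.shuffle(row)
        synthetic.append(shuffled)
    return np.hstack(synthetic)

# The binary training set stacks the originals (label 1) on the synthetic
# patterns (label 0), giving 2N rows and P columns.
```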

The \(N\) patterns \(pat\left({u}_{i}\right)\) are passed down each spatial tree, and if a pair of multivariate patterns, \(pat\left({u}_{i}\right)\) and \(pat\left({u}_{j}\right)\), \(i,j\in \left\{1,\dots ,N\right\}\), ends in the same terminal node, their proximity is increased by 1. The proximities are averaged over the \(J\) trees in the SRF classifier to generate the spatially aware \(N\times N\) proximity matrix. The principle is that any two similar multivariate spatial patterns will follow the same paths in different spatial decision trees. The proximity between two patterns is a measure of how close together they are in the spatial predictor space, and it automatically gives more weight to the important spatial predictors that are useful for identifying the underlying statistical and spatial structures of geoscience data. The proximities between multivariate patterns are real numbers bounded between 0 and 1. By subtracting the proximities from 1, a positive semi-definite dissimilarity matrix \(D\) is constructed that can be used for unsupervised analysis.
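A sketch of the proximity computation follows, with scikit-learn's RandomForestClassifier standing in for the trained SRF binary classifier (a tooling assumption); `apply` returns the terminal-node index of every pattern in every tree, and the matrix costs \(O(N^{2})\) memory.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier  # stand-in for the SRF

def proximity_matrix(forest, X):
    """Average, over trees, of the indicator that two patterns share a
    terminal node; values lie in [0, 1]."""
    leaves = forest.apply(X)            # shape (N, J): leaf index per tree
    N, J = leaves.shape
    prox = np.zeros((N, N))
    for j in range(J):
        prox += leaves[:, j][:, None] == leaves[None, :, j]
    return prox / J

# Spatially aware dissimilarity for unsupervised analysis: D = 1 - proximity.
```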

To understand the various patterns and spatial structures in the study area, the spatially aware dissimilarity matrix \(D\) is used to perform classical multidimensional scaling (MDS) and to obtain a simplified two- or three-dimensional plot (Mead 1992). Each point on such a plot represents one of the multivariate patterns \(pat\left({u}_{i}\right)\), and the distances between the points reproduce, as closely as possible, the proximity-based distances in \(D\). Subgroups of patterns or outlier patterns (e.g., gold mineralisation) can be identified easily from the MDS plots. The spatially aware dissimilarity matrix \(D\) can also be used to cluster the geoscience data and to define the underlying natural domains. In this study, the partitioning around medoids (PAM) technique (Kaufman and Rousseeuw 1990) is used for this purpose.
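These downstream steps can be sketched with off-the-shelf tools, assuming scikit-learn for (SMACOF-based) MDS and the scikit-learn-extra package for PAM; neither choice is prescribed by the paper.

```python
from sklearn.manifold import MDS
from sklearn_extra.cluster import KMedoids   # assumed available (scikit-learn-extra)

def mds_embedding(D, n_components=2, seed=0):
    """MDS directly on the precomputed dissimilarities."""
    mds = MDS(n_components=n_components, dissimilarity="precomputed",
              random_state=seed)
    return mds.fit_transform(D)              # one point per pattern pat(u_i)

def pam_clusters(D, n_clusters, seed=0):
    """Partitioning around medoids on the precomputed dissimilarities."""
    pam = KMedoids(n_clusters=n_clusters, metric="precomputed",
                   method="pam", random_state=seed)
    return pam.fit_predict(D)
```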

Several techniques can be used to assess the quality of clustering and to select the optimum number of clusters (Kassambara 2017). In this study, the silhouette width is used to assess the performance of the proposed unsupervised SRF. For each location \({u}_{i}\), the silhouette width is calculated as

$$S\left({u}_{i}\right)=\frac{b\left({u}_{i}\right)-a\left({u}_{i}\right)}{max\left(a\left({u}_{i}\right),b\left({u}_{i}\right)\right)},$$
(6)

where \(a\left({u}_{i}\right)=\frac{1}{\left|{C}_{p}\right|-1}\sum_{{u}_{j}\in {C}_{p}, i\ne j}d\left(pat\left({u}_{i}\right),pat\left({u}_{j}\right)\right)\) is the average distance between the multivariate pattern located at \({u}_{i}\) and all other patterns at locations \({u}_{j}\) of the cluster \({C}_{p}\) to which \({u}_{i}\) belongs. The smaller the value of \(a\left({u}_{i}\right)\), the better the assignment of location \({u}_{i}\) to the cluster \({C}_{p}\). The parameter \(b\left({u}_{i}\right)={\min }_{q\in \left\{1,\dots ,M\right\},\, q\ne p}\left(\frac{1}{\left|{C}_{q}\right|}\sum_{{u}_{j}\in {C}_{q}}d\left(pat\left({u}_{i}\right),pat\left({u}_{j}\right)\right)\right)\) is the minimum of the average distances between the pattern at location \({u}_{i}\) and all patterns at locations \({u}_{j}\) belonging to the clusters of which \({u}_{i}\) is not a member. This parameter can be seen as the distance between the location \({u}_{i}\) and the closest cluster to which \({u}_{i}\) does not belong. Both parameters can be calculated directly from the spatially aware dissimilarity matrix \(D\). Locations with \(S\left({u}_{i}\right)\) close to 1 are well clustered. As \(S\left({u}_{i}\right)\) approaches −1, the location \({u}_{i}\) has probably been assigned to the wrong cluster. Finally, \(S\left({u}_{i}\right)\) close to 0 means that the location \({u}_{i}\) lies between two clusters. The silhouette width can be averaged over the study area \(A\) for different numbers of clusters \(M\); the optimal number of clusters is the one that maximises this average.
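With \(D\) precomputed, model selection by the average silhouette width reduces to a few lines; the sketch reuses `pam_clusters` from the previous fragment and scikit-learn's silhouette_score.

```python
from sklearn.metrics import silhouette_score

def optimal_n_clusters(D, candidates=range(2, 11)):
    """Average silhouette width (Eq. 6) for each candidate M; the optimum
    maximises the average over the study area."""
    scores = {M: silhouette_score(D, pam_clusters(D, M), metric="precomputed")
              for M in candidates}
    best = max(scores, key=scores.get)
    return best, scores
```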

In the case of large datasets such as high-resolution satellite images, the dissimilarity matrix \(D\) can be enormous. Calculating, storing, and analysing such large matrices are challenging. To handle such large datasets, a representative sample of the vectorised multivariate patterns is taken, and the proposed method is applied to this subset to define the cluster numbers. Subsequently, an SRF classifier is trained using the sampled vectorised multivariate patterns as input predictors and the cluster numbers as the new categorical response variable. This multiclass classifier is later used to predict the cluster numbers across study area \(A\).
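An end-to-end sketch of this subsample-then-classify strategy, chaining the earlier fragments (`synthetic_patterns`, `proximity_matrix`, `pam_clusters`) with RandomForestClassifier again standing in for the SRF; a single regionalised variable is assumed for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def cluster_large_grid(X_all, sample_size, n_clusters, seed=0):
    """Cluster a sample of patterns, then propagate the cluster numbers to
    the full grid with a multiclass classifier."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_all), size=sample_size, replace=False)
    X_s = X_all[idx]
    # Unsupervised SRF on the sample: originals (1) versus synthetic (0).
    X_syn = synthetic_patterns([X_s])        # single-variable case for brevity
    clf_u = RandomForestClassifier(n_estimators=1000, random_state=seed).fit(
        np.vstack([X_s, X_syn]),
        np.r_[np.ones(len(X_s)), np.zeros(len(X_syn))])
    D = 1.0 - proximity_matrix(clf_u, X_s)
    labels = pam_clusters(D, n_clusters)     # cluster numbers for the sample
    # Multiclass classifier extends the cluster numbers across the study area.
    clf_c = RandomForestClassifier(n_estimators=1000, random_state=seed)
    return clf_c.fit(X_s, labels).predict(X_all)
```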

3 Implementation and Results

3.1 SRF Unsupervised Learning

The aim of this experiment is to investigate the capability of the unsupervised SRF model for capturing complex spatial patterns. The experiment uses a synthetic dataset consisting of two continuous variables with the same resolution (Fig. 3a, b). In this case study, the resolution of the clustering grid is the same as that of the input variables (\(N={10{,}000}\)). Unconsolidated sediments, marshes, and channels are the three spatial structures that need to be captured by unsupervised learning (Fig. 3a). The second variable is Gaussian noise and is used to assess the performance of the proposed SRF model in the presence of noisy variables without any coherent spatial structure. The same order of spatial statistics was selected for the two input variables, \({E}_{1}={E}_{2}=81\) (Fig. 3d, e). The default value was selected for the parameter \(m=ceiling \left(\sqrt{\sum_{r=1}^{2}{E}_{r}}\right)=13\). The number of spatial decision trees was set to \(J={1{,}000}\). The error rate \({E}_{OOB}=0.024\) demonstrates the presence of coherent statistical and spatial structures in the input spatial data and the ability of the SRF model to discriminate the original spatial patterns from the synthetic ones. The average silhouette width suggests three spatial clusters (Fig. 3c), which is consistent with the prior knowledge. Comparing the zones of influence for the two input variables reveals the robustness of the SRF model in automatically ignoring noisy variables without any coherent spatial structure (Fig. 3d, e). The spatially aware dissimilarity matrix \(D\) was used to visualise the level of similarity between the multivariate spatial patterns in the study area via the multidimensional scaling approach. The first coordinate in Fig. 3f discriminates marshes and channels from the unconsolidated sediments, whereas the second coordinate discriminates marshes from channels. A classical RF (\({E}_{1}={E}_{2}=1\)) was also implemented to assess the gain from using a spatially aware learner. The final clustered maps produced by the classical RF model and the proposed SRF model can be seen in Fig. 3g and h, respectively. The SRF model recognised the underlying geological structures, while the nonspatial RF model only captured unstructured noise. Finally, Fig. 3i shows the silhouette width \(S\left(u\right)\) across the study area. According to this map, the channels are very well clustered, whereas the transitions between channels and marshes show lower silhouette widths.

Fig. 3

a and b input variables #1 and #2, respectively, c optimum number of clusters based on average silhouette width, d and e spatial predictor importance for input variables #1 and #2, respectively, f multidimensional scaling plot coloured by the predicted spatial clusters via the SRF model, g final clusters generated via the classical RF model, h spatial clusters generated via the SRF model, and i the estimated silhouette width \(S\left(u\right)\) across the study area

3.2 Automated Geology Mapping and Geochemical Prediction

3.2.1 Study Area and Initial Data

The overall aim of the experiments with real geoscience data is to investigate how aspects of the SRF model can support geoscientists to make quicker and more informed interpretations by semi-automated or automated analysis of multi-source spatial datasets. The study area covers a small part of the North West Minerals Province (NWMP) in Queensland (Fig. 4). The first experiment aims at digital geological mapping using geophysical covariates and an interpreted geology map (SRF classification), while the second experiment attempts to predict the concentration of Cu as an indicator of possible mineralisation in rock formations using in situ geochemical soil samples (SRF regression).

Fig. 4

The input spatial data to train predictive models. a to g Continuous regionalised variables, h categorical response variable, and i continuous response variable. To enhance the visualisation of spatial variables, the colour scales for the continuous variables are defined according to quantile (equal probability) classes

All the geophysical covariates were acquired from the Geological Survey of Queensland (GSQ, https://geoscience.data.qld.gov.au). The digital elevation model (DEM, Fig. 4a) was obtained from the 1 arc second bare-earth DEM, with attempted removal of vegetation and man-made structure effects (Gallant et al. 2011). The variable reduction to the pole (VRTP) correction was applied to the merged total magnetic intensity grid (Fig. 4b). Apart from the VRTP map, the first vertical derivative (1VD) of the magnetic field (Fig. 4c) was also used to enhance the resolution of closely spaced and superposed anomalies (Greenwood 2018). The gamma-ray spectrometry data (Fig. 4d–f) were derived from Geoscience Australia’s Third Edition Radiometric Map of Australia (Minty et al. 2010). The Bouguer anomaly gravity grid (Fig. 4g) represents a gridded compilation of all freely available gravity observations from exploration companies and state and federal regional surveys, maintained by the GSQ (Cant 2014). The simplified solid geology (interpreted by GSQ, Fig. 4h) is used as the categorical response variable for the classification experiment. The in situ soil geochemistry data were obtained from the Queensland Government Department of Natural Resources and Mines (QDEX 2018). The Cu samples from 3,755 locations across the study area are used to train the random forests regressors (Fig. 4i). Although dependent and independent layers can have different spatial resolutions in the proposed model, the resolution of the predictors and the response variables is 90 m in this case.

3.2.2 SRF Classification

Geological maps are normally products of field observations and interpretation of geophysical and remote sensing datasets, as well as expert background knowledge about the geological history of the map area. This experiment investigates the suitability of the SRF model to provide a repeatable and faster method for automated geology mapping. The model utilises the existing geological interpretation (Fig. 4h) by drawing samples from it, and demonstrates how the published geology can be replicated when all possible geological classes are known a priori and training samples are available from each class. A balanced training dataset was obtained by randomly sampling 84 pixels from each geological class (\(N=672\)). The training pixels together constitute 1\% of the interpreted geology map (Fig. 4h). The remaining 99\% of the pixels are used to validate the predictions.

The number of bootstrap samples (number of spatial decision trees) is set to \(J = {1{,}000}\). The number of regionalised variables is \(R=7\) (Fig. 4a–g), and \({E}_{r}=100\) for all covariates. The overall number of spatial predictors is \(P=\sum_{r=1}^{7}{E}_{r}=700\). The default value was selected for the parameter \(m=ceiling \left(\sqrt{\sum_{r=1}^{7}{E}_{r}}\right)=27\) in the classification framework. The OOB error rate for the SRF classifier is \({E}_{OOB}=0.134\). The SRF model returns a measure of predictor importance that reveals which input covariates provide the most value (Fig. 5). The gravity layer and the thorium concentrations from the gamma-ray spectrometric data are more important than the other covariates for predicting the true underlying geology at the selected scale (900 m × 900 m). The zone of influence for each regionalised covariate can be investigated further. For instance, the gravity zone of influence shows long-range spatial correlations, which means that this covariate captures large geological structures with specific gravity features. Unlike the gravity layer, the zone of influence for the magnetic VRTP is anisotropic and shows a clear orientation of the geological patterns with specific shallow magnetic responses at the selected scale. Finally, the thorium layer captures isotropic small-scale geological structures with exceptionally high thorium concentrations.

Fig. 5

Spatial predictor importance and the zone of influence for the input covariates obtained from the SRF classifier

A classical RF classifier (\({E}_{r}=1; r=\left\{1,\dots ,7\right\}\)) was also implemented to compare the results of the SRF model with a nonspatial ensemble learner. Figure 6a and b show the surface geology predicted by the SRF and RF classifiers, respectively. The misclassified pixels (based on the 99\% validation pixels) are shown in Fig. 6c. There is a higher proportion of misclassified pixels in the RF-predicted geological map than in the SRF results. The SRF-predicted geology map shows greater spatial continuity, whereas the RF map displays a distinct lack of spatial continuity and a noisy appearance. Figure 6d shows the omnidirectional experimental variograms for geological class #9. The larger nugget effect and shorter variogram range generated by the RF model show the limitations of nonspatial learners in reproducing the spatial continuity of geoscience data. The SRF model was able to correctly reproduce the geological picture with similar detail to the interpreted geology map (Fig. 4h). The class-wise and overall error rates for the two models are shown in Table 1. All the geological classes were predicted with higher accuracy using the SRF model. Geological classes #13, #4, and #33 show the lowest error rates, while class #9 shows the highest error rate in the SRF model. The local uncertainty of the geological class predictions can be measured by the standardised entropy (Eq. 3). The entropy maps can be used interactively during the mapping process to gain valuable information during fieldwork through the identification of areas of high variability and uncertainty. Figure 6e and f show the entropy maps for the SRF and RF models, respectively. Compared to the RF model, the geological predictions show lower uncertainty in the SRF model. Geological class #33 shows the lowest level of uncertainty, which might be related to the physical homogeneity of this unit. The transitions from class #13 to classes #9 and #35 and the boundaries of class #5 show high uncertainty (Fig. 6e). The high uncertainty of transition zones is partly related to the fact that geological units are themselves interpreted models that utilise the same geophysical covariates. Deterministic and subjective approaches to generating geological maps can be efficiently replaced with objective approaches capable of modelling uncertainty, such as SRF classification. This case study also revealed that traditional geological maps can be revisited to highlight the areas with the highest level of uncertainty.

Fig. 6

Comparing the results of the SRF and RF classifiers, a and b predicted geology maps via SRF and RF models respectively, c misclassified pixels, d omnidirectional experimental variograms for the geological class #9, e and f entropy maps for the SRF and RF models respectively

Table 1 Class-wise and overall error rate for the SRF and RF classifiers

This experiment suggests that if a field geologist were able to successfully identify all potential geological classes within a given spatial domain and collect a reasonable number of ground truth observations, then an SRF model could be trained (using the relevant spatial covariates) to predict the spatial distribution of the geological classes. The SRF model can provide a meaningful automated prediction of geological units, which could assist in geological mapping and help to guide fieldwork while minimising the time needed to produce a geological map.

3.2.3 SRF Regression

This experiment describes an approach for predicting in situ soil geochemistry (Cu concentration), even under deep regolith covering the bedrock. The relationships between surface geochemical measurements and subsurface geophysical data are investigated. Mineral exploration in such covered areas strongly depends on the use of subsurface geophysical data, as they can indirectly map the deep and covered geological structures that control mineralisation. However, the influence of cover on the geophysical data was not considered in this study. The SRF regressor was used to model Cu concentration across the study area based on subsurface geophysical covariates only (i.e., the magnetic and gravity layers).

The training samples consist of \(N={3{,}755}\) Cu observations, mostly located in the southern part of the study area (Fig. 4i). The number of spatial decision trees in the SRF regressor is set to \(J={1{,}000}\). The number of regionalised variables is \(R=3\) (Fig. 4b, c, and g), and \({E}_{r}=100\) for all covariates. The overall number of spatial predictors is \(P=\sum_{r=1}^{3}{E}_{r}=300\). The default value was selected for the parameter \(m=ceiling \left(\frac{\sum_{r=1}^{3}{E}_{r}}{3}\right)=100\) in the regression framework. The SRF regressor returned a measure of predictor importance that reveals the most significant covariates for Cu prediction (Fig. 7).

Fig. 7

Spatial predictor importance and the zone of influence for the input covariates obtained from the SRF regressor

The gravity layer is the most important covariate for predicting Cu concentration (Fig. 7). The zone of influence for the gravity covariate shows isotropic behaviour at small scales (less than 300 m) and anisotropic behaviour (NW–SE and NE–SW trends) at larger scales (greater than 900 m). The first vertical derivative (1VD) of the magnetic field is more important than the original magnetic VRTP layer for predicting Cu concentration. This may be partly related to the complex relationships between shallow magnetic features and Cu mineralisation. The zone of influence for the magnetic 1VD covariate is anisotropic, and the anisotropy directions change with scale.

The results of the SRF regressor were compared to those from a classical RF regressor, RF with buffer distances to observations as extra covariates (RFsp, Hengl et al. 2018), and the geographical random forests (GRF, Georganos et al. 2019). The classical RF regressor is based on the three geophysical covariates (\({E}_{r}=1; r=\left\{1,\dots ,3\right\}\)) in Fig. 4b, c, and g. In addition to these three covariates, the RFsp model uses Euclidean buffer distances to 375 (10\% of the total observations) randomly selected samples. Multiple RFsp models were built by iterating the random sampling of the Euclidean buffer covariates, and the most accurate RFsp model (i.e., lowest \({MSE}_{OOB}\)) was selected for prediction. Finally, the GRF model predicts the response values by fusing the results from multiple local sub-RFs and a global RF to account for spatially heterogeneous data. The local sub-RFs are trained using a number of nearby observations (i.e., an adaptive kernel). The most accurate GRF model was obtained by setting the number of nearest neighbours in the adaptive kernel to 400 and the fusing weights to 0.4 and 0.6 for the local and global models respectively. To implement the GRF analyses, the R package ‘SpatialML’ (https://cran.r-project.org/web/packages/SpatialML/index.html) was used.

The number of trees in all the experiments was set to 1,000. The default value (one third of the covariates) was selected for the number of covariates randomly sampled as candidates at each split in the trees.

The generalisation errors for the four random forests experiments were estimated via the out-of-bag mean squared error. The SRF regressor showed superior performance (\({MSE}_{OOB}={132{,}391}\)) compared to other models in terms of predicting OOB Cu concentrations. The GRF regressor was the second most accurate model (Table 2).

Table 2 Out-of-bag mean squared error for the four random forests regressors

Figure 8 shows the Cu predictions via the four regressors. The models differ, especially in terms of extrapolation under cover. For instance, unlike the SRF model, the RF regressor predicted an anomalous area in the southwestern part of the study area. Several unrealistic artefacts can be seen in the maps generated by the RFsp and GRF models. Consequently, these models are not recommended for extrapolation under deep regolith, which is the main goal of this case study.

Fig. 8

Cu prediction via four random forests regressors

The correlation between the true Cu concentrations and the Cu predictions (OOB cases) via the RF model is 0.518, while this correlation is 0.685 for the SRF model (Fig. 9a, b). The correlations for the RFsp and GRF models are 0.491 and 0.502 respectively (Fig. 9c, d). The conditional bias assessment for the four models can be seen in Fig. 9e to h, where the errors (\({\widehat{f}}_{OOB}\left(x({u}_{i})\right)-y\left({u}_{i}\right)\)) are plotted against the observed Cu concentrations. The lowest conditional bias was achieved by the SRF model.

Fig. 9

a to d Scatterplots of the true Cu concentrations and random forests predictions (OOB cases). e to h Distribution of errors against true Cu concentrations. The red lines are the locally weighted polynomial regressions and the scatterplots are coloured by kernel density estimates

Based on the implemented performance analyses, the predicted Cu map obtained from the SRF model is more accurate and reliable for Cu exploration. This experiment showed the potential of using geophysical datasets and soil assay samples to gain insight into the surface and subsurface geochemistry.

4 Discussion

Unlike other attempts to improve the spatial awareness of tree-based learners, the proposed spatial random forests model is capable of learning complex spatial patterns and of using that knowledge in classification, regression, and unsupervised modelling tasks. Multi-resolution covariates can be used as spatial predictors without any preprocessing (e.g., upscaling all covariates to a consistent resolution). The spatial framework for measuring predictor importance provides valuable knowledge about the underlying spatial structures of the predictors that control the patterns in the response variable. For instance, anisotropic behaviour at various scales can be captured by calculating the zone of influence for each regionalised variable. The unsupervised SRF model is very robust against noisy covariates that lack meaningful auto- and/or cross-correlations. The synthetic case study revealed the robustness of the SRF model in automatically ignoring such noisy covariates.

Multi-resolution gridded data can be used as input to the SRF model. However, if the input data do not lie on a regular grid, they should first be migrated to the closest nodes of a regular grid of suitable resolution or be rasterised using geostatistical techniques. The SRF model is not suitable for sparse scattered data unless representative training images are available for training. Missing values are allowed, and no separate imputation step is necessary.

Compared to the classical nonspatial random forests algorithm, the proposed spatial model is computationally demanding. However, the SRF algorithm is easily parallelisable, and parallel computing with multi-core processors can speed up such intensive calculations.

Although the spatial continuity of the predicted response variable is more realistic in the SRF model than in the noisy outcomes predicted by nonspatial learners such as RF, some artefacts or spatial discontinuities can be seen. Such spatial discontinuities are partly related to the prediction function: the response values are predicted for each pixel independently of the predictions at nearby pixels. This pixelwise, independent prediction is inconsistent with the spatial dependency of the response values. The other limitation is the nonspatial bootstrapping or random sampling used for training the base learners. Since spatial and/or temporal response variables are correlated through space and time, nonspatial bootstrapping or random sampling leads to over-optimistic learners. These limitations are topics for future studies.

5 Conclusions

A truly spatial random forests technique based on higher-order spatial statistics was introduced in this study. The proposed non-parametric technique is capable of learning multivariate spatial patterns of mixed type (i.e., continuous and categorical spatial patterns) and capturing complex nonlinear relationships. The algorithm measures predictor importance internally and estimates the zone of influence for each regionalised variable. Unlike deep learning techniques, the SRF model can appropriately handle high-dimensional systems in which the number of predictors is much larger than the number of observations. The algorithm accepts multi-resolution covariates without any preprocessing (i.e., interpolation, upscaling, and imputation). It can be used for both supervised and unsupervised learning. The case studies demonstrate applications where a spatial learner has the potential to improve, aid, or automate existing processes around the preparation and validation of geoscience data and interpretations.