1 Introduction

Gaussian processes (GPs) provide a useful tool for regression in supervised machine learning (Rasmussen and Williams 2006). The range of applications includes geophysics (Ray and Myer 2019), mining (Leung et al. 2022), hydrology (Yang et al. 2018), ecological monitoring (Grbić et al. 2013), robotics (Deisenroth et al. 2013), multi-sensor fusion (Osborne et al. 2008; Melkumyan et al. 2011) and remote sensing (Chlingaryan et al. 2016). In standard GP models, stationary covariance functions are generally used. This means that the covariance between any two points depends only on the lag, or Euclidean distance, irrespective of the location. A potential limitation of stationary GPs is that they may fail to adapt to variable smoothness in the function of interest (Gibbs 1998; MacKay 1998). According to Paciorek and Schervish (2003), this may be encountered in geophysical and geographical applications where domain knowledge suggests that the function may vary more quickly in some parts of the input space than in others. For example, in mountainous areas, environmental variables are likely to be much less smooth than in flat regions. In mining, kriging is the usual approach. A kriged surface is essentially as smooth as the data constraints allow—in many cases, smoother than the true surface (Bohling 2005). Spatial statistics researchers have made some progress in defining nonstationary covariance structures for kriging, a form of GP regression. The nonstationary covariance structure of Higdon et al. (1999)—for which Gibbs (1998) gives a special case—has been extended to a class of nonstationary covariance functions by Paciorek and Schervish (2003). However, nonstationary models are only as powerful as sample sufficiency allows. In data-deficient regions—which are commonly encountered in mining because assay sampling is sparse and costly—reliable estimation of the parameters may not even be possible. For this reason, this paper focuses on extending the capability of existing stationary covariance functions to better capture the inherent variability of geological/geochemical processes and produce high-quality GP regression results.

Fig. 1 Motivation—extending a class of well-established stationary kernels to improve the structural integrity of GP regression results

As motivation, Fig. 1a shows the chemical concentration of iron (Fe) at a test site. Figure 1b illustrates the posterior mean distribution obtained using GP and the standard Matérn 3/2 kernel, where the length scale and noise parameters indicated are found by maximizing the log-marginal likelihood. The main observation is the blurriness in the mean distribution. For geochemical data, GP regression results are often excessively smooth. Figure 1c highlights the main proposition of this paper: augmenting the standard kernels with a smoothing parameter \(\alpha \) enables a more fitting solution to be found—one far better at preserving spatial structures. Using a different \(\alpha \), Fig. 1c appears much sharper even though the remaining hyperparameters are identical. Indeed, the details missing in (b) relative to (c) can be seen in the residual image in Fig. 1d. In essence, this work introduces a new family of stationary covariance functions (\(K_\alpha \)) that are more capable of capturing the inherent variability of geochemical random processes than the existing stationary covariance functions (K). The baseline family K includes standard covariance functions such as the squared exponential (SE), exponential and Matérn. In contrast to most covariance functions, \(K_\alpha \) has the added flexibility of a parameter \(\alpha \) that controls the differentiability (level of smoothness) of sample functions drawn from the GP distribution; this parameter can be learned automatically from the input data.

This article is organized as follows. Section 2 briefly introduces Gaussian processes as a framework for modeling geochemical distributions. It explains the concept of over-smoothing and defines the smoothness of functions in Sobolev space. Section 3 formulates the new stationary covariance functions \(K_\alpha \) and considers the valid intervals of \(\alpha \) that make the resultant kernels positive semi-definite. Section 4 describes the data used in the experiments, the learning and inference procedures, and how performance is evaluated. It reviews a statistical measure called the structural similarity index (SSIM) which will be used to validate and quantify the spatial fidelity of GP regression models. Section 5 presents an analysis of the results. Section 6 explores some connections from a Fourier perspective and delves deeper into the results. Finally, concluding remarks are given in Sect. 7.

2 Background

2.1 Gaussian Processes: A Probabilistic Framework

Gaussian processes (GPs) represent a nonparametric technique for building a probabilistic model of a continuous function given a set of observations at known locations. In this paper, the quantities of interest are the concentrations of chemicals such as lead or zinc, mostly measured in parts per million (ppm). From the GP perspective, any single function value (i.e., any point on the function) is treated as Gaussian distributed. Hence, the stochastic function is completely characterized by a mean and a variance. This viewpoint sets it apart from deterministic interpolation, such as a fitted surface obtained by least-squares optimization.

Mathematically, a GP is an infinite collection of random variables, any finite number of which have a joint Gaussian distribution (Rasmussen and Williams 2006). Machine learning using GPs consists of two steps: training and inference. GPs usually contain initially unknown hyperparameters, and the training step is aimed at optimizing those hyperparameters to produce a probabilistic model that best represents the training data. The hyperparameters define the function only when considered together with the training data. The hyperparameters used in this work are the length scale, which describes the rate of change of the output; the amplitude; and an additional hyperparameter, which describes the level of smoothness of the predictive model. Once the optimal hyperparameters are found, they can be used during the inference step to predict the values of the function of interest at new locations. When creating a GP model, a covariance function must be chosen to describe the relationship between the inputs and outputs. The covariance function also defines the number and type of hyperparameters that are needed for the model.

Using \(x_i\in \mathbb {R}^d\) and \(y_i\in \mathbb {R}\) to denote the input and output, respectively, the supervised learning problem uses a given N-point training set \(T=\{x_i,y_i\}_{i=1}^N\) to compute the predictive distribution \(f(x_*)\) at a new test point \(x_*\). A vector notation, \(\textbf{x}_*\), is used to represent a collection of test points. Since the GP model places a multivariate Gaussian distribution over the space of function variables \(f(\textbf{x})\), the GP is fully specified by its mean function \(m(\textbf{x})\) and covariance function \(k(\textbf{x},\textbf{x}'):\,f(\textbf{x})\sim GP(m(\textbf{x}),k(\textbf{x},\textbf{x}'))\). Suppose the function values are unknown and to be determined at the test points \(T_*=\{x_{*i},y_{*i}\}_{i=1}^M\). The joint Gaussian distribution with zero mean function and covariance function K is

$$\begin{aligned} \begin{bmatrix}\textbf{f} \\ \textbf{f}_*\end{bmatrix} \sim \mathcal {N}\left( \textbf{0},\, \begin{bmatrix} K(X,X) & K(X,X_*)\\ K(X_*,X) & K(X_*,X_*) \end{bmatrix}\right) , \end{aligned}$$
(1)

where \(\textbf{f}\) and \(\textbf{f}_*\) are noise-free values of the function, \(\mathcal {N}(\mu ,\Sigma )\) is a multivariate Gaussian distribution with mean \(\mu \) and covariance \(\Sigma \), and K is used to denote the covariance matrix computed between all points in the set. In particular, the covariance matrix between observed and unobserved locations has the form

$$\begin{aligned} K(X,X_*) = \begin{bmatrix} k(x_1,x_{*1}) & \ldots & k(x_1,x_{*M})\\ \vdots & \ddots & \vdots \\ k(x_N,x_{*1}) & \ldots & k(x_N,x_{*M}) \end{bmatrix} = K(X_*,X)^T. \end{aligned}$$
(2)

If we assume observations with Gaussian noise \(\varepsilon \) and noise variance \(\nu ^2\) such that \(y=f(x)+\varepsilon \), then the joint distribution becomes

$$\begin{aligned} \begin{bmatrix}\textbf{y} \\ \textbf{f}_*\end{bmatrix} \sim \mathcal {N}\left( \textbf{0},\, \begin{bmatrix} K(X,X)+\nu ^2 I & K(X,X_*)\\ K(X_*,X) & K(X_*,X_*) \end{bmatrix}\right) . \end{aligned}$$
(3)

In this model, the measurement noise is assumed to be unbiased, independent and identically distributed. The key predictive equations for GP regression can be obtained by conditioning on the observed training points (Murphy et al. 2014). The resulting predictive distribution for the points being estimated can be obtained as

$$\begin{aligned} p(\textbf{f}_*\!\mid \!(X_*,X),\textbf{y})=\mathcal {N}(\mu _*,\Sigma _*), \end{aligned}$$
(4)

where the predictive mean and covariance are given via the formulas

$$\begin{aligned} \mu _*&=K(X_*,X)\left[ K(X,X)+\nu ^2 I\right] ^{-1}\textbf{y}, \end{aligned}$$
(5)
$$\begin{aligned} \Sigma _*&=K(X_*,X_*)-K(X_*,X)\left[ K(X,X)+\nu ^2 I\right] ^{-1}K(X,X_*)+\nu ^2 I. \end{aligned}$$
(6)

The predicted mean value \(\mu _*\) in Eq. 5 is the main outcome of GP regression. It also coincides with the solution of kernel ridge regression, where \(\nu ^2\) serves to alleviate overfitting (Akian et al. 2022). The diagonal of the matrix \(\Sigma _*\) gives the variance representing the uncertainty of those predictions.
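For readers who prefer code, the following is a minimal Python sketch of Eqs. 5 and 6; the function name gp_predict and the callable kernel k are illustrative conveniences, not part of any particular library.

```python
import numpy as np

def gp_predict(X, y, X_star, k, noise_var):
    """Minimal sketch of Eqs. 5-6: GP posterior mean and covariance.

    k(A, B) is assumed to return the covariance matrix between two
    sets of points; names here are illustrative only.
    """
    K = k(X, X)                                   # K(X, X)
    K_s = k(X, X_star)                            # K(X, X_*)
    K_ss = k(X_star, X_star)                      # K(X_*, X_*)
    # Cholesky factorization of K + nu^2 I for numerical stability
    L = np.linalg.cholesky(K + noise_var * np.eye(len(X)))
    w = np.linalg.solve(L.T, np.linalg.solve(L, y))    # [K + nu^2 I]^{-1} y
    mu_star = K_s.T @ w                                # Eq. 5
    V = np.linalg.solve(L, K_s)
    Sigma_star = K_ss - V.T @ V + noise_var * np.eye(len(X_star))  # Eq. 6
    return mu_star, Sigma_star
```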

Training a GP model is equivalent to learning the hyperparameters of the covariance function from a dataset. In the Bayesian framework, this can be performed by maximizing the log of the marginal likelihood (LML) with respect to \(\theta \)

$$\begin{aligned} \log p(\textbf{y}\mid X,\theta )=-\frac{1}{2}\textbf{y}^T\left[ K(X,X)+\nu ^2 I\right] ^{-1}\textbf{y} - \frac{1}{2}\log \left| K(X,X)+\nu ^2 I\right| -\frac{N}{2}\log 2\pi , \end{aligned}$$
(7)

where \(\left| \,\cdot \,\right| \) denotes the determinant. The marginal likelihood in Eq. 7 contains three terms that represent (from left to right) the data fit, the complexity penalty (embodying the Occam’s razor principle) and the normalization constant. The first two terms in Eq. 7 depend on the values of the hyperparameters. As the marginal likelihood is a non-convex function of the hyperparameters, only local maxima can be obtained (Melkumyan and Ramos 2009); in this work, local maxima were obtained via gradient-based optimization from multiple starting points.
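As a companion sketch (again with illustrative names), the three terms of Eq. 7 can be evaluated cheaply by reusing a Cholesky factor, since \(\log \left| K+\nu ^2 I\right| =2\sum _i \log L_{ii}\):

```python
import numpy as np

def log_marginal_likelihood(X, y, k, noise_var):
    """Sketch of Eq. 7: data fit + complexity penalty + normalization."""
    N = len(X)
    L = np.linalg.cholesky(k(X, X) + noise_var * np.eye(N))
    w = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ w                      # -0.5 y^T [K + nu^2 I]^{-1} y
    complexity = -np.sum(np.log(np.diag(L)))     # -0.5 log|K + nu^2 I|
    norm_const = -0.5 * N * np.log(2.0 * np.pi)
    return data_fit + complexity + norm_const
```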

2.2 Connections

The smoothing phenomenon seen in Fig. 1 has been observed in kriging and reported by Journel et al. (2000) and Yamamoto (2005), among others. The problem manifests as conditional bias and, more specifically, a tendency to underestimate large values and overestimate small values. This smoothing effect is as relevant for kriging interpolation (unbiased linear estimation) as it is for Gaussian process regression, since the GP mean in (5) may be expressed as a linear combination of kernel functions centered on the training points, as shown in (8), where \(\alpha _i\) denotes the ith element of the weight vector \(\left[ K(X,X)+\nu ^2 I\right] ^{-1}\textbf{y}\) (not to be confused with the smoothness parameter \(\alpha \) introduced in Sect. 3).

$$\begin{aligned} \mu _*&=\sum _{i=1}^N \alpha _i k(\textbf{x}_*,\textbf{x}_i). \end{aligned}$$
(8)

For kriging, this smoothing effect is explained by a deficit of variance, \(\text {Var}\{Z(\textbf{x})\}\!-\!\text {Var}\{Z_K^*(\textbf{x})\}\!=\!\sigma _K^2\ge 0\), which increases as the distance increases between the estimated location and known data (Journel et al. 2000). This discrepancy is usually corrected during post-processing with a cross-validation procedure that aims to reproduce the semivariogram. In Olea and Pawlowsky (1996), the estimation error is compensated using linear regression, whereas in Yamamoto (2005), cross-validation is used to estimate the kriging interpolation variance and estimation error. Following a different path, Yao (1998) proposed a conditional simulation approach which imparts structural information (the correct covariance model) on the kriging estimate using Fourier coefficients in the spectral domain. This iterative approach removes smoothing artifacts, albeit at the cost of local accuracy (a poorer estimate vs. true value correlation). Readers are referred to Journel et al. (2000) for a discourse on the limitations of kriging, conditional bias, uncertainty assessment, and the conflicting objectives of retaining global accuracy (reproducing texture, structures or covariance) and preserving local accuracy (minimization of error variance).

Outside of geostatistics, kriging is sometimes seen through the lens of GP in the machine learning community. In particular, kriging models the residual component using a stationary GP with zero mean (Shekaramiz et al. 2019). For Gaussian processes, it is instructive to consider “smoothing” as a smoothness misspecification that yields a flatter GP mean function, \(\mu _*\), relative to the target function, f. Formally, the smoothness of a function can be measured in terms of the number of derivatives in Sobolev space \(W_p^k(\mathbb {R}^d)\) (Wynne et al. 2021). For the case of \(L^p\) norm with \(p=2\),

$$\begin{aligned} W_2^k(\mathbb {R}^d)=\left\{ f\in L^2(\mathbb {R}^d):\Vert f\Vert _{W_2^k(\mathbb {R}^d)}^2 :=\int _{\mathbb {R}^d}\left( 1+\Vert \omega \Vert _2^2\right) ^k |\hat{f}(\omega )|^2 d\omega < \infty \right\} , \end{aligned}$$
(9)

where \(\hat{f}(\omega )\) is the Fourier transform of f, \(\Vert \cdot \Vert _2\) denotes the Euclidean norm, and \(k>d/2\). A function f has smoothness \(m_0\) in the Sobolev sense if it is integrable and has a finite norm, \(\Vert f\Vert _{W_2^k(\mathbb {R}^d)}^2\), as specified in (9), for all real \(k<m_0\), where \(m_0=\sup \left\{ k\ge 0: f\in W_2^k(\mathbb {R}^d)\right\} \). The Fourier transforms of the true autocorrelation function, \(\Phi \), and the GP kernel-estimated correlation function, \(\Psi _\theta \), are assumed to satisfy the following bounds (Wang and Jing 2022)

$$\begin{aligned}&c_1\left( 1+\Vert \omega \Vert _2^2\right) ^{-m_0}\le \mathcal {F}(\Phi )(\omega ) \le c_2\left( 1+\Vert \omega \Vert _2^2\right) ^{-m_0}, \forall \omega \in \mathbb {R}^d, \end{aligned}$$
(10)
$$\begin{aligned}&c_3\left( 1+\Vert \omega \Vert _2^2\right) ^{-m_\theta }\le \mathcal {F}(\Psi _\theta )(\omega ) \le c_4\left( 1+\Vert \omega \Vert _2^2\right) ^{-m_\theta }, \forall \omega \in \mathbb {R}^d. \end{aligned}$$
(11)

The \(\theta \) subscript emphasizes that the auto-correlation estimate is obtained using a kernel with parameters \(\theta \). Over-smoothing occurs when \(m_\theta > m_0\). This has a clear interpretation based on the Wiener–Khintchine theorem (for a wide-sense stationary process, the power spectrum of f is given by the Fourier transform of its autocorrelation function, \(\mathcal {F}(\Phi )\)): it indicates that the kernel power spectrum decays with spatial frequency at a faster rate than that of the target function. \(\mathcal {F}(\Psi )\) may be rewritten in product form as shown in (12). By the convolution theorem and the conjugate property, this is equivalent to convolving the true autocorrelation function with an equivalent kernel (interpolator) in the spatial domain.

$$\begin{aligned} \mathcal {F}(\Psi )(\omega ) = \mathcal {F}(\Phi )(\omega )|Q_{K_\theta }(\omega )|^2. \end{aligned}$$
(12)

This expression shows that the unknown power spectrum \(\mathcal {F}(\Phi )(\omega )\) is shaped by an equivalent kernel frequency response, \(Q_{K_\theta }(\omega )\), which depends on the kernel hyperparameters \(\theta \). \(Q_{K_\theta }(\omega )\) represents a decreasing function that decays at a rate of \(1/(1+\Vert \omega \Vert _2^2)^s\) for some \(s>0\). The main source of smoothing considered in this paper is a significant mismatch in the rate of decay, namely \(m_\theta \gg m_0\), which contributes to over-smoothing in the GP mean function estimate. The proposed \(\alpha \) mechanism allows this decay rate to be adjusted to compensate for over-smoothing.

3 Formulation

This section considers the properties of a covariance function that impact the smoothness of models. The formula for the augmented stationary covariance functions \(K_\alpha \) is presented, and the valid intervals for the smoothness parameter (\(\alpha \)) are determined.

3.1 Kernel Attributes that Affect the Smoothness of Models

The level of smoothness of the stochastic process \(f(\textbf{x})\) generated by the GP impacts the level of smoothness of the models produced by GP inference. The smoothness of the process \(f(\textbf{x})\) depends on the smoothness of its covariance function \(k(\textbf{x},\textbf{x}')\), where \(\textbf{x},\textbf{x}'\in \mathbb {R}^d\). The following observation regarding mean square (MS) continuity and differentiability is due to Rasmussen and Williams (2006). For stationary processes, if the 2nth order partial derivative \(\partial ^{2n}k(\textbf{x})/\partial x_{i_1}^2\ldots \partial x_{i_n}^2\) exists and is finite at \(\textbf{x}=0\), then the nth order partial derivative \(\partial ^{n}f(\textbf{x})/\partial x_{i_1}\ldots \partial x_{i_n}\) exists for all \(\textbf{x}\in \mathbb {R}^d\) as a mean square limit. It is the properties of the kernel \(k(\textbf{x})\) around \(\textbf{0}\) that determine the smoothness properties (MS differentiability) of a stationary process.

Fig. 2 Different behaviors for different covariance functions at the origin

Therefore, the behavior of a stationary covariance function at the origin affects the smoothness of the GP predictive model. In one dimension, these differences in behavior can be seen in Fig. 2 for the exponential, squared exponential, Matérn and sparse (raised-cosine) kernels. This sparse kernel, \(h(d)=\sigma _f^2 \left[ \frac{1}{3}\left( 2+\cos (\frac{2d}{l})\right) \left( 1-\frac{d}{l\pi }\right) +\frac{1}{2\pi }\sin \left( \frac{2d}{l}\right) \right] \) for \(d\le l\pi \) where \(d=\left| x-x'\right| \), and \(h(d)=0\) otherwise, is described in Melkumyan and Ramos (2009).

3.2 Stationary Covariance Functions with Variable Smoothness (\(\alpha \))

To modify the behavior of stationary covariance functions at the origin, a new covariance function is proposed

$$\begin{aligned} K_{\alpha }(l,\sigma _f,\alpha )=\sigma _f^2\left( 1-\left[ 1-\frac{K_\text {cov}(l,\sigma _f)}{\sigma _f^2}\right] ^{\alpha }\right) . \end{aligned}$$
(13)

The proposed covariance function has the following hyperparameters: \(l=[l_x,l_y]\) is a two-dimensional length scale vector, \(\sigma _f\) is the amplitude hyperparameter and \(\alpha \) is interpreted as a smoothness parameter. All these hyperparameters can be learned from the input data based on Bayesian methods using the marginal likelihood in Eq. 7.

The proposed covariance function in Eq. 13 can adjust the smoothness of the base covariance function \(K_\text {cov}\) by changing the value of \(\alpha \). When \(\alpha \) equals 1, the base covariance function is reproduced, that is, \(K_\alpha (l,\sigma _f,\alpha \!=\!1)=K_\text {cov}(l,\sigma _f)\). Any stationary covariance function can be chosen as the base covariance function \(K_\text {cov}\); this includes the exponential, Matérn and sparse kernels for instance. When \(K_\alpha \) is applied to a base covariance function, it preserves the level of sparsity in \(K_\text {cov}\). In fact, \(K_\alpha (l,\sigma _f,\alpha )=0\) wherever \(K_\text {cov}\!=\!0\) for all \(\alpha \).
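A minimal sketch of Eq. 13 is given below for a Matérn 3/2 base kernel expressed as a function of distance; the distance-based parameterization and the function names are illustrative assumptions, not code from this paper.

```python
import numpy as np

def matern32(r, length_scale, sigma_f):
    """Matern 3/2 base covariance K_cov as a function of distance r."""
    s = np.sqrt(3.0) * np.asarray(r) / length_scale
    return sigma_f**2 * (1.0 + s) * np.exp(-s)

def k_alpha(r, length_scale, sigma_f, alpha, base_cov=matern32):
    """Augmented covariance K_alpha of Eq. 13."""
    K = base_cov(r, length_scale, sigma_f)
    return sigma_f**2 * (1.0 - (1.0 - K / sigma_f**2) ** alpha)
```

Setting alpha to 1 reproduces the base kernel exactly, and any distance at which the base kernel vanishes also yields a zero augmented covariance, consistent with the sparsity-preservation property noted above.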

To find the hyperparameters, the partial derivatives of the marginal likelihood with respect to the hyperparameters are needed. Differentiating Eq. 7 yields (Rasmussen and Williams 2006)

$$\begin{aligned} \frac{\partial }{\partial \theta _j}\log p(\textbf{y}\mid X,\theta )&=\frac{1}{2}\textbf{y}^T K^{-1}\frac{\partial K}{\partial \theta _j}K^{-1}\textbf{y}-\frac{1}{2} \text {tr}\left( K^{-1}\frac{\partial K}{\partial \theta _j}\right) \\&{\mathop {=}\limits ^{K\leftarrow K_\alpha }}\frac{1}{2}\text {tr}\left( \left( \textbf{w}\textbf{w}^T-K_\alpha ^{-1}\right) \frac{\partial K_\alpha }{\partial \theta _j}\right) ,\text { where }\textbf{w}=K_\alpha ^{-1}\textbf{y}.\nonumber \end{aligned}$$
(14)

The gradient of the proposed covariance function, \(K_\alpha \), can be analytically expressed via the gradient of the base covariance function as follows

$$\begin{aligned} \frac{\partial K_\alpha (l,\sigma _f,\alpha )}{\partial l}&={\left\{ \begin{array}{ll}\alpha \left[ 1-\frac{K_\text {cov}(l,\sigma _f)}{\sigma _f^2}\right] ^{\alpha -1}\frac{\partial K_\text {cov}(l,\sigma _f)}{\partial l}, & \text {if }x\ne x'\\ 0, & \text {if }x=x'\end{array}\right. }, \end{aligned}$$
(15)
$$\begin{aligned} \frac{\partial K_\alpha (l,\sigma _f,\alpha )}{\partial \sigma _f}&=\frac{2}{\sigma _f}K_\alpha (l,\sigma _f,\alpha ), \end{aligned}$$
(16)
$$\begin{aligned} \frac{\partial K_\alpha (l,\sigma _f,\alpha )}{\partial \alpha }&={\left\{ \begin{array}{ll}-\sigma _f^2\left[ 1-\frac{K_\text {cov}(l,\sigma _f)}{\sigma _f^2}\right] ^{\alpha }\ln \left[ 1-\frac{K_\text {cov}(l,\sigma _f)}{\sigma _f^2}\right] , & \text {if }x\ne x'\\ 0, & \text {if }x=x'\end{array}\right. }. \end{aligned}$$
(17)

These partial derivatives are utilized by constrained, gradient-based optimization solvers to find appropriate kernel hyperparameters, as sketched below.
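Continuing the sketch of Sect. 3.2 (reusing the illustrative matern32 and k_alpha helpers), Eq. 17 can be validated against a central finite difference; the guard at zero distance reflects the zero derivative in Eqs. 15 and 17.

```python
import numpy as np

def dk_dalpha(r, length_scale, sigma_f, alpha, base_cov=matern32):
    """Gradient of K_alpha with respect to alpha (Eq. 17)."""
    K = base_cov(r, length_scale, sigma_f)
    u = 1.0 - K / sigma_f**2
    with np.errstate(divide="ignore", invalid="ignore"):
        g = -sigma_f**2 * u**alpha * np.log(u)
    return np.where(np.asarray(r) == 0, 0.0, g)   # derivative is 0 at r = 0

# Finite-difference check at alpha = 0.5 (unit length scale and amplitude)
r = np.linspace(0.0, 3.0, 7)
eps = 1e-6
fd = (k_alpha(r, 1.0, 1.0, 0.5 + eps) - k_alpha(r, 1.0, 1.0, 0.5 - eps)) / (2 * eps)
assert np.allclose(fd, dk_dalpha(r, 1.0, 1.0, 0.5), atol=1e-5)
```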

3.3 Intuition Behind \(\alpha \)

Qualitatively, the smoothness parameter changes the shape of the base covariance function, as shown in Fig. 3. Generally speaking, decreasing \(\alpha \) from 1 towards 0 reduces the span of the kernel; this in turn makes the interpolating function more spatially localized. As a corollary, decreasing \(\alpha \) increases the bandwidth of the kernel frequency response in the Fourier domain. This can be seen clearly from the power density spectra in Fig. 4, which show that the base kernel \(K_{\alpha =1}\) significantly attenuates high-frequency content that corresponds to edge structures and geochemical discontinuities in the spatial domain. This behavior explains why the base kernels tend to introduce more blurriness into GP regression results. The effect of reducing \(\alpha \) is that it relaxes this cutoff frequency and allows more structural information to pass through. Hence, the augmented kernel \(K_\alpha \) provides a frequency tuning mechanism.

Fig. 3 Effects of \(\alpha \) on the shape of various kernels (stationary covariance functions) in the spatial domain. In this illustration, the parameters \((l_x,l_y,\sigma _f)\) are all set to 1

Fig. 4 Effects of \(\alpha \) on the power density spectrum of kernels in the frequency domain

3.4 Positive Semi-Definiteness and Valid Intervals for \(\alpha \)

A continuous translation-invariant function f of vector variable \(\textbf{x}\in \mathbb {R}^d\) is said to be positive semi-definite (psd) if

$$\begin{aligned} \sum _{m,n=1}^N c_m c_n f(\textbf{x}_m-\textbf{x}_n) = \textbf{c}^T F\textbf{c}\ge 0 \end{aligned}$$
(18)

for any \(\textbf{x}_1,...,\textbf{x}_N\in \mathbb {R}^d\), given \(c_1,...,c_N\in \mathbb {R}\) and \(N\in \mathbb {N}\). This is equivalent to requiring F, whose elements are given by \(F_{m,n}=f(\textbf{x}_m-\textbf{x}_n)\) for \(1\le m,n\le N\), to be a Gram matrix which is Hermitian psd. In particular, f is a symmetric positive semi-definite function if and only if F corresponds to the covariance of a GP. However, not all symmetric functions are valid covariance kernels. According to Bochner’s theorem, a shift-invariant kernel \(K_\alpha \) is psd if and only if its Fourier transform, \(\hat{k}(\omega )\), has non-negative values. The Fourier transforms of symmetric, real-valued kernels are also symmetric and real-valued. Hence, the integral \(\hat{k}_\alpha (\omega )=\frac{1}{2\pi }\int _{-\infty }^{\infty }K_\alpha (t) e^{-i\omega t} dt \propto \int _0^{\infty }K_\alpha (t) \cos (\omega t) dt\) is computed numerically for various \(\alpha \) to determine the valid interval for which the proposed covariance function \(K_\alpha \) is positive semi-definite. Figure 5(top) shows a series of \(\hat{k}_\alpha (\omega )\) obtained by varying \(\alpha \) over the valid intervals. Figure 5(bottom) shows a few cases of \(\alpha \) for which the non-negative requirement is violated.
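A sketch of this numerical check is shown below for the exponential base kernel with unit length scale; the grid sizes, frequency range and tolerance are illustrative choices.

```python
import numpy as np

def is_psd_numeric(kernel_1d, alpha, t_max=50.0, n=2**14, tol=-1e-8):
    """Bochner check: the cosine transform of K_alpha must be non-negative."""
    t = np.linspace(0.0, t_max, n)
    k_t = kernel_1d(t, alpha)
    omegas = np.linspace(0.0, 20.0, 400)
    # \hat{k}(w) is proportional to the cosine transform of K_alpha(t)
    spectrum = [np.trapz(k_t * np.cos(w * t), t) for w in omegas]
    return min(spectrum) >= tol

# Exponential base kernel augmented with alpha (Eq. 13 with K_cov = e^{-t})
exp_k_alpha = lambda t, a: 1.0 - (1.0 - np.exp(-t)) ** a
print(is_psd_numeric(exp_k_alpha, 0.5))   # True for alpha inside the valid interval
```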

Fig. 5 The Fourier transform of \(K_\alpha \) computed by numerical integration

The valid intervals of \(\alpha \) that guarantee a positive semi-definite \(K_\alpha \) are obtained; these are shown in Table 1.

Table 1 Valid interval of \(\alpha \) for which \(K_\alpha \) is positive semi-definite

Based on these results, GP regression results produced by the standard kernels (\(K_\text {cov}\equiv K_{\alpha =1}\)) and augmented kernels (\(K_\alpha \)) were compared. Figure 6 shows the expected smoothing behavior (slow transition and Gibbs oscillation) when the base kernels, with \(\alpha \) set to 1, respond to a step change. This is consistent with the observations made in Sect. 3.3 whereby \(\alpha =1\) is responsible for a wider kernel span and narrower frequency passband. As \(\alpha \) is reduced, the GP mean prediction responds almost instantaneously to the step change. The interpolation becomes less distorted as higher spatial frequencies are better preserved.

Fig. 6 GP regression results for a noisy step function as \(\alpha \) is varied

4 Materials and Methods

This section describes the data and procedures used in the experiments.

4.1 The Northern Great Basin (NGB) Geochemical Dataset

This public-domain multi-element geochemical dataset was compiled by the US Geological Survey and other agencies for mineral and environmental assessments. It contains 10,261 measurements of surficial materials (stream sediment and soil samples) from a period that predates large-scale mining, covering northern Nevada, south-eastern Oregon and the north-eastern tip of California (see Fig. 7). In Coombs et al. (2002), the Mesozoic cratonal margin and the approximate extent of Tertiary volcanic cover are overlaid to highlight potential correlations between geological features and geochemistry. The majority of the samples were analyzed by the inductively coupled plasma (ICP) method, which involves dissolving a sample in a series of acids and analyzing the resultant solution by inductively coupled plasma/atomic emission spectroscopy. Chemical concentrations are expressed in ppm (not as a percentage) unless otherwise stated.

Fig. 7 Northern Great Basin dataset: (left) geographical area, (right) sample locations

4.2 Learning Kernel Hyperparameters and Inference

In this study, length scale parameters for \(K_\alpha \) are learned by maximizing the marginal likelihood in Eq. 7 for two augmented kernels, which derive from the Matérn 3/2 and squared exponential base kernels. A set of hyperparameters \(\theta =[l_x,l_y,\sigma _f,\nu ,\alpha ]\) is found for each of the fifteen chosen chemicals: Al, As, Ba, Be, Ce, Fe, Mg, Mn, Na, Ni, P, Pb, Sb, V and Zn. For this dataset, it was appropriate to apply a logarithmic transformation to the chemical measurements, \(\textbf{z}\), before GP processing. This has the effect of making the geochemical distributions less skewed (see Fig. 8).

Fig. 8 Logarithmic transformation corrects the skew in the distribution of Fe

Thus, the \(\textbf{y}\) vector is related to the raw measurements via \(\textbf{y}=g^{-1}(\textbf{z})\), where \(g^{-1}\equiv \log \). As a consequence, the GP posterior mean and variance predictions with respect to \(\textbf{z}\) are obtained from the moment estimates of Y using Taylor expansion for moments of functions of random variables. Writing \(\mu _Y\equiv \mu _*(\textbf{x}_*;\theta ,\alpha )\) and \(\sigma _Y\equiv \sigma _*(\textbf{x}_*;\theta ,\alpha )\),

$$\begin{aligned} \mathbb {E}[Z]&=\mathbb {E}[g(Y)]\approx g(\mu _Y) + \frac{1}{2}g''(\mu _Y) \sigma _Y^2, \end{aligned}$$
(19)
$$\begin{aligned} \text {var}[Z]&=\text {var}[g(Y)]\approx \left( g'(\mu _Y)\right) ^2 \sigma _Y^2 - \frac{1}{4}\left( g''(\mu _Y)\right) ^2 \sigma _Y^4, \end{aligned}$$
(20)

where \(g(\cdot )\!\equiv \exp (\cdot )\). Without loss of generality, the data Y are assumed to be centered.

4.3 Measuring Spatial Fidelity Using the Structural Similarity Index

In this study, the set of locations \(X_*=\{x_{*i}\}_{i=1}^{M}\) for which f is unknown are defined over a dense uniform grid that covers a two-dimensional modeling region. To evaluate the quality of the GP regression for \(X_*\), the predicted mean \(\mu _*(\textbf{x}_*;\theta ,\alpha )\in \mathbb {R}^M\) obtained using the hyperparameters \(\theta \) and \(\alpha \) is compared with a reference scalar field, \(v_\text {ref}(\textbf{x}_*)\in \mathbb {R}^M\), defined by the training samples T. The computational details of \(v_\text {ref}(\textbf{x}_*)\) are described in Sect. 4.4. For validation, an established measure known as structural similarity index (SSIM) is chosen to indicate how well GP regression preserves spatial structure in the underlying distribution. When \(\textbf{u}\equiv \mu _*(\textbf{x}_*;\theta ,\alpha )\) is compared with \(\textbf{v}\equiv v_\text {ref}(\textbf{x}_*)\), the structural similarity index proposed by Wang et al. (2004) is computed from the statistical moments of \(\textbf{u}\) and \(\textbf{v}\), and defined as a product of three similarity terms

$$\begin{aligned} \text {luminance (mean) as }&l(\textbf{u},\textbf{v})=\frac{2\mu _u \mu _v+C_1}{\mu _u^2+\mu _v^2+C_1}, \end{aligned}$$
(21)
$$\begin{aligned} \text {contrast (variance) as }&c(\textbf{u},\textbf{v})=\frac{2\sigma _u \sigma _v+C_2}{\sigma _u^2+\sigma _v^2+C_2}, \end{aligned}$$
(22)
$$\begin{aligned} \text {shape (correlation) as }&s(\textbf{u},\textbf{v})=\frac{\sigma _{u,v}+C_3}{\sigma _u\sigma _v+C_3}, \end{aligned}$$
(23)

where \(\mu _u\), \(\mu _v\), \(\sigma _u\), \(\sigma _v\) and \(\sigma _{u,v}\) represent the local means, local standard deviations, and local covariance of u and v, respectively; \(C_1\), \(C_2\) and \(C_3= \frac{C_2}{2}\) represent three tiny positive constants. The local mean, for instance \(\mu _v\), may be estimated by convolving \(v(x_{*})\) with a Gaussian window function with a standard deviation of 1.5 times the \(X_*\) grid spacing. The local covariance is estimated in a similar manner, by convolving \((u(x_{*})-\hat{\mu }_u)(v(x_{*})-\hat{\mu }_v)\) with the same Gaussian low-pass filter. Hence, the values of \(\mu _u\), \(\mu _v\), \(\sigma _u\), \(\sigma _v\) and \(\sigma _{u,v}\) are location-dependent. These signal processing concepts are described in Oppenheim et al. (2001).

$$\begin{aligned} \text {SSIM}(\textbf{u},\textbf{v})&=\left| l(\textbf{u},\textbf{v})\right| \cdot \left| c(\textbf{u},\textbf{v})\right| \cdot \left| s(\textbf{u},\textbf{v})\right| \nonumber \\&=\frac{(2\mu _u\mu _v + C_1)(2\sigma _{u,v} + C_2)}{(\mu _u^2+\mu _v^2+C_1)(\sigma _u^2+\sigma _v^2+C_2)}\in \mathbb {R}_{+}^M. \end{aligned}$$
(24)

For our purpose, SSIM may be understood as a spatial degradation (or quality) measure. It satisfies the symmetry, boundedness and unique maximum properties, namely \(\text {SSIM}(\textbf{u},\textbf{v})\!=\!\text {SSIM}(\textbf{v},\textbf{u})\), \(\text {SSIM}(\textbf{u},\textbf{v})\!\le \!1\), and \(\text {SSIM}(\textbf{u},\textbf{v})\!=\!1\) if and only if \(\textbf{u}\!=\!\textbf{v}\). A related spatial distortion metric can be derived from SSIM. This metric, shown in Eq. 25, is referred to as the normalized root mean square error (NRMSE)

$$\begin{aligned} \text {NRMSE}(\textbf{u},\textbf{v})=\sqrt{1-c(\textbf{u},\textbf{v})s(\textbf{u},\textbf{v})}. \end{aligned}$$
(25)

Brunet et al. (2011) have shown that this metric satisfies quasi-convexity. In general, this is a useful property for nonlinear optimization, as it ensures the existence of a global minimum on any convex subset of the function domain.
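The following sketch computes the local SSIM and NRMSE maps of Eqs. 21-25 with a Gaussian window (standard deviation of 1.5 grid cells); the default values of C1 and C2 are illustrative placeholders for the tiny stabilizing constants described above.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def ssim_nrmse_maps(u, v, C1=1e-4, C2=9e-4, sigma=1.5):
    """Local SSIM (Eq. 24) and NRMSE (Eq. 25) maps for 2-D fields u, v."""
    mu_u, mu_v = gaussian_filter(u, sigma), gaussian_filter(v, sigma)
    var_u = gaussian_filter(u * u, sigma) - mu_u**2       # local variances
    var_v = gaussian_filter(v * v, sigma) - mu_v**2
    cov_uv = gaussian_filter(u * v, sigma) - mu_u * mu_v  # local covariance
    sd_uv = np.sqrt(var_u.clip(0) * var_v.clip(0))        # sigma_u * sigma_v
    ssim = ((2 * mu_u * mu_v + C1) * (2 * cov_uv + C2)
            / ((mu_u**2 + mu_v**2 + C1) * (var_u + var_v + C2)))   # Eq. 24
    contrast = (2 * sd_uv + C2) / (var_u + var_v + C2)             # Eq. 22
    shape = (cov_uv + C2 / 2) / (sd_uv + C2 / 2)                   # Eq. 23
    nrmse = np.sqrt(np.clip(1.0 - contrast * shape, 0.0, None))    # Eq. 25
    return ssim, nrmse
```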

4.4 Experiment 1: Evaluating the Effects of \(\alpha \)

The objective of the first experiment is to demonstrate both subjective and objective improvement when the augmented covariance functions, \(K_\alpha \), are used in GP regression. In Sect. 3.4, this has been shown to be true for a signal that resembles a step function. Based on kernel frequency response arguments, the expectation is that this trend will hold for chemical distributions found in the Northern Great Basin (NGB) geochemical dataset.

The hyperparameters are learned by maximizing the log-marginal likelihood (LML) as described previously, with lower bounds [0.01, 0.01, 0.001, 0.01, 0.1] and upper bounds \([0.5,0.5,\infty ,\text {percentile}(\Delta \textbf{z},95)\!\times \!2,1]\) imposed on \([l_x,l_y,\sigma _f,\nu ,\alpha ]\), where \(\Delta \textbf{z}\) represents the first-order difference performed on a sorted sequence of chemical measurements, \(\text {sort}(\textbf{z})\). For computational efficiency, GP training is performed on \(T_s\), an \(L(\!=\!2000)\) point random subset of the supplied data, \(T=\{(x_i,y_i\!=\!\log (z_i))\}_{i=1}^{N=10261}\), that minimizes the KL divergence.

The results presented in Sect. 5.1 will consist of visual inspection and quantitative analysis. For quantitative analysis, the \(\chi ^2\) statistic and the mean structural similarity index (see SSIM in Sect. 4.3) averaged over the modeled region will provide a measure of the spatial fidelity of the GP mean estimates \(\mu _*(\textbf{x}_*;\theta ,\alpha )\in \mathbb {R}^M\) obtained using K and \(K_\alpha \) with respect to the reference scalar field, \(v_\text {ref}(\textbf{x}_*)\in \mathbb {R}^M\), computed from the full dataset T. In practice, the normalized statistic \(\bar{\chi }^2\!=\!\tfrac{1}{M}\chi ^2(\textbf{u},\textbf{v})\) is used, where \(\chi ^2(\textbf{u},\textbf{v})=\sum _{i=1}^M \frac{(u_i-v_i)^2}{v_i}\). With \(\textbf{u}\equiv \mu _*(\textbf{x}_*;\theta ,\alpha )\in \mathbb {R}^M\) and \(\textbf{v}\equiv v_\text {ref}(\textbf{x}_*)\in \mathbb {R}^M\), \(u_i\) denotes an observed regression value obtained using \(K_\alpha \), and \(v_i\) denotes the expected value obtained from the reference scalar field, \(v_\text {ref}\).

In terms of implementation, specifically in relation to \(v_\text {ref}(\textbf{x}_*)\): at each query location \(x_{*}\), the reference value \(v_\text {ref}(x_{*})\) is computed by interpolating the 16 nearest samples from \(T=\{x_i,z_i\}_{i=1}^N\) using inverse distance weights with exponent 3. This choice is guided by what the USGS used for these data (Coombs et al. 2002). For this study, the reference should be seen as synthesized ground truth, rather than an attempt to reconstruct the underlying random function according to some optimality criteria. Hence, \(v_\text {ref}\) only needs to satisfy \(z_i=v_\text {ref}(x_i)\) at the sampled locations and exhibit realistic variation that is meaningful at spatial scales of interest.

For the inverse distance weights, the minimum separating distance is capped at 0.1 min of latitude (about 185 m), which equates to ten times the resolution, or one tenth the spacing, of the inference grid (\(X_*\)). As an overview, Fig. 9 shows the location and value of the samples, \(T=\{(x_i,z_i)\}_{i=1}^N\), for various chemicals. The corresponding reference scalar fields—which are computed from T and evaluated on a regular grid \(X_*\)—are shown in Fig. 10.
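For concreteness, a sketch of the reference-field computation might look as follows; the use of scipy.spatial.cKDTree, the distance floor d_min (expressed in degrees, assuming geographic coordinates) and the function name are implementation assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def idw_reference(X_train, z_train, X_query, k=16, power=3, d_min=0.1 / 60.0):
    """v_ref via inverse distance weighting of the 16 nearest samples."""
    tree = cKDTree(X_train)
    dist, idx = tree.query(X_query, k=k)   # distances and indices, shape (M, k)
    dist = np.maximum(dist, d_min)         # cap the minimum separating distance
    w = 1.0 / dist**power                  # inverse distance weights, exponent 3
    return np.sum(w * z_train[idx], axis=1) / np.sum(w, axis=1)
```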

Although the reference fields might suffice for certain applications that rely solely on visual interpretation, it is worth emphasizing that GP regression also computes the posterior variance and thus provides a mathematical framework for quantifying the uncertainty of a random process, which may be important for quantitative risk assessment. Figure 11 provides a compelling example of this. In Fig. 11a, the curves show the magnitude of the predicted mean and standard deviation (s.d.), ordered by \(\mu _*\) in ascending order. Figure 11b shows that, in this example, the predicted s.d. captures mainly the epistemic uncertainty associated with the sampling. An advantage of using GP is that it allows lower and upper bounds with arbitrary confidence, for example, \(\mu _*-2\sigma _*\) and \(\mu _*+2\sigma _*\), to be computed and used for probabilistic reasoning. This is shown in Fig. 11c, d.

Fig. 9 Northern Great Basin raw data. Panels show chemical values and location of samples. (Left–right) Measured concentration for As, Ce, Na and V

Fig. 10 Northern Great Basin derived data. Panels show the reference scalar field obtained via inverse distance interpolation. (Left–right) As, Ce, Na and V

Fig. 11 GP regression provides a posterior predictive distribution of the standard deviation

4.5 Experiment 2: Validation Using SSIM and NRMSE

The second experiment determines whether there is a consistent relationship between the log-marginal likelihood (LML) and the structural similarity index measure (SSIM), whose properties are well established in the field of image processing. For instance, in Veras and Collins (2019), SSIM is found to be capable of capturing discernible differences that relate to spatial structures and patterns. A pertinent question is whether the \(\alpha \) found by minimizing the negative LML across all hyperparameters (henceforth referred to as the “NLML-optimized” alpha) is optimal in the SSIM sense. In practice, it is possible, first, for the hyperparameter solution to become stuck at a local minimum during gradient descent; second, the LML measure may not treat regression errors in a manner that reflects visual degradation of spatial structures or geochemical features.

To answer this question, we keep all the parameters other than alpha, namely \(\theta \backslash \alpha =\{l_x,l_y,\sigma _f,\nu \}\), fixed for each chemical. GP regression is performed for each \(\alpha \) between \(\alpha _{\min }=0.1\) and \(\alpha _{\max }=1\) in increments of 0.1. The mean SSIM, NRMSE and LML statistics are computed as a function of \(\alpha \). These measures (discrete values parameterized by \(\alpha \)) are interpolated using a spline function, and the \(\alpha \) values corresponding to the peak SSIM and LML are recorded.
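A sketch of this sweep is given below; the argument mean_fields stands in for precomputed GP mean estimates (one per \(\alpha \), with \(\theta \backslash \alpha \) fixed), v_ref is the reference field, and ssim_nrmse_maps reuses the sketch from Sect. 4.3.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def pick_alpha_star(mean_fields, v_ref, alphas):
    """Locate the SSIM-optimal alpha from a discrete sweep.

    mean_fields[i] is the GP mean on X_* computed with alphas[i]
    (ascending) while the remaining hyperparameters are held fixed.
    """
    scores = [np.mean(ssim_nrmse_maps(m, v_ref)[0]) for m in mean_fields]
    spline = CubicSpline(alphas, scores)          # interpolate SSIM(alpha)
    fine = np.linspace(alphas.min(), alphas.max(), 181)
    return fine[np.argmax(spline(fine))]          # alpha at the SSIM peak
```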

4.6 Experiment 3: Establishing Relationships in the Spectral Domain

The third experiment investigates if there is any plausible connection between \(\alpha \) (a property of the augmented stationary kernel \(K_\alpha \)) and spectral properties of the random process (geochemical distribution) in the frequency domain. The details of this are deferred until Sects. 5.3 and 6.1, where these issues are further discussed.

5 Results and Analysis

5.1 Evaluating the Effects of \(\alpha \)

To demonstrate an improvement in GP regression when augmented covariance functions are used, Fig. 12 provides a visual comparison of the posterior mean estimates, \(\textbf{u}\equiv \mu _*(\textbf{x}_*;\theta ,\alpha )\), obtained using the Matérn 3/2 base kernel (\(K_\text {cov}=K_{\alpha =1}\)) and augmented kernel (\(K_\alpha \) with variable \(\alpha \)) for four chemicals: As, Ce, Na and V.

Significant smoothing and blocking artifacts can be seen in V and Na in the top row, which corresponds to the base kernels. Higher spatial fidelity is observed in the bottom row, which corresponds to the augmented kernels. With \(\alpha <1\), spatial structures in the chemical distribution are more faithfully preserved.

Fig. 12 Visual comparison of GP mean obtained using the base kernel \(K_\text {cov}\!=\! K_{\alpha =1}\) (top) and augmented kernel \(K_{\alpha }\) (bottom). Hyperparameters are shown in Appendix A

To objectively analyze the GP regression results, \(\textbf{u}\equiv \mu _*(\textbf{x}_*;\theta ,\alpha )\) are compared with the reference \(\textbf{v}\equiv v_\text {ref}(\textbf{x}_*)\) obtained from the full dataset, as described in Sect. 4.4. Table 2 measures discrepancies between \(\textbf{u}\) and \(\textbf{v}\) using the normalized \(\bar{\chi }^2(\textbf{u},\textbf{v})\) statistic, and similarities between \(\textbf{u}\) and \(\textbf{v}\) using the SSIM. Henceforth, the subscripts “base” and \(\alpha \) indicate whether the results are obtained with the base kernel, \(K_\text {cov}\), or augmented kernel, \(K_\alpha \).

A comparison of \(\text {SSIM}_\text {base}\) versus \(\text {SSIM}_{\alpha }\) shows that the proposed kernel \(K_\alpha \) consistently produces mean estimates that are closer to the reference than those produced by the base kernel \(K_\text {cov}\). Out of 15 chemical distributions, the only exception is Sb. The reason for this will be investigated in the next subsection. Similarly, a comparison of the \(\bar{\chi }^2_\text {base}\) and \(\bar{\chi }^2_{\alpha }\) columns reveals that the base kernel leads to higher spatial distortion. This evidence shows that even if the LML-optimized \(\alpha \) parameters are suboptimal, they still produce higher-quality GP regression results relative to the baseline kernels.

Table 2 Comparison of \(\textbf{u}\equiv \mu _*(\textbf{x}_*;\theta ,\alpha )\) with respect to \(\textbf{v}\equiv v_\text {ref}(\textbf{x}_*)\) for the base kernels and augmented kernels (with variable \(\alpha \))

5.2 Validation Using SSIM and NRMSE

The goal is to establish whether there is a consistent relationship between the log-marginal likelihood (LML) and the structural similarity index (SSIM). To facilitate this, the SSIM(\(\alpha \)), NRMSE(\(\alpha \)) and LML(\(\alpha \)) curves are all plotted as a function of \(\alpha \) following the procedure described in Sect. 4.5. Notation-wise, the alpha value found during GP training—which optimizes all hyperparameters \(\theta \) jointly by maximizing the LML—is denoted as \(\alpha _0\). The alpha values corresponding to the peak of the SSIM and LML curves (obtained by fixing \(\theta \backslash \alpha \)) are denoted as \(\alpha _*\) and \(\alpha _1\), respectively. These alpha values are plotted along the horizontal axis in Fig. 13. Two difference terms are defined, \(\Delta \alpha _{0,*}=\alpha _0\!-\!\alpha _*\) and \(\Delta \alpha _{1,*}=\alpha _1\!-\!\alpha _*\). Similarly, changes in SSIM are defined as \(\Delta S_{0,*}=\text {SSIM}(\alpha _0)\!-\!\text {SSIM}(\alpha _*)\) and \(\Delta \text {S}_{1,*}=\text {SSIM}(\alpha _1)\!-\!\text {SSIM}(\alpha _*)\), respectively. These represent a quality gap along the vertical axis.

Fig. 13 Alpha values obtained: \(\alpha _0\) from optimizing all hyperparameters \(\theta \); \(\alpha _*\) from the SSIM peak; \(\alpha _1\) from the LML peak with \(\theta \backslash \alpha \) fixed

Table 3 Efficacy of SSIM and LML as a spatial fidelity measure for GP regression

Looking at Table 3, the \(\alpha _0\)s—the smoothing parameter that forms part of the hyperparameters \(\theta =[l_x,l_y,\sigma _f,\nu ,\alpha ]\) that maximize the LML—are generally smaller than the optimal value \(\alpha _*\) obtained under the SSIM criterion. The more revealing finding is that even if one performs a line search with the other hyperparameters \(\theta \backslash \alpha \) held constant, the \(\alpha _1\)s that correspond to the peak of the LML curve often disagree with \(\alpha _*\) (see bold figures in the \(\Delta \alpha _{1,*}\) column). This suggests that LML and SSIM target different aspects of the regression errors: LML is arguably less effective at preserving sharp geochemical features, whereas SSIM prioritizes information that encodes spatial structures.

In terms of impact, the results may be dissected into three categories. For Al, As, Ba, Mn and Ni, there are no significant differences between the LML-optimal \(\alpha _1\) and SSIM-optimal \(\alpha _*\). For Be, Ce, Fe and P, \(\left| \Delta \alpha _{1,*}\right| \) is in the range of 0.13 to 0.18; however, as \(\alpha _1\) is relatively close to the plateau of the SSIM curve, its impact on the SSIM score is only moderate, with \(\Delta S_{1,*}\) limited to about \(-0.05\). For Mg, Na, Pb, Sb, V and Zn, the shift \(\left| \Delta \alpha _{1,*}\right| \) is greater than 0.3; it also results in significant structural degradation relative to choosing the SSIM-optimal \(\alpha _*\). These trends can be seen in Fig. 13. Interestingly, both the SSIM(\(\alpha \)) and NRMSE(\(\alpha \)) curves are observed to be convex.

The bottom panels in Fig. 13 show instances where the LML is relatively flat for \(\alpha \) in the [0.1,0.5] range. This indicates that the LML is insensitive to the changes detected by SSIM, which amount to significant visual differences (see Fig. 14). At opposite ends, these \(\alpha \) values dictate whether spatial structures are well preserved (when \(\alpha \approx 0.5\)) or lost (when \(\alpha \approx 0.1\)). To appreciate the improvements possible, some GP regression results obtained with NLML-optimized \(\alpha _0\) are contrasted with those obtained with SSIM-optimal \(\alpha _*\) in Fig. 14. These visual comparisons show that the SSIM-optimal \(\alpha _*\) indeed produces sharper and higher-quality results.

Fig. 14 Visual comparison of GP mean obtained using \(\alpha _0\) (top) and \(\alpha _*\) (bottom)

Indeed, as \(\alpha \) approaches 0.1, the peaks and troughs in the geochemical distribution become less distinct. This may be explained by revisiting the magnitude response of the augmented kernels. Figure 15 illustrates the “leaky” nature of \(\hat{k}_\alpha (\omega )\) as \(\alpha \) is decreased. Due to inadequate suppression of higher frequencies (for \(\omega \gtrsim 0.2\)), the observation noise power injected into the GP predictive mean in the form of a white noise spectrum (see the \(K(X,X)+\nu ^2 I\) term in Eq. 5) is attenuated less than it would be with a larger \(\alpha \). This effectively reduces the signal-to-noise ratio.

The extent of this deterioration depends on the spectral properties of each geochemical distribution or random process. This trade-off is hinted at by the graphs in Fig. 13. Selecting a suitable alpha amounts to balancing two extremes: being too aggressive (\(\alpha \rightarrow 0.1\) admits higher spatial frequencies, perhaps at the cost of noise pollution) and not being aggressive enough (\(\alpha \rightarrow 1\) shrinks the low-frequency passband, which may blur or distort the GP mean estimate).

Fig. 15 Magnitude response \(\left| \hat{k}(\omega )\right| \) of the proposed kernels \(K_\alpha \)

5.3 Quality Maps and Spectral Perspective

The results in this section bring together the concept of structural integrity (SSIM, NRMSE) and notion of smoothness defined in terms of Sobolev space. As a demonstration, the reference geochemistry \(v_\text {ref}(\textbf{x}_*)\) and GP mean function estimates \(\mu _*(\textbf{x}_*;\theta ,\alpha \!=\!1)\) and \(\mu _*(\textbf{x}_*;\theta ,\alpha \!<\!1)\) for Al, Ce, Mn and V are shown in Fig. 16.

Fig. 16 For evaluation, (top) reference geochemistry \(v_\text {ref}(\textbf{x}_*)\) and GP estimates (middle) \(\mu _*(\textbf{x}_*;\theta ,\alpha \!=\!1)\), (bottom) \(\mu _*(\textbf{x}_*;\theta ,\alpha \!<\!1)\) for Al, Ce, Mn and V

Recall from Sect. 4.3 that spatial degradation from over-smoothing can be measured using SSIM (structural similarity) and the NRMSE (normalized root mean square error) metric. Figures 17 and 18 show that the augmented kernels \(K_\alpha \) increase the spatial fidelity of GP mean predictions under the condition that other hyperparameters are fixed. For SSIM, brighter patches indicate local structures are better preserved. Conversely, darker regions in NRMSE indicate lower distortion. These maps provide an objective assessment of spatial quality (local accuracy) for GP regression.

Fig. 17 SSIM (structural similarity) maps comparing GP estimates with the reference. Note: the same intensity scale is used throughout

Fig. 18 NRMSE distortion maps comparing GP estimates with the reference. Again, the same intensity scale is used throughout to highlight relative differences

The second aspect is concerned with reproducing the autocorrelation of the underlying random process, or reducing smoothness misspecification from the viewpoint of functional analysis. The concept of equivalent interpolating kernel (defined in Sect. 2.2) provides the intuition that adjusting \(\theta \), but especially \(\alpha \), changes the rate of decay in the frequency response of the equivalent kernel \(Q_{K_\theta }(\omega )\) in (12). It therefore influences the smoothness of the GP mean function estimate in the Sobolev sense. This is demonstrated in Fig. 19, which shows that the \(\alpha \) value obtained from LML optimization (7), or \(\alpha _*\) if fine-tuned using SSIM, moderates the decay rate of the interpolating kernel in the Fourier domain. This compensates for over-smoothing and minimizes the \(\Phi \) versus \(\Psi \) discrepancies in a global sense. Although the effectiveness of smoothness compensation can be diminished by variability in the estimated hyperparameters, Monte Carlo simulation shows that the SSIM-optimized \(\alpha _*\) values are particularly robust with respect to changes in the length scale parameters. These results are reported in Appendix B.

Fig. 19 Power spectra as described in Sect. 2.2: (top) \(\Phi (\omega )\), \(\Psi _\theta (\omega )\) and \(\Psi _{\theta ,\alpha }(\omega )\), (bottom) squared magnitude response of the equivalent kernels, \(|Q_{K_\theta }(\omega )|^2\) and \(|Q_{K_{\theta }(\alpha )}(\omega )|^2\). The curves in light gray correspond to standard GP regression (\(\alpha \!=\!1\)) which leads to over-smoothing

6 Discussion

The preceding analysis suggests it would be reasonable to hypothesize a spectral connection between the kernel and geochemical distribution (random process) which governs how \(\alpha \) is selected. To explore this further, the concept of spectral flatness (Dubnov 2004) will now be introduced. Spectral flatness \(\gamma _V^2\)—also known as Wiener entropy—represents a ratio of the geometric and arithmetic means of a power spectrum, \(S_V(\omega )\). It measures how fast the power density decays with spatial frequency in the spectral domain. The property \(\gamma _V^2\le 1\) always holds, with equality if and only if \(S_V(\omega )\) is flat.

6.1 Establishing Relationships in the Spectral Domain

In our application, the two-dimensional discrete Fourier transform (DFT) (Taubman and Marcellin 2002) is first computed for the reference scalar field, \(v_\text {ref}(\textbf{x}_*)\in \mathbb {R}^{m_1\times m_2}\), to produce \(V_{k_1,k_2}\in \mathbb {C}^{n\times n}\), where \(n\ge \max \{m_1,m_2\}\). The DFT is computed using the n-point fast Fourier transform (FFT) algorithm, with coefficients given by \(V_{k_1,k_2}=\frac{1}{n}\sum _{p_1=0}^{n-1}\sum _{p_2=0}^{n-1}v_{p_1,p_2}e^{-i\tfrac{2\pi }{n}(p_1 k_1+p_2 k_2)}\) for \(0\le k_1,k_2 < n\). This first step is shown in Fig. 20(left) for two different chemical distributions, Al and Pb. Subsequently, a one-dimensional power density spectrum, \(S_V(\omega _r)=\tfrac{2}{\pi }\int _{0}^{\pi /2} \left| V_{\omega _r,\theta }\right| d\theta \), is obtained by averaging the magnitude response over angles in the first quadrant \(0\le k_1,k_2 < n/2\), with radial frequency \(\omega _r=\Vert \omega \Vert =\sqrt{k_1^2+k_2^2}\). This second step produces the power density spectra shown in Fig. 20(right).
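The two steps can be sketched as follows; the integer radial binning is an illustrative stand-in for the angular averaging described above.

```python
import numpy as np

def radial_power_spectrum(v, n=None):
    """2-D DFT of v_ref followed by angle-averaging to give S_V(omega_r)."""
    n = n or max(v.shape)
    V = np.fft.fft2(v, s=(n, n)) / n                 # DFT coefficients V_{k1,k2}
    mag = np.abs(V[: n // 2, : n // 2])              # first quadrant only
    k1, k2 = np.meshgrid(np.arange(n // 2), np.arange(n // 2), indexing="ij")
    r_bin = np.sqrt(k1**2 + k2**2).astype(int)       # radial frequency bins
    S = (np.bincount(r_bin.ravel(), weights=mag.ravel())
         / np.bincount(r_bin.ravel()))               # mean magnitude per radius
    return S[: n // 2]                               # fully sampled radii only
```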

Fig. 20 Fourier transformation. Spatial distribution of Al and Pb, \(v_\text {ref}\), in (a, b). The corresponding two-dimensional DFT magnitude spectra, \(\left| V_{k_1,k_2}\right| \), are in (c, d). The average power density spectra, \(S_V(\omega _r)\), for all chemicals

It is worth noting that the three cases Pb, Zn and Sb, for which LML optimization failed to produce a sensible \(\alpha _1\) or \(\alpha _0\) (see category 3 in Fig. 14, where the results of \(\alpha _*\) and \(\alpha _0\) diverge), all have relatively large high-frequency content. A common thread is that all three have a “nuggety” chemical distribution with sharp spatial gradients around many clusters. Thus, one would expect them to be characterized by a large spectral flatness value \(\gamma _V^2\), and this is confirmed in Table 4. Concretely, if \(\Omega =\{\omega _r\}_r\) defines the set of sampled frequency points \(\omega _r\), spectral flatness (the Wiener entropy associated with the random process) is computed as

$$\begin{aligned} \gamma _V^2=\frac{\exp \left( \tfrac{1}{|\Omega |}\sum _r \log _e S_V(\omega _r)\right) }{\tfrac{1}{|\Omega |}\sum _r S_V(\omega _r)}. \end{aligned}$$
(26)
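Equation 26 translates directly into a few lines of code; the positivity guard on the spectrum is an illustrative safeguard for the logarithm.

```python
import numpy as np

def spectral_flatness(S):
    """Eq. 26: geometric mean / arithmetic mean of the power spectrum."""
    S = np.asarray(S, dtype=float)
    S = S[S > 0]                       # guard the logarithm
    return np.exp(np.mean(np.log(S))) / np.mean(S)
```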
Table 4 Spectral flatness (Wiener entropy) for different chemical processes

Utilizing the results obtained for both the Matérn and SE augmented kernels, Fig. 21(left) shows that the SSIM-optimal \(\alpha _*\) values are correlated with log(spectral flatness). The correlation coefficient \(\rho (\alpha _*,\log (\gamma _V^2))\) is 0.718. Motivated by the concept of equivalent kernels (Sollich and Williams 2004), we define the \(-20\) dB point as the cutoff frequency \(\omega _c\) (or equivalent bandwidth) of the proposed covariance functions, \(K_{\alpha _*}\). Figure 21(right) shows that this kernel bandwidth (\(\omega _c\), for which \(|\hat{k}_\alpha (\omega _c)|/|\hat{k}_\alpha (0)|\approx e^{-1}\)) is linearly correlated with log(spectral flatness). The correlation coefficient \(\rho (\omega _c(\alpha _*),\log (\gamma _V^2))\) is 0.834.

The implication is that in addition to (1) obtaining \(\alpha _0\) directly by finding hyperparameters \(\theta =\{l_x,l_y,\sigma _f,\nu ,\alpha \}\) that jointly maximize the log-marginal likelihood, and (2) performing a line search with the remaining hyperparameters \(\theta \backslash \alpha \) fixed, finding an \(\alpha _*\) that maximizes the structural similarity index SSIM(\(\alpha \))—this may require a few GP evaluations and interpolation of a convex function—there is a third option. First, spectral flatness (\(\gamma _V^2\)) may be computed on \(v_\text {ref}\) using Eq. 26 for a chemical distribution of interest. Then, \(\omega _c\) may be inferred using the linear relationship between \(\log (\gamma _V^2)\) and kernel bandwidth. Finally, the \(\tilde{\alpha }_\text {sf}\) (here, the subscript “sf” denotes spectral flatness) that generates a kernel \(K_\alpha \) with the closest equivalent bandwidth is chosen as a proxy for \(\alpha _*\). This has a time complexity of \(O(n\log n)\) as opposed to \(O(n^3)\) for GP regression. Thus, explicit SSIM optimization of \(\alpha \) (option 2) may be deferred and performed as a refinement step only if \(\alpha _0\) deviates significantly from \(\tilde{\alpha }_\text {sf}\), say, \(|\alpha _0-\tilde{\alpha }_\text {sf}|> \tau \) for some threshold \(\tau \). When \(\text {SSIM}(\tilde{\alpha }_\text {sf})\gg \text {SSIM}(\alpha _0)\), \(\tilde{\alpha }_\text {sf}\) may be used instead of \(\alpha _0\). Alternatively, two neighboring points, say, \(\text {SSIM}(\tilde{\alpha }_\text {sf}-\delta )\) and \(\text {SSIM}(\tilde{\alpha }_\text {sf}+\delta )\), may be evaluated to find the best \(\alpha \) value in the \(\left[ \tilde{\alpha }_\text {sf}-\delta ,\tilde{\alpha }_\text {sf}+\delta \right] \) range using a spline interpolator.
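A sketch of this third option follows; the fitted coefficients slope and intercept and the precomputed bandwidth_table (candidate \(\alpha \) versus equivalent bandwidth \(\omega _c\)) are assumptions supplied by the user for illustration, not values from this paper.

```python
import numpy as np

def alpha_from_flatness(gamma_sq, slope, intercept, bandwidth_table):
    """Proxy alpha via the linear log(gamma^2)-to-bandwidth relationship.

    bandwidth_table maps candidate alpha values to the equivalent
    bandwidth omega_c of K_alpha (precomputed from |k_hat_alpha|).
    """
    omega_c = slope * np.log(gamma_sq) + intercept     # inferred cutoff frequency
    alphas = np.array(list(bandwidth_table.keys()))
    bands = np.array(list(bandwidth_table.values()))
    return alphas[np.argmin(np.abs(bands - omega_c))]  # closest equivalent bandwidth
```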

Fig. 21 Left: correlation between SSIM-optimal \(\alpha _*\) and spectral flatness, \(\log (\gamma _V^2)\). Right: correlation between \(K_{\alpha _*}\) kernel bandwidth and \(\log (\gamma _V^2)\)

7 Concluding Remarks

A new class of stationary covariance functions (\(K_\alpha \)) with a control parameter \(\alpha \) has been proposed to address the issue of excessive smoothing commonly observed in Gaussian process regression. To satisfy the positive semi-definite requirement, \(\alpha \) intervals were determined to ensure that \(K_\alpha \) remained a valid kernel. Partial derivatives with respect to the kernel hyperparameters were obtained and expressed in terms of the gradient of the base covariance functions. This allowed all the hyperparameters \(\theta \) (including \(\alpha _0\)) to be learned from the data by maximizing the log-marginal likelihood (LML). Using the \(\bar{\chi }^2\) formula and structural similarity index (SSIM), the spatial fidelity of GP mean predictions was evaluated against a reference informed by geochemical samples from the dataset. The performance of the proposed kernel \(K_\alpha \) (with variable \(\alpha \)) was compared with the base kernel \(K_\text {base}\) (with \(\alpha \) fixed at 1). Visual inspection and quantitative analysis both demonstrated consistent improvement from using \(K_{\alpha }\) relative to the baseline. Experiments also revealed that the \(\alpha _1\) values which maximize the LML differed from the optimal value \(\alpha _*\) found by SSIM 65% of the time. Among these, 60% experienced significant quality degradation, whereby the GP regression distortion \(\Delta S_{1,*}\) due to \(\alpha _1\) varied from \(-0.1595\) (lower quartile) to \(-0.3031\) (upper quartile) relative to a perfect SSIM score of 1.

This study described how changes in \(\alpha \) affect the frequency response of the proposed kernel. Setting \(\alpha \) close to 1 shrinks the low-frequency passband, which may blur or distort the GP mean estimate; the extent depends on the spectral properties of the signal. Conversely, setting \(\alpha \) close to 0.1 allows more structural information in the form of higher spatial frequencies to pass through; geochemical features are generally better preserved at the risk of white noise amplification. The optimal setting usually represents a balance between these two extremes. Formally, smoothness was defined in terms of having a finite norm \(\Vert f\Vert _{W_2^k(\mathbb {R}^d)}^2\) for all \(k<m_0\) in the Sobolev space. The results in Sect. 5.3 demonstrated that over-smoothing occurs when the equivalent kernel frequency response, \(|Q_{K_\theta (\alpha )}(\omega )|^2\), decays at a much faster rate than the target function, \(|\hat{f}(\omega )|^2\). From a spectral perspective, the \(K_\alpha \) covariance function thus provides a mechanism for compensating over-smoothing by regulating this rate of decay. The final contribution was demonstrating a linear dependence of \(\alpha _*\) on the log-spectral flatness of the power spectrum. The correlation coefficient for the kernel bandwidth and Wiener entropy of the geochemical process was 0.834. The significance of this result is that an approximate value of \(\alpha _*\) can potentially be found using this relationship at a cost of \(O(n\log n)\).

To the best of the authors’ knowledge, the concepts of structural similarity, Sobolev smoothness and spectral flatness have not previously been used in geostatistics in relation to GP regression on soil samples. From a complementary perspective, the spatial information embedded in geochemical processes may also be analyzed using spatial statistics such as principal coordinates of neighbor matrices (PCNMs) (Borcard et al. 2004) or Moran’s eigenvector maps (MEMs) (Griffith and Peres-Neto 2006). This decomposition allows multiscale spatial variations to be visualized as geographical maps (Dray 2020). For analysis, significant spatial structures may be represented by significant PCNM variables with large eigenvalues. Since the eigenvalues correspond to Moran’s coefficient of spatial autocorrelation, a connection with the local power spectrum is conceivable (Messerschmitt 2006). These relationships, and how they relate to geospatial patterns and kernel bandwidth, may be investigated in future work.