1 Introduction

Cross-validation and jackknifing are established methods for validating statistical models. In a geostatistical context, the model is based either on geostatistical estimation via kriging or on spatial simulation. The standard outputs include scatterplots and correlation coefficients of estimated against true values, and (possibly standardized) estimation errors against estimates, accompanied by error statistics (Webster and Oliver 2007). These are widely used to validate and optimize model parameters or to determine the most suitable model from a set of competing models. In addition, scatterplots of coverage probabilities versus theoretical values may be used to check the quality of the posterior distributions derived from the model (Deutsch 1997; Olea 2012). These approaches were initially formulated for the estimation/simulation of univariate random functions or for cases where a clear primary variable is to be modeled, with one or more covariates which are of secondary importance. However, in the case of fully multivariate data such as directional (van den Boogaart and Schaeben 2002a, 2002b) or compositional data (van den Boogaart and Tolosana-Delgado 2013; Pawlowsky-Glahn and Egozcue 2020), the entire regionalized vector is seen as an entity which needs to be modeled rather than just its component parts. As a consequence, geostatistical estimation and simulation need to be treated as fully multivariate, and any appraisal of the quality of the geostatistical model needs to take this aspect into account. This concerns all statistical results mentioned before: error statistics, correlation between predictions and observations, and accuracy of estimated intervals.

In this contribution, a generalization of accuracy in the sense of Deutsch (1997) is proposed for the multivariate setting. The proposal is analogous to the method described by Olea (2012) for quantifying the quality of the estimated distribution. Of specific interest is the evaluation of the suitability of a geostatistical estimation or simulation model in the compositional framework, namely, where each variable is non-negative, and its values inform of the relative abundance of a certain component forming the system (Tolosana-Delgado et al. 2019).

After this introduction, four further sections follow. In Sect. 2 the fundamentals of compositional data analysis are recalled briefly along with their implications in geostatistics. Section 3 reviews the existing proposals for univariate validation, with methods, diagrams, and statistics commonly used for this task. In Sect. 4 a fully multivariate approach to cross-validation is proposed for vector-valued random functions specified on the example of compositional data, both for cokriging outcomes (Sect. 4.1) and cosimulation (Sect. 4.2). Two case studies are presented in Sect. 5 illustrating these two sets of techniques. Conclusions are provided in Sect. 6.

2 Regionalized Compositions and Their Geostatistical Treatment

A regionalized composition is a set \(\{\mathbf{z}\left(u\right)=\left[{z}_{1}\left(u\right), \dots ,{z}_{D}\left(u\right)\right]: {z}_{k}\left(u\right)\ge 0, k=1,\dots ,D;{\sum }_{k=1}^{D}{z}_{k}\left(u\right)=c, u\in \mathcal{A}\}\) of compositional data defined on some study region \(\mathcal{A}\), where \(u\in \mathcal{A}\) denotes a location in \(\mathcal{A}\), and \(c\) is an arbitrary but fixed constant. To avoid the problems arising from the constant-sum constraint and the non-negativity of the components, compositions are usually transformed prior to any statistical or geostatistical treatment. Several logratio transformations are commonly used, including the centered (clr; Aitchison 1986), additive (alr; Aitchison 1986) and isometric (ilr; Egozcue et al. 2003) logratio transforms. The choice of logratio transformation does not, however, affect the final results, because the geostatistical techniques discussed here are affine-equivariant (Filzmoser and Hron 2008; Tolosana-Delgado et al. 2019). Affine equivariance implies that \(\mathbf{m}\left(\mathbf{Z}B\right)=\mathbf{m}(\mathbf{Z})B\) and

$$S\left(\mathbf{Z}B\right)={B}^{T}S\left(\mathbf{Z}\right)B,$$
(1)

where \(\mathbf{m}\) denotes the mean, \(S\) a covariance matrix, and \(B\) a linear transformation. In spite of this equivariance, it is mathematically convenient to use the ilr transformation in the geostatistical workflow (Pawlowsky-Glahn and Egozcue 2020). The corresponding regionalized vector of ilr-transformed variables will be denoted by \(\{{\varvec{\upzeta}}\left(u\right)=\left[{\zeta }_{1}\left(u\right), \dots ,{\zeta }_{D-1}\left(u\right)\right]: u\in \mathcal{A}\}\). The image space of the ilr transformation is \((D-1)\)-dimensional Euclidean space.

The standard workflow is then known as the principle of working in coordinates (a minimal code sketch is given after the steps below):

  1. Transform the regionalized composition to logratios using a suitably chosen ilr transformation (Egozcue et al. 2003; Tolosana-Delgado and Mueller 2021), \({\varvec{\zeta}}\left(u\right)={\ln}\mathbf{z}\left(u\right)V\), where \(V\) is a \(D\times (D-1)\) matrix with \({V}^{T}V={I}_{D-1}\) and \(V{V}^{T}={I}_{D}-\frac{1}{D}{1}_{D\times D}\).

  2. Apply the geostatistical technique to the logratios.

  3. Backtransform the geostatistical estimate or realization of the logratio scores, \({{\varvec{\zeta}}}^{\boldsymbol{*}}\left(u\right)\), to the compositional space via \({{\varvec{z}}}^{\boldsymbol{*}}\left(u\right)=\mathcal{C}\left({\exp}\left({{\varvec{\zeta}}}^{\boldsymbol{*}}\left(u\right){V}^{T}\right)\right)\), where \(\mathcal{C}(\cdot )\) denotes the closure operation defined as

     $$\mathcal{C}\left({\varvec{x}}\right)=\frac{c}{{\varvec{x}}{1}_{D}^{T}}{\varvec{x}}.$$
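
As an illustration of this workflow, the following minimal Python sketch implements steps 1 and 3 with numpy. The Helmert-type construction of \(V\) is just one possible choice satisfying the conditions of step 1, and all function names are ours, purely for illustration.

```python
import numpy as np

def ilr_basis(D):
    """One possible contrast matrix V (D x (D-1)) with V.T @ V = I and V @ V.T = I - (1/D) 1."""
    V = np.zeros((D, D - 1))
    for j in range(1, D):
        V[:j, j - 1] = 1.0 / np.sqrt(j * (j + 1))
        V[j, j - 1] = -j / np.sqrt(j * (j + 1))
    return V

def ilr(z, V):
    """Step 1: forward transform, zeta = ln(z) V, for compositions stored as rows of z."""
    return np.log(z) @ V

def ilr_inv(zeta, V, c=1.0):
    """Step 3: back-transform the scores and close the result to the constant sum c."""
    x = np.exp(zeta @ V.T)
    return c * x / x.sum(axis=1, keepdims=True)

# toy usage: a 4-part composition closed to 100
z = np.array([[10.0, 20.0, 30.0, 40.0]])
V = ilr_basis(z.shape[1])
zeta = ilr(z, V)                    # step 2 (the geostatistical model) would operate on zeta
print(ilr_inv(zeta, V, c=100.0))    # recovers [[10., 20., 30., 40.]]
```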

The Aitchison geometry version of the Mahalanobis distance is of fundamental importance for the methods introduced in this paper. With a covariance matrix \(S\) as defined above, the (square) Mahalanobis distance between two compositions in the Aitchison geometry is

$${d}_{AM}^{2}\left({\mathbf{z}}_{\alpha },{\mathbf{z}}_{\beta } |S\right)=\left[{\mathrm{ilr}}\left({\mathbf{z}}_{\alpha }\right)-{\mathrm{ilr}}\left({\mathbf{z}}_{\beta }\right)\right]{S}^{-1}{\left[{\mathrm{ilr}}\left({\mathbf{z}}_{\alpha }\right)-{\mathrm{ilr}}\left({\mathbf{z}}_{\beta }\right)\right]}^{T},$$
(2)

analogous to the (square) Aitchison distance

$${d}_{A}^{2}\left({\mathbf{z}}_{\alpha },{\mathbf{z}}_{\beta }\right)=\left[{\mathrm{ilr}}\left({\mathbf{z}}_{\alpha }\right)-{\mathrm{ilr}}\left({\mathbf{z}}_{\beta }\right)\right]{\left[{\mathrm{ilr}}\left({\mathbf{z}}_{\alpha }\right)-{\mathrm{ilr}}\left({\mathbf{z}}_{\beta }\right)\right]}^{T}={d}^{2}\left({\mathrm{ilr}}\left({\mathbf{z}}_{\alpha }\right), {\mathrm{ilr}}\left({\mathbf{z}}_{\beta }\right)\right).$$
(3)

Note that, although \(d_{AM}\) is more conveniently defined in terms of the ilr-transformed scores, it is an affine-equivariant quantity, hence intrinsic to the composition and the covariance \(S\), and not dependent on the actual logratio transformation being used. One must merely represent \(S\) in the same transformation as used for the composition, according to Eq. (1). This is not true of the Aitchison distance (Eq. 3), which holds only in terms of the ilr or clr transformations. With the Aitchison–Mahalanobis distance, one can define the additive logistic normal distribution (\(ALN\)) as the probability model with density proportional to

$${f}_{Z}\left(\mathbf{z}|\mathbf{m},S\right)\propto {\left(\prod_{i=1}^{D}{z}_{i}\right)}^{-1}{\mathrm{det}\left(S\right)}^{-1/2}{\exp}\left(-\frac{{d}_{AM}^{2}\left(\mathbf{z}, \mathbf{m}|S\right)}{2}\right).$$
(4)

This probability density function is, as required, also affine-equivariant, owing to the determinant being one of the invariants of \(S\).
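
For reference in what follows, both distances can be computed directly from their definitions. The sketch below is a minimal Python illustration (names are ours); it reuses the illustrative basis matrix `V` from the previous listing and assumes that \(S\) is given in the same ilr representation.

```python
import numpy as np

def aitchison_dist2(za, zb, V):
    """Squared Aitchison distance (Eq. 3) between two compositions za and zb."""
    diff = (np.log(za) - np.log(zb)) @ V          # ilr(za) - ilr(zb)
    return float(diff @ diff)

def aitchison_mahalanobis2(za, zb, S, V):
    """Squared Aitchison-Mahalanobis distance (Eq. 2); S is a (D-1)x(D-1) ilr covariance."""
    diff = (np.log(za) - np.log(zb)) @ V
    return float(diff @ np.linalg.solve(S, diff))
```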

3 Cross-Validation, Accuracy and Precision in the Univariate Case

Leave-one-out cross-validation is a well-established tool used to assess a given geostatistical model and to determine the best model from a set of competing models. It is based on kriging: given a sample data set, at each sampled location \(u\) a kriging estimate \({z}_{K}^{*}\left(u\right)\) and corresponding error variance \({\sigma }_{K}^{2*}\left(u\right)\) are derived by removing that location from the set and estimating the value of the variable of interest from the neighboring data. In jackknifing, a separate validation set is assumed to be available, and the sample data are used to provide estimates or simulated values at the jackknife locations. A more general cross-validation approach, known as \(n\)-fold cross-validation, is to partition the data set into \(n\) disjoint subsets and then apply jackknifing to each subset based on the remaining sample data. Independently of the validation method actually used, one finally has a paired list of observed and estimated values, together with their kriging variances.

If the true value at \(u\) is \(z(u)\), then the error is given by \(e\left(u\right)={z\left(u\right)-z}_{K}^{*}\left(u\right)\), and the squared deviation ratio is defined as

$$\mathrm{sdr}\left(u\right)= \frac{{\left(z(u){-z}_{K}^{*}\left(u\right)\right)}^{2}}{{\sigma }_{K}^{2*}\left(u\right)}=\frac{{e}^{2}(u)}{{\sigma }_{K}^{2*}\left(u\right)}.$$

Averaging these quantities over all sample locations results, respectively, in a mean error (ME) and a mean squared deviation ratio (MSDR). If the geostatistical estimator is adequately specified, the ME and the associated mean square error (MSE) should be close to 0 and the MSDR close to 1 (Webster and Oliver 2007). Typical diagnostic plots are a scatterplot of the true values against the estimates, a histogram of the standardized errors, and a scatterplot of the standardized errors against the estimates. Analogously to the linear model, one expects a tight scatter between estimates and true values, a symmetric histogram of standardized errors with mean close to 0 and variance close to 1 (as a weakened version of standardized normality), and no correlation between standardized errors and estimates (Chilès and Delfiner 2012; Webster and Oliver 2007). Cross-validation is routinely performed in geostatistical practice and used for the appraisal of the variogram and trend model, the local neighborhood, and the parameters associated with the simulation method applied.
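
As a minimal illustration, the three summary statistics can be computed from the paired cross-validation output as follows; the array names `z_true`, `z_est` and `kvar` (the kriging variances) are hypothetical placeholders.

```python
import numpy as np

def univariate_cv_stats(z_true, z_est, kvar):
    """ME, MSE and MSDR from paired cross-validation output; kvar holds the kriging variances."""
    e = z_true - z_est                       # errors e(u)
    return {"ME": e.mean(),                  # should be close to 0
            "MSE": (e ** 2).mean(),          # should be close to 0
            "MSDR": (e ** 2 / kvar).mean()}  # should be close to 1
```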

Performance measures in addition to ME, MSE and MSDR concern the quality of the local posterior distributions. Their assessment was first discussed in the context of univariate geostatistical simulation by Deutsch (1997) and is based on the coverage of the local distributions. If the local distribution is \({F}_{u}\) with mean \(\mu \left(u\right)\) and standard deviation \(\sigma (u)\), then the indicator function

$$i\left(u,p\right)=\left\{\begin{array}{cc}1& {\mathrm{if}}\; z\left(u\right)\in \left[{F}_{u}^{-1}\left(\frac{1-p}{2}\right),{F}_{u}^{-1}\left(\frac{1+p}{2}\right) \right]\\ 0& {\mathrm{otherwise}}\end{array}\right.$$
(5)

defined for \(p\in \left({0,1}\right]\) allows measurement of the closeness of the true value to the local mean in the context of a symmetric target distribution. If the simulation algorithm is Gaussian, then the parameters of the local distribution are derived via kriging; otherwise, the local distribution is inferred via the generation of a family of simulated values at the sample location leaving the actual value out. The coverage \(\pi (p)\) is set equal to the average of \(i\left(u,p\right)\) over all available locations \(u\).

To determine the accuracy of the model, a further indicator variable \(a\left(p\right)\) is introduced which is set to 1 if the proportion \(\pi (p)\) of locations falling into the \(p\)-interval exceeds \(p\), and to 0 otherwise. The integral \(A={\int }_{0}^{1}a\left(p\right)dp\) then provides a measure of accuracy. That is, \(A\) measures the proportion of exact or over-pessimistic confidence intervals around the estimates. A useful means for appraising the accuracy of the model is a plot of \(\pi (p)\) against \(p\). An accurate model will result in a plot where the pairs of points \((p,\pi \left(p\right))\) fall above the bisector line. Two measures of precision are defined in Deutsch (1997), one restricted to pairs of points \(\left(p,\pi \left(p\right)\right)\) falling above the bisector line, called precision and defined as \(P=1-2{\int }_{0}^{1}a\left(p\right)\cdot \left(\pi \left(p\right)-p\right)dp\), and the other, called goodness, given by \(G=1-{\int }_{0}^{1}\left(3a\left(p\right)-2\right)\left(\pi \left(p\right)-p\right)dp\). For any value of \(p\) for which the point \((p,\pi \left(p\right))\) lies below the bisector, the departure of \(\pi \left(p\right)\) from \(p\) is penalized in this definition, and high values of \(G\) correspond to precise models. Thus, an accurate and precise model has values of \(A\), \(P\) and \(G\) close to 1. It is important to note that for high accuracy, it suffices that the actual coverage is larger than the nominal one (i.e., \(p<\pi \left(p\right)\)), while precision and goodness also reward proximity to the bisector (i.e., \(|p-\pi \left(p\right)|\) small). The precision measure \(P\) only makes sense in the case of accurate models, while goodness \(G\) is more generally useful.
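
Given the coverage \(\pi(p)\) evaluated on a grid of confidence levels, the three statistics follow directly from their definitions. The sketch below is a minimal numpy illustration with a hand-rolled trapezoidal rule; the function names are ours.

```python
import numpy as np

def _trapz(y, x):
    """Trapezoidal approximation of the integral of y over x."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def accuracy_precision_goodness(p, pi):
    """Deutsch (1997) statistics from a coverage curve pi(p) on a grid p of confidence levels."""
    a = (pi > p).astype(float)                       # a(p): actual coverage exceeds nominal
    A = _trapz(a, p)                                 # accuracy
    P = 1.0 - 2.0 * _trapz(a * (pi - p), p)          # precision (meaningful for accurate models)
    G = 1.0 - _trapz((3.0 * a - 2.0) * (pi - p), p)  # goodness
    return A, P, G
```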

An alternative approach for assessing the quality of the local posterior distributions was introduced by Olea (2012). In his approach, the symmetric \(p\)-interval used in Eq. (5) is replaced by the unilateral interval \(\left(-\infty , {F}_{u}^{-1}\left(p\right)\right).\) For each \(p\in \left({0,1}\right)\) and each \(u\), an indicator variable is defined by putting

$${i}_{O}\left(u,p\right)=\left\{\begin{array}{cc}1& {\mathrm{if}}\; {F}_{u}^{-1}\left(p\right)>z\left(u\right)\\ 0& {\mathrm{otherwise}}\end{array}\right..$$

For each \(p\), the empirical probability \({p}^{*}(p)\) is set equal to the average of \(i_{O}\left(u,p\right)\) over all available locations \(u\) and plotted against \(p\). The modeling is optimal if the pairs \((p, {p}^{*}\left(p\right))\) fall on the bisector line, which according to Olea (2012) indicates “perfect global agreement between the modeling of uncertainty and the limited amount of information provided by the sample.” In practice, however, there will be deviations from the bisector, which can be quantified via the maximum absolute deviation and the sum of absolute deviations between the empirical and theoretical probabilities. It should be noted that the functions \(\pi (\bullet )\) and \({p}^{*}(\bullet )\) are related by \(\pi \left(1-\alpha \right)={p}^{*}\left(1-\frac{\alpha }{2}\right)-{p}^{*}\left(\frac{\alpha }{2}\right)\).
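
A sketch of Olea's one-sided check, under the simplifying (and here assumed) choice of Gaussian local distributions \(N({z}_{K}^{*}(u),{\sigma }_{K}^{2*}(u))\) and reusing the hypothetical arrays of the previous sketch, could read:

```python
import numpy as np
from scipy.stats import norm

def olea_pstar(z_true, z_est, kvar, p_grid):
    """Empirical probability p*(p): fraction of locations with F_u^{-1}(p) > z(u),
    here with Gaussian local distributions as a simplifying assumption."""
    q = norm.ppf(p_grid[:, None], loc=z_est[None, :], scale=np.sqrt(kvar)[None, :])
    return (q > z_true[None, :]).mean(axis=1)        # one value of p* per entry of p_grid
```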

4 The Compositional Case

Here, the implementation depends on whether or not cokriging is applied to derive the parameters of the local distribution. If cokriging is used, it is assumed that the composition follows an additive logistic normal distribution as expressed by Eq. (4), at least locally, that is, conditionally on the estimated composition. As usual in geostatistics, this assumption cannot be formally tested, owing to the presence of spatial dependence, and is to be considered a modeling choice. If conditional additive logistic normality is deemed inappropriate, one should instead use either multipoint methods or some form of multivariate transformation to normality (e.g., Barnett et al. 2014; van den Boogaart et al. 2017; Sepulveda et al., under review) followed by Gaussian cosimulation, in which case Sect. 4.2 applies.

4.1 Cross-Validation, Accuracy and Precision via Cokriging

For compositional data sets, the implementation of the cross-validation procedure via cokriging is straightforward, even if not implemented in most software packages (Tolosana-Delgado and Mueller 2021). At a location to be cross-validated, the entire compositional vector is removed and the surrounding compositions are used to estimate the composition at the sample location via cokriging. As the cokriging is performed in logratio coordinates, the error measures are also calculated in terms of logratios. The mean error is given by

$$\mathbf{M}\mathbf{E}=\frac{1}{N}\sum_{\alpha =1}^{N}\left({\varvec{\upzeta}}\left({u}_{\alpha }\right)-{{\varvec{\upzeta}}}_{CK}^{*}\left({u}_{\alpha }\right)\right)=\frac{1}{N}{\sum }_{\alpha =1}^{N}{\ln}[{\mathbf{z}}\left({u}_{\alpha }\right)/{\mathbf{z}}_{CK}^{*}\left({u}_{\alpha }\right)]V,$$

and the associated mean square error is \({\mathrm{MSE}}=\frac{1}{N}\sum_{\alpha =1}^{N}{\Vert {\varvec{\upzeta}}\left({u}_{\alpha }\right)-{{\varvec{\upzeta}}}_{CK}^{*}\left({u}_{\alpha }\right)\Vert }^{2},\) which corresponds to the average squared Aitchison distance (Eq. 3) between estimates and observations. There are two ways to generalize the mean square deviation ratio to a multivariate quantity (Tolosana-Delgado and Mueller 2021), namely

$$\begin{aligned} {\mathrm{MSDR}}_{1} & = \frac{1}{N}\sum_{\alpha =1}^{N}\left({\varvec{\upzeta}}\left({u}_{\alpha }\right)-{{\varvec{\upzeta}}}_{CK}^{*}\left({u}_{\alpha }\right)\right){\Sigma }_{CK}^{-1}\left({u}_{\alpha }\right){\left({\varvec{\upzeta}}\left({u}_{\alpha }\right)-{{\varvec{\upzeta}}}_{CK}^{*}\left({u}_{\alpha }\right)\right)}^{T} \\ & = \frac{1}{N}\sum_{\alpha =1}^{N}{\ln}\left(\mathbf{z}\left({u}_{\alpha }\right)/{\mathbf{z}}_{CK}^{*}\left({u}_{\alpha }\right)\right)V{\Sigma }_{CK}^{-1}\left({u}_{\alpha }\right){V}^{T}{{\ln}\left(\mathbf{z}\left({u}_{\alpha }\right)/{\mathbf{z}}_{CK}^{*}\left({u}_{\alpha }\right)\right)}^{T} \end{aligned}$$
(6)

and

$${\mathrm{MSDR}}_{2}=\frac{1}{N(D-1)}\sum_{\alpha =1}^{N}\sum_{i=1}^{D-1}{\left({\zeta }_{i}\left({u}_{\alpha }\right)-{\zeta }_{CK,i}^{*}\left({u}_{\alpha }\right)\right)}^{2}/{\sigma }_{ii}^{2}({u}_{\alpha }).$$
(7)

In the equations above, \(\mathbf{z}\left({u}_{\alpha }\right),\boldsymbol{ }{\varvec{\zeta}}\left({u}_{\alpha }\right), {{\varvec{\zeta}}}_{CK}^{*}\left({u}_{\alpha }\right)\) and \({\mathbf{z}}_{CK}^{*}\left({u}_{\alpha }\right)\) denote the true compositional vector, its logratio image, the logratio estimate and the corresponding backtransform \({\text{at location }} {u}_{\alpha } .\) The expression \({\ln}\left(\mathbf{z}\left({u}_{\alpha }\right)/{\mathbf{z}}_{CK}^{*}\left({u}_{\alpha }\right)\right)\) is an abbreviation of \(\left[{\ln}\left({z}_{1}\left({u}_{\alpha }\right)/{z}_{CK,1}^{*}\left({u}_{\alpha }\right)\right),\dots ,{\ln}\left({z}_{D}\left({u}_{\alpha }\right)/{z}_{CK,D}^{*}\left({u}_{\alpha }\right)\right)\right]\). The matrix \({\Sigma }_{CK}\left({u}_{\alpha }\right)\) denotes the cokriging error variance–covariance matrix at \({\text{location }} {u}_{\alpha }\), and \({\sigma }_{ii}^{2}({u}_{\alpha })\) are its diagonal elements.

Only the mean error \(\mathbf{M}\mathbf{E}\) is a vectorial quantity. All other quantities (\({\mathrm{MSE, MSDR}}_{1}\) and \({\mathrm{MSDR}}_{2}\)) are scalars. The measure \({\mathrm{MSDR}}_{1}\) is nothing other than the average of the square Aitchison–Mahalanobis distances (Eq. 2) \({d}_{AM}^{2}(\mathbf{z}\left({u}_{\alpha }\right) ,{\mathbf{z}}_{CK}^{*}\left({u}_{\alpha }\right)|{\Sigma }_{CK}\left({u}_{\alpha }\right))\) between the true and estimated compositions with respect to the cokriging error variance–covariance matrix. Its target value is equal to \(D-1\). Moreover, under the hypothesis of additive logistic normality of the \(D\)-component compositional random function, the square Aitchison–Mahalanobis distance follows a \({\chi }^{2}(D-1)\) distribution. The version of the MSDR in Eq. (7) is the average of the univariate MSDR values over the individual logratio components. In contrast to \({\mathrm{MSDR}}_{1}\), this measure does not have the equivariance property.
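
As a minimal sketch, both statistics can be computed from the cross-validated ilr scores and the stack of cokriging error covariance matrices; all array names are illustrative.

```python
import numpy as np

def compositional_msdr(zeta_true, zeta_est, Sigma_ck):
    """MSDR_1 (Eq. 6) and MSDR_2 (Eq. 7) from ilr-space cross-validation output.

    zeta_true, zeta_est : (N, D-1) arrays of true and cokriged ilr scores
    Sigma_ck            : (N, D-1, D-1) stack of cokriging error covariance matrices
    """
    err = zeta_true - zeta_est                                # (N, D-1)
    d2 = np.einsum("ni,nij,nj->n", err, np.linalg.inv(Sigma_ck), err)
    msdr1 = d2.mean()                                         # target value: D-1
    sig2 = np.diagonal(Sigma_ck, axis1=1, axis2=2)            # error variances sigma_ii^2
    msdr2 = (err ** 2 / sig2).mean()                          # target value: 1
    return msdr1, msdr2, d2                                   # d2 is reused for the qq-plot below
```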

The relationship between the square Aitchison–Mahalanobis distances and the \({\chi }^{2}\) distribution gives rise to a diagnostic tool that may be used in place of the histogram of standardized errors of the univariate case. This is a qq-plot of the observed quantiles of the square Aitchison–Mahalanobis distances against the quantiles of the \({\chi }^{2}\) distribution with \(D-1\) degrees of freedom. As in the univariate case (where the comparison is against the standard normal distribution), one expects the qq-plot to be close to the bisector. Even if multivariate additive logistic normality does not hold locally for the compositional random function, this diagram still provides a means to rank competing geostatistical parameter setups (mostly variogram models or kriging neighborhoods) according to how closely they approximate this distributional assumption, exactly as in the univariate case.
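
A sketch of this diagnostic, reusing the squared distances `d2` returned by the previous listing (matplotlib and scipy assumed available), could be:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2

def chi2_qqplot(d2, dof):
    """QQ-plot of observed squared Aitchison-Mahalanobis distances against chi2(dof)."""
    d2_sorted = np.sort(d2)
    probs = (np.arange(1, d2.size + 1) - 0.5) / d2.size       # plotting positions
    theo = chi2.ppf(probs, df=dof)
    plt.plot(theo, d2_sorted, "k.", label="observed quantiles")
    plt.plot([0, theo.max()], [0, theo.max()], "r--", label="bisector")
    plt.xlabel(f"chi-square({dof}) quantiles")
    plt.ylabel("observed squared distances")
    plt.legend()
    plt.show()
```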

Other common diagnostic plots are scatterplots of the individual components against their estimates, or of the estimation errors versus the estimates. Compositional versions of these plots are formed by the set of scatterplots between pairwise logratios of estimates against pairwise logratios of true values, and a set of scatterplots of logratio estimation errors against logratio estimates (Tolosana-Delgado and Mueller 2021); both can be conveniently presented in a (\(D\times D\)) matrix of scatterplots.

Compositional analogues of accuracy and goodness described in Sect. 3 are also based on the square Aitchison–Mahalanobis distance of estimates and true values with respect to the estimation error covariance. The definition of the indicator variables required for the calculation of coverage is subject to a certain arbitrariness, because there is no natural ordering of vectorial quantities. In general, any one-dimensional summary of the random composition can be used to generate coverage indicators, as long as the probability distribution of this target quantity is known. A reasonable requirement is for these quantities to be affine-equivariant. The square Mahalanobis distance (Eq. 2) arises naturally as the best option: for each location \(u\) and each \(p\in ({0,1}]\), the indicator function \({i}_{AM}\left(u,p\right)\) is defined as

$${i}_{AM}\left(u,p\right)=\left\{\begin{array}{cc}1& {\mathrm{if}}\; {{\chi }^{2}(d}_{AM}^{2}\left(\mathbf{z}\left(u\right) ,{\mathbf{z}}_{CK}^{*}\left({u}\right)|{\Sigma }_{CK}\left(u\right)\right),D-1)\le p\\ 0& {\mathrm{otherwise}}\end{array}\right.,$$

analogously to Olea’s (2012) proposal, where \({\chi }^{2}(\cdot ,D-1)\) denotes the cumulative distribution function of the \({\chi }^{2}\) distribution with \(D-1\) degrees of freedom. As in the univariate case, the coverage \({\pi }_{AM}(p)\) is defined as the average over all sample locations, and the indicator variable \(a(p)\) is equal to 1 if \({\pi }_{AM}\left(p\right)>p\) and 0 otherwise. The metrics \(A\), \(P\) and \(G\) then have the same definitions as previously, and an accurate and precise model has values of \(A,\) \(P\) and \(G\) close to 1. Other one-dimensional summaries can also be of use: the univariate measures discussed in Sect. 3 may be computed for each relevant logratio variable, each of them being a univariate summary of the random composition. Even the original variables could be considered appropriate univariate summaries in this sense, if one were ready to obtain the confidence intervals in Eq. (5) by means of Hermite quadrature, as explained in Pawlowsky-Glahn and Olea (2004).
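
In code, the coverage curve amounts to evaluating the \(\chi^2\) cumulative distribution function at each squared distance; \(A\), \(P\) and \(G\) then follow from the same routine sketched in Sect. 3. A minimal illustration (scipy assumed):

```python
import numpy as np
from scipy.stats import chi2

def compositional_coverage(d2, dof, p_grid):
    """Coverage pi_AM(p): fraction of locations whose chi2 cdf value of d2 is at most p."""
    u = chi2.cdf(d2, df=dof)                                   # one cdf value per location
    return (u[None, :] <= p_grid[:, None]).mean(axis=1)        # pi_AM on the grid p_grid

# accuracy_precision_goodness(p_grid, compositional_coverage(d2, D - 1, p_grid))
# then yields A, P and G for the compositional model
```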

4.2 Cross-Validation, Accuracy and Precision via Simulation

When cross-validation or jackknifing needs to be based on simulation at the sample locations, the definitions of the errors and also the local distributions are based on the simulation results. As before, the errors are in the first instance calculated in logratio coordinates. For \(L\) realizations \(\left\{{{\varvec{\upzeta}}}^{\ell}({u}_{\alpha })|\ell=1,\dots ,L, \alpha =1,\dots ,N\right\}\), we define the local mean

$$\overline{{\varvec{\zeta}} }({u}_{\alpha })=\frac{1}{L}{\sum }_{\ell=1}^{L}{{\varvec{\zeta}}}^{\ell}({u}_{\alpha })$$

and the local covariance as

$$\widehat{\Sigma }\left({u}_{\alpha }\right)=\frac{1}{L-1}{\sum }_{\ell=1}^{L}{\left({{\varvec{\zeta}}}^{\ell}\left({u}_{\alpha }\right)-\overline{{\varvec{\zeta}} }({u}_{\alpha })\right)}^{T}\left({{\varvec{\zeta}}}^{\ell}\left({u}_{\alpha }\right)-\overline{{\varvec{\zeta}} }({u}_{\alpha })\right).$$

Then, analogously to the cokriging case, one has the mean error given as

$$\mathbf{M}\mathbf{E}=\frac{1}{N}\sum_{\alpha =1}^{N}\left({\varvec{\zeta}}\left({u}_{\alpha }\right)-\overline{{\varvec{\zeta}} }\left({u}_{\alpha }\right)\right)$$

and

$${\mathrm{MSDR}}_{1,sim}=\frac{1}{N}\sum_{\alpha =1}^{N}\left({\varvec{\upzeta}}\left({u}_{\alpha }\right)-\overline{{\varvec{\zeta}} }\left({u}_{\alpha }\right)\right){\widehat{\Sigma }}^{-1}\left({u}_{\alpha }\right){\left({\varvec{\upzeta}}\left({u}_{\alpha }\right)-\overline{{\varvec{\upzeta}} }\left({u}_{\alpha }\right)\right)}^{T},$$
$${\mathrm{MSDR}}_{2,sim}=\frac{1}{N(D-1)}\sum_{\alpha =1}^{N}\sum_{i=1}^{D-1}\frac{{\left({\zeta }_{i}\left({u}_{\alpha }\right)-{\overline{\zeta }}_{i}\left({u}_{\alpha }\right)\right)}^{2}}{{\widehat{\Sigma }}_{ii}\left({u}_{\alpha }\right)}.$$
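
A minimal sketch of the local statistics at a single location, given an \((L, D-1)\) array of simulated ilr scores, is given below; the divisor \(L-1\) is chosen to match the Hotelling target discussed next, and the function name is ours.

```python
import numpy as np

def simulation_local_stats(zeta_sims):
    """Local mean and covariance from L realizations of the ilr scores at one location."""
    mean = zeta_sims.mean(axis=0)
    eps = zeta_sims - mean                                   # residuals of each realization
    cov = eps.T @ eps / (zeta_sims.shape[0] - 1)             # divisor L-1 (sample covariance)
    return mean, cov
```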

In contrast to Eq. (6) where the target value was \((D-1)\), the value of \({\mathrm{MSDR}}_{1,sim}\) for a good model should be slightly greater than this quantity, owing to the fact that \({\widehat{\Sigma }}^{-1}\left({u}_{\alpha }\right)\) is estimated from a set of \(L\) realizations. A reasonable target quantity can be derived from the expected value of \({\mathrm{MSDR}}_{1,sim}\) under the assumption that \({\varvec{\upzeta}}\left({u}_{\alpha }\right)\) is normally distributed conditionally on \(\overline{{\varvec{\zeta}} }\left({u}_{\alpha }\right)\), which produces a Hotelling’s \({T}^{2}\) distribution for the Mahalanobis distance, with parameters \((D-1)\) and \((L-1)\). This gives an expected value of

$$E\left[{T}_{\left(D-1, L-1\right)}^{2}\right]=\frac{(D-1)(L-1)}{L-D+1}E\left[{F}_{\left(D-1, L-D+1\right)}\right]=\frac{\left(D-1\right)\left(L-1\right)}{L-D+1}\times \frac{L-D+1}{L-D-1}=\left(D-1\right)\frac{L-1}{L-D-1},$$

thanks to the equivalence between Hotelling’s \({T}^{2}\) and Fisher \(F\)-distributions and the fact that the expected value of a Fisher \({F}_{(p,q)}\)-distributed variate is \(q/(q-2)\).
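
As a quick numerical check of this target, consider the following hypothetical helper:

```python
def msdr1_sim_target(D, L):
    """Expected MSDR_1,sim under conditional normality: (D-1)(L-1)/(L-D-1)."""
    return (D - 1) * (L - 1) / (L - D - 1)

# e.g., D = 4 parts and L = 100 realizations give about 3.13, slightly above D - 1 = 3
print(msdr1_sim_target(4, 100))
```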

For deriving the coverage indicators needed to calculate accuracy and derived quantities, we can follow the same ideas as in Sect. 4.1: select a univariate summary of the composition, compute that statistic for the true value, and generate its probability distribution from the realizations. Again, in general terms, it makes sense to require this summary statistic to be affine-equivariant. The indicator variable defining the position of the true compositional vector within the local distribution can then be based on the square Aitchison–Mahalanobis distance: the distance \({d}_{0}\left({u}_{\alpha }\right)\) between the true composition and the local mean is compared with the empirical distribution of the distances \({\{d}_{\ell}\left({u}_{\alpha }\right):\ell={1,2},\dots , L\}\) between the simulated compositions and the local mean at \({u}_{\alpha }\)

$${d}_{0}\left({u}_{\alpha }\right)=\left({\varvec{\upzeta}}\left({u}_{\alpha }\right)-\overline{{\varvec{\zeta}} }\left({u}_{\alpha }\right)\right)\widehat{\Sigma }{\left({u}_{\alpha }\right)}^{-1}{\left({\varvec{\upzeta}}\left({u}_{\alpha }\right)-\overline{{\varvec{\zeta}} }\left({u}_{\alpha }\right)\right)}^{T},$$
$${d}_{\ell}\left({u}_{\alpha }\right)=\left({{\varvec{\epsilon}}}^{\ell}\left({u}_{\alpha }\right)\right)\widehat{\Sigma }{\left({u}_{\alpha }\right)}^{-1}{\left({{\varvec{\epsilon}}}^{\ell}\left({u}_{\alpha }\right)\right)}^{T},$$

where \({{\varvec{\epsilon}}}^{\ell}\left({u}_{\alpha }\right)={{\varvec{\upzeta}}}^{\ell}\left({u}_{\alpha }\right)-\overline{{\varvec{\zeta}} }\left({u}_{\alpha }\right)\) for each \(\ell=1, 2, \dots ,{L}\).

The values \(\left\{{d}_{\ell}\left({u}_{\alpha }\right)|\ell=1,\dots ,L\right\}\) are arranged in ascending order \(\left\{{\widehat{d}}_{\ell}\left({u}_{\alpha }\right)|\ell=1,\dots ,L\right\}\), and

$$i\left({u}_{\alpha },p\right)=\left\{\begin{array}{cc}1& {\mathrm{if }}\; {d}_{0}\left({u}_{\alpha }\right)\le {\widehat{d}}_{\lceil pL\rceil}\left({u}_{\alpha }\right)\\ 0& {\mathrm{otherwise}}\end{array}\right.,$$

where \(\lceil pL\rceil\) denotes the smallest integer greater than or equal to \(pL\). The statistics \(A,\) \(P\) and \(G\) are then defined as in Sect. 4.1. This construction mimics that of Deutsch (1997) for the univariate case, where the simulated values are used to derive a local distribution and the position of the true value is determined relative to it. Note that, as in the cokriging case, other one-dimensional summaries may also be meaningful for specific applications: the simulation approach outlined here is particularly useful in such cases, because the probability distribution of the target summary can always be derived from the set of realizations.
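
Putting the pieces together, a sketch of the simulation-based coverage curve (reusing simulation_local_stats from the previous listing; all names are ours) could read:

```python
import numpy as np

def simulation_coverage(zeta_true, zeta_sims, p_grid):
    """Coverage curve from simulated local distributions.

    zeta_true : (N, D-1) true ilr scores at the validation locations
    zeta_sims : (N, L, D-1) simulated ilr scores at the same locations
    """
    N, L, _ = zeta_sims.shape
    cover = np.zeros((p_grid.size, N))
    for a in range(N):
        mean, cov = simulation_local_stats(zeta_sims[a])
        prec = np.linalg.inv(cov)
        e0 = zeta_true[a] - mean
        d0 = e0 @ prec @ e0                                    # distance of the true value
        eps = zeta_sims[a] - mean
        d = np.sort(np.einsum("li,ij,lj->l", eps, prec, eps))  # ordered simulated distances
        ranks = np.minimum(np.ceil(p_grid * L).astype(int), L) - 1   # position ceil(pL)
        cover[:, a] = d0 <= d[ranks]
    return cover.mean(axis=1)        # pi(p); A, P and G follow as in the earlier sketches
```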

5 Illustration

In what follows, two applications are provided. The first demonstrates cross-validation for cokriging of a regionalized subcomposition of the Tellus data (Young and Donald 2013), while the second concerns cross-validation and accuracy assessment in the simulation-based case, applied to the modeling of the structure of a tailings storage facility.

5.1 The Tellus Data Set

The composition considered in this example (Fig. 1) consists of the components MgO, Al2O3, CaO and Fe2O3 from a sample of the Tellus soil horizon A data (Young and Donald 2013; Tolosana-Delgado and Mueller 2021).

Fig. 1 Sample data for the Tellus subcomposition case study; blue indicates low values and red high

The subcomposition was closed through the inclusion of an additional component called Rest and transformed to logratios via the default ilr transform, as described in Tolosana-Delgado et al. (2019). The ilr variables exhibit geometric anisotropy with direction of greatest continuity N135; a linear model of coregionalization comprising a nugget and an exponential structure with a range of 35 km and an anisotropy ratio of 0.4 was fitted. For the sake of comparison, an isotropic version of this model was also considered, with a range of 26.9 km. Tenfold cross-validation via ordinary cokriging with a moving neighborhood (search radius 60 km, minimum number of samples 7, maximum number of samples 20) was applied with both models.

A summary of the performance measures in Table 1 indicates that the anisotropic model is superior to the isotropic one. This is also supported by the graphs of coverage against confidence level in Fig. 2. Additionally, the actual coverage of the compositional model is greater than theoretically expected for the majority of the confidence levels, as evidenced by the value of \(A\) compared to those of the individual components. The accuracy plots in Fig. 2 provide further insight into the behavior of coverage versus confidence for the components and the entire composition. For confidence levels (\(p\)) up to 0.7, the coverage is generally closer to the chosen confidence level for the individual components than for the model in its entirety (although the overall accuracy of the compositional model is superior to those of the constituent parts); for greater confidence levels, this behavior is no longer observed. This example thus provides support for our claim that evaluation of the performance of a geostatistical model for compositional data should not be based on the performance of the model on the individual components only. Figure 2 also shows that the model ignoring anisotropy appears to have higher accuracy, but at the price of constructing much wider \(p\)-intervals. In the accuracy plot, this is shown through the much greater deviation from the bisector. Accounting for anisotropy notably increases the model precision, as evidenced by the higher values of \(P=0.79\) and \(G=0.88\) in the anisotropic case compared to \(P=0.47\) and \(G=0.74\) for the isotropic model.

Table 1 Values of A, P and G from ordinary cokriging of the Tellus sample subcomposition obtained with the anisotropic model versus isotropic model
Fig. 2 Accuracy plot derived from tenfold ordinary cokriging cross-validation of the Tellus subcomposition sample, with an isotropic model (omni) and with an anisotropic one (globally, and for each one of the ilr variables based on the anisotropic LMC)

Table 2 gives a summary of the compositional error measures for both the isotropic and anisotropic models. The vector of ilr mean errors (ME) is very close to zero for both models, with a mean square error (MSE) of roughly 0.4 in each case. The difference between the two models is apparent in the MSDR measures, where it is clear that the isotropic model overestimates the spread. MSDR2 clearly shows that, on average over all logratios, the anisotropic model meets the target of MSDR equal to 1.

Table 2 Compositional error measures obtained by cross-validation with isotropic and anisotropic models

This subcomposition was also studied using direct sampling (Mariethoz et al. 2010), with a validation subset of 100 samples and the rest used as conditioning data and for training image generation, as reported in Tolosana-Delgado and Mueller (2021; Ch. 10–11). To complement the discussion in these chapters, the accuracy metrics in the original units (raw data) are shown in Table 3. The results show overly optimistic metrics in general terms, indicating a potential lack of variability in the training image, with generally high goodness values for the individual variables in contrast to a large range of accuracy values (from a low of 0.2 for Fe2O3 to a maximum of 0.95 for MgO). The corresponding compositional metrics have high accuracy but low precision and moderate goodness (Table 3).

Table 3 Values of A, P and G from direct sampling of the subcomposition of interest for the validation sample in the Tellus data set

5.2 The Tailings Storage Facility Case Study

The second example illustrates the usage of the simulation-based approach of Sect. 4.2, by means of a case study using direct sampling (Mariethoz et al. 2010) with a very complex setup, far from additive logistic normality. Here the aim is to model distributions of particle types in a multiple-stream tailings storage facility (Selia et al. in prep). Stratigraphic forward modeling is used for training image generation, and direct sampling for data fusion with the measured data. A multipoint method was considered most appropriate because of the strong non-linearity and non-stationarity of the patterns typical of these anthropogenic sedimentary systems. Briefly, the study considers four particle classes, according to size and dominant mineralogy, forming a four-part mass composition: sand-sized silicates (V1), clay-sized silicates (V2), sand-sized sulfides (V3) and clay-sized sulfides (V4). The forward model used is Delft3D-FLOW (Lesser et al. 2004), an open-source, process-based stratigraphic forward modeling software accounting for diffusion, advection in both bed load and suspended load, erosion and compaction in both subaerial and subaquatic environments. The parameters of the forward simulation are described in Table 4 (left). Boundary conditions are designed to mimic the behavior of tailings dams, allowing water to seep while retaining sediments. Results were cropped to the basin without the upstream channels and upscaled to \(18\times 21\times 21\) voxels to form the ground truth, which, for this study, is also taken to be the training image (Fig. 3). Synthetic boreholes were randomly taken at 36 locations (Fig. 4). Of the resulting 677 samples, 100 were randomly picked for leave-one-out cross-validation.

Table 4 Parameter values for the stratigraphic forward simulation (left) and direct sampling (DS) simulation (right)
Fig. 3 Forward simulated four-component system as training image for the tailings storage case study

Fig. 4 Synthetic sample data for the tailings storage case study

To predict each of these 100 samples, direct sampling was applied excluding that sample, with the parameters specified in Table 4 (right) and using the Aitchison distance (Eq. 3) as the measure of proximity between the composition at each training image pixel and the data set. The resulting simulations were used to calculate the accuracy and goodness as defined in Sect. 4.2.

As can be seen from the numerical (Table 5) and graphical results (Fig. 5), the simulations show inaccurate but moderately good results: the accuracy curves for both the individual ilr coefficients and the global Aitchison–Mahalanobis summary are systematically below, but close to, the reference line, particularly for theoretical coverage levels below 0.20. This indicates that the confidence intervals tend to be too small to deliver the coverage promised by their nominal confidence. Correspondingly, the numerical values of accuracy \(A\) are close to zero, and precision is meaningless (hence not reported in the tables). In this situation, the goodness \(G\) becomes useful: \(G\) is above 0.81 for all three ilr variables, and the global goodness is the highest (0.89).

Table 5 Values of A and G from simulation-based leave-one-out cross-validation of the direct sampling example (logratio coefficients; global: Aitchison–Mahalanobis distance summary)
Fig. 5 Accuracy plot derived from simulation-based leave-one-out cross-validation of the direct sampling example (logratio coefficients; global: Aitchison–Mahalanobis distance summary)

The availability of simulations allows an evaluation of the accuracy in terms of the original four components. Results are reported in Table 6 and Fig. 6, and do not qualitatively diverge from the logratio-based results: the confidence intervals are too narrow, so that the accuracy values \(A\) are not useful. However, goodness values are high (above 0.83), suggesting that the model is acceptable in both representations, raw and logratio.

Table 6 Values of A and G from simulation-based leave-one-out cross-validation of the direct sampling example (raw data)
Fig. 6 Accuracy plot derived from simulation-based leave-one-out cross-validation of the direct sampling example (raw data)

6 Conclusions

We have provided an extension of quantitative measures of goodness of fit to the multivariate context, particularly for compositional data, which allows ranking of models analogously to existing univariate approaches. In addition to generalizations of mean errors to vectorial quantities, the average Mahalanobis distance between estimates and observations (MSDR1) provides an additional decision tool for choosing between competing models. This measure explicitly accounts for the multivariate structure, is affine-equivariant and shifts the focus from the individual components to the entire composition. Alternatively, an average mean square deviation ratio over all logratios (MSDR2) can also be used, although this quantity is not affine-equivariant, that is, it depends on which specific logratios are used to compute the statistic. In principle, the two measures focus on different aspects and could result in different rankings of models, although both lead to the choice of the same model in the case studies provided here. The joint accuracy and precision measures introduced here can provide further insights into the system inasmuch as they evaluate the global structure, and may rank competing models differently than when the evaluation is based on the marginals alone.

The Mahalanobis distance also provides a reasonable one-dimensional summary of the multivariate distribution, from which a cumulative distribution and, with it, measures of accuracy, precision and goodness after Deutsch (1997) can be derived. Accuracy cannot be evaluated in isolation from goodness or precision measures, in either the univariate or the multivariate case. This is particularly evident in the metrics for the isotropic versus anisotropic linear models of coregionalization of the Tellus subcomposition, where the accuracy is marginally higher for the isotropic model, but goodness and precision provide a clearer contrast for choosing the better model. In particular, the goodness metric is important here, as it accounts for both accurate and inaccurate cases. In the tailings storage facility case study, the true values overall fell too far from the center of the local distributions, resulting in an accuracy value of 0. Nevertheless, examination of the actual versus theoretical coverage shows reasonable goodness, with the plot for the global metric quite close to the bisector. The tools could thus still be used for ranking competing models, such as training images or method parameter setups, had several of them been available.