A Variance Partitioning Multi-level Model for Forest Inventory Data with a Fixed Plot Design

Forest inventories are often carried out with a particular design, consisting of a multi-level structure of observation plots spread over a larger domain and a fixed plot design of exact observation locations within these plots. Consequently, the resulting data are collected intensively within plots of equal size but with much less intensity at larger spatial scales. The resulting data are likely to be spatially correlated both within and between plots, with spatial effects extending over two different areas. However, a Gaussian process model with a standard covariance structure is generally unable to capture dependence at both fine and coarse scales of variation, let alone their interaction. In this paper, we develop a computationally feasible multi-level spatial model that accounts for dependence at multiple scales. We use a data-driven approach to determine the weight of each spatial process in the model to partition the variability of the measurements. We use simulated and German small tree inventory data to evaluate the model's performance. Supplementary material to this paper is provided online.


INTRODUCTION
Designs for collecting spatially oriented data in agricultural, biological, or environmental research often entail multi-level structures where data are collected at very different intensities in different parts of the domain. As an example that will serve as the application of interest later on in this paper, consider the design illustrated in Fig. 1, which arose from a large-scale project on biodiversity research in Germany (BIOKLIM Project). As a part of this project, information on forest cover of European blueberry in the Bavarian Forest National Park in Germany was collected for a number of plots distributed over a large spatial range (left panel of Fig. 1) along altitude gradients, where within each plot intensive data collection takes place at locations that are distributed in exactly the same setup around the centroid location of the plot (magnified in the right panel). As a consequence, the actual data are organized according to a specific multi-level structure representing two different spatial scales, with large distances between the plots and much smaller distances between the actual locations of observations within the plots. Similar designs are very common, especially in forestry, where forest inventories are probably the most accurate source of data but are sparsely collected due to the high financial costs involved (see, for example, Bässler et al. 2010; Junttila et al. 2013; Finley et al. 2009, 2011).
Considering the analysis of forest cover as a dependent variable within a regression scenario, it is most likely not sufficient to relate it to measured environmental factors alone. It is therefore common practice to add a spatially correlated effect, e.g., based on a Gaussian process, to account for unexplained spatial dependence in the data. However, due to the specific design of the data collection, standard covariance functions are unlikely to be flexible enough to represent the multi-level structure. More specifically, they cannot simultaneously capture spatial variation between and within plots, as spatial variation occurs at completely different spatial scales. Previous work on multi-resolution models relates to our objective of modeling data available at different spatial scales (Nychka et al. 2015; Katzfuss 2017), but we can cast our problem into a more straightforward multi-level model framework that directly addresses the data structure of forest inventories.
The goal of this paper is to develop efficient Bayesian inference with Markov chain Monte Carlo (MCMC) for spatial regression models acknowledging the different spatial scales arising from the fixed plot design. More specifically, we aim to (1) adequately account for unobserved spatial variation at different scales, (2) allow for interactions between effects at different scales, (3) obtain appropriate uncertainty estimates for the regression effects in the model, (4) obtain predictions at new locations within the observed plots as well as for new plots, and (5) use efficient ways of handling Gaussian process models to make inference tractable.
In our data, there are two different spatial scales to consider: the coarser plot level, where only plot centroids are considered, and the finer within-plot level, which corresponds to the area around the centroids (see again Fig. 1), where the circles (i.e., plots) are assumed to be replicates of each other. This can easily be extended to include more scales, for example if the plots themselves are organized into clusters. We allow any two scales to interact by using Kronecker products of the dependence structures on the two scales. This follows ideas developed in Knorr-Held (2000) and Franco-Villoria et al. (2022) for interactions in space-time models. However, here we extend the concept to the case of two spatial effects at different scales, i.e., a space-space interaction.
For inference, we follow a Bayesian approach based on MCMC simulations. To improve the computational efficiency of the MCMC sampler, we exploit a technique developed in Stegle et al. (2011), which has rarely been used in the spatial statistics literature. This technique allows for efficient inference in matrix-variate Gaussian models with i.i.d. observation noise by rotating the data prior to evaluating the multivariate normal likelihood. The resulting (marginalized) likelihood has a diagonal covariance that is easier to factorize than a dense covariance. Indeed, although one can exploit the Kronecker product for computational efficiency in spatial models, this is generally not possible in models that additionally include i.i.d. observation noise in the marginalized multivariate normal likelihood. Thus, this technique has benefits that extend beyond our space-space interactions to any other interactions (such as space-time), and it speeds up inference with MCMC in any matrix-variate Gaussian model with i.i.d. observation noise.
Finally, our model incorporates a data-driven variance partitioning approach to determine the contribution of each spatial structure (within plots, between plots, interaction) and the nugget to the model, thus avoiding the need to postulate the presence or absence of an effect a priori. This also helps to stabilize inference in situations where certain effects are absent, and it improves the interpretability of the model.
The paper is organized as follows: In Sect. 2, we introduce our novel model more formally. In Sect. 3, details on inference are provided, while in Sect. 4, we explain how predictions are obtained. A simulation study is provided in Sect. 5. Finally, in Sect. 6, we consider a German inventory of European blueberries that exemplifies the usefulness of allowing for interaction between effects on different spatial scales.

FIXED PLOT DESIGNS WITH DIFFERENT SCALES
We consider regression data collected on a spatial domain S ⊂ R². Within S, data are only available at m equally sized areas/plots, S_i ⊂ S, i = 1, …, m, represented for example by the coordinates s_i of their centroids. We assume that each plot has the same number of observations y(s_ij), j = 1, …, n, located at the same positions relative to the centroid of the plot (see Fig. 1 for a graphical representation of such a structure; an additional example is provided in Supplement 6), which is, in fact, a prevalent structure in forest inventories. More precisely, let s_ij denote the location associated with observation y(s_ij); then for all i, k ∈ {1, …, m} and all j ∈ {1, …, n}, the equality s_ij − s_i = s_kj − s_k holds. We refer to such designs as fixed plot designs with different scales.
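To make the design concrete, the following sketch (with hypothetical values for m, n, and the within-plot radius, not taken from the BIOKLIM data) generates m random centroids and places the same n offsets around each of them, then verifies the defining property s_ij − s_i = s_kj − s_k:

```python
import numpy as np

# Hypothetical fixed plot design: m plot centroids on [0, 1]^2 and n
# observation locations arranged identically around every centroid,
# so that the offset s_ij - s_i is the same for all plots i.
rng = np.random.default_rng(seed=1)
m, n, radius = 30, 8, 0.01

centroids = rng.uniform(0.0, 1.0, size=(m, 2))          # s_i
angles = 2 * np.pi * np.arange(n) / n
offsets = radius * np.column_stack([np.cos(angles),     # s_ij - s_i
                                    np.sin(angles)])

# All mn observation locations: s_ij = s_i + (s_ij - s_i)
locations = centroids[:, None, :] + offsets[None, :, :]  # shape (m, n, 2)

# Defining property of a fixed plot design: relative positions within a
# plot are identical across plots.
assert np.allclose(locations[0] - centroids[0], locations[5] - centroids[5])
```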

A SPATIAL REGRESSION MODEL FOR FIXED PLOT DESIGNS
To incorporate spatial variation in a regression model for fixed plot designs, we consider the model equation

y(s_ij) = x(s_ij)'β + γ_b(s_i) + γ_w(s_ij − s_i) + γ_int(s_i, s_ij − s_i) + ε_ij,   (1)

where y(s_ij) and x(s_ij) represent information on the response variable and the q-dimensional vector of covariates, respectively, β are corresponding regression coefficients, and ε_ij is an i.i.d. error term. The overall spatial variation is represented by the sum of three spatial effects that correspond to the spatial variation between plots on the large spatial scale (γ_b(s_i) being a function of the centroid locations alone), spatial variation within the plots (γ_w(s_ij − s_i) being a function of the distance to the centroid alone), and their potential interaction (γ_int(s_i, s_ij − s_i) being a function of both sources of spatial information). In this way, the overall spatial dependence implied by the composed spatial process γ(s_i, s_ij − s_i) can be much more complex than the spatial dependence of each of the individual components. The idea is to first account for fine-scale spatial structure within the plots via γ_w(s_ij − s_i). Since this structure does not account for additional large-scale spatial correlation between plots, we superpose the spatial effect γ_b(s_i). The superposition of spatial effects allows us to explain both fine- and large-scale spatial dependence without resorting to more complex and computationally intensive non-stationary spatial models (see, e.g., Lindgren et al. 2011; Nychka et al. 2015). Finally, any remaining interactions between and within plots are accounted for by an additional spatial process γ_int(s_i, s_ij − s_i). More details on the structure of each spatial effect are provided in Sect. 2.3.

VARIANCE PARTITIONING PRIORS
Rather than assigning independent priors to the different quantities in model (1), we distribute the variance between spatial effects in a variance partitioning multi-level model (VPMM) specified as

y(s_ij) = x(s_ij)'β + τ(√a_b γ_b(s_i) + √a_w γ_w(s_ij − s_i) + √a_int γ_int(s_i, s_ij − s_i) + √a_ε ε_ij),   (2)

where τ > 0 represents the overall variation, while the weights 0 ≤ a_b, a_w, a_int, a_ε ≤ 1, subject to a_b + a_w + a_int + a_ε = 1, distribute this variation across the four sources of variability (see Fuglstad et al. 2020; Franco-Villoria et al. 2022, for similar variance partitioning specifications). One can think of the weight vector a as implying a joint prior for the nugget effect ε_ij and the three spatial effects. Using a joint prior here makes sense because (1) the main and interaction spatial effects in Eq. (2) are typically not independent and (2) for small spatial ranges between and within areas, some components of the main effect in Eq. (2) will approximately behave like the nugget. Moreover, from the standpoint of interpretability, the relative contribution of each effect is easier to interpret and the resulting prior is more intuitive to elicit. Assuming that data are organized according to the multi-level structure, Eq. (2) can be rewritten in matrix notation as

y = Xβ + τ(√a_b Z γ_b + √a_w γ_w + √a_int γ_int + √a_ε ε),   (3)

with the vector of observations y, the design matrix X, the block-diagonal matrix Z = blockdiag(1_n, …, 1_n), and the vector of residuals ε ∼ N(0, I) appropriately defined (e.g., ε = (ε_11, …, ε_1n, ε_21, …, ε_mn)' and similar definitions for the other quantities).
For the different components in the VPMM, we now make more specific distributional assumptions, where zero mean Gaussian random fields (GRFs) will be considered for all spatial effects, besides ε ∼ N(0, I). More concretely, the GRF γ_w = (γ_w(s_11 − s_1), …, γ_w(s_1n − s_1), γ_w(s_21 − s_2), …, γ_w(s_mn − s_m))' describing the spatial variation within each plot is a priori assumed to not be correlated between areas, such that

γ_w ∼ N(0, I_m ⊗ R_w),   (4)

where R_w is the correlation matrix of size n × n based on the positive-definite exponential covariance function Cor(s_ij − s_i, s_il − s_i) = exp(−κ_w ‖s_ij − s_il‖) for i = 1, …, m and j, l = 1, …, n, where κ_w is related to the spatial range ρ_w of the GRF within the plot (see Chapter 2 in Gelfand et al. 2010). The spatial range is defined as the minimum distance at which the spatial correlation between locations is smaller than or equal to 0.05. Note that in the evaluation of the correlation function, the location of the plot centroid cancels out, such that only relative distances within a plot play a role.
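As a minimal numerical sketch (the offsets and range value are hypothetical, not the paper's data), the exponential correlation matrix can be built as follows; with κ = 3/ρ, the correlation at distance ρ equals exp(−3) ≈ 0.05, matching the range definition above:

```python
import numpy as np

def exp_corr(coords, rho):
    """Exponential correlation matrix with range rho (kappa = 3 / rho),
    so that the correlation drops to exp(-3) ~ 0.05 at distance rho."""
    kappa = 3.0 / rho
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return np.exp(-kappa * dist)

# Offsets s_ij - s_i for n = 4 locations on a small circle (toy values)
offsets = 0.35 * np.array([[1, 0], [0, 1], [-1, 0], [0, -1]], dtype=float)
R_w = exp_corr(offsets, rho=0.7)

assert np.allclose(np.diag(R_w), 1.0)        # unit diagonal
assert np.all(np.linalg.eigvalsh(R_w) > 0)   # positive definite
```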
The GRF γ_b = (γ_b(s_1), …, γ_b(s_m))' acts as a random intercept for area S_i with

γ_b ∼ N(0, R_b),   (5)

where R_b is a correlation matrix of size m × m based on the positive-definite exponential covariance function Cor(s_i, s_k) = exp(−κ_b ‖s_i − s_k‖), where κ_b is related to the spatial range ρ_b of the GRF between plots. Lastly, the interaction term is specified as

γ_int ∼ N(0, R_b ⊗ R_w).   (6)

The covariance R_b ⊗ R_w is positive definite since it results from the Kronecker product of two positive-definite matrices (see Theorem 9 in Horn and Johnson 2012). The Kronecker product represents the interaction between the two spatial scales, as it assumes that the spatial dependence structure within each plot depends on the spatial dependence pattern between all plots. More concretely, it accounts for additional correlation among observations from different plots that are close to each other relative to the plots' origin. Such interactions make sense in designs in which the environmental conditions (e.g., soil type) change identically in space within each plot or when an external factor, like wind from one direction or fences, affects all plots in the same manner. In the application, we consider plots located in line transects along an altitude gradient, such that the same locations in different plots have similar inclination and exposition (Bässler et al. 2010).
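The structure of R_b ⊗ R_w and the cross-plot correlation it induces can be illustrated with toy correlation matrices (all numerical values hypothetical):

```python
import numpy as np

# Sketch of the interaction covariance R_b ⊗ R_w for tiny m = 3 plots and
# n = 2 within-plot locations (toy correlation matrices).
R_b = np.array([[1.0, 0.4, 0.1],
                [0.4, 1.0, 0.4],
                [0.1, 0.4, 1.0]])
R_w = np.array([[1.0, 0.3],
                [0.3, 1.0]])

R_int = np.kron(R_b, R_w)   # (mn) x (mn) interaction structure

# Positive definiteness is inherited from the factors.
assert np.all(np.linalg.eigvalsh(R_int) > 0)

# The same within-plot location j in two different plots i, k is correlated
# with strength R_b[i, k]: here location j = 0 in plots 0 and 1.
assert np.isclose(R_int[0, 2], R_b[0, 1] * R_w[0, 0])
```

The last assertion reflects the statement above: under the interaction effect, identical within-plot locations in different plots are correlated according to the between-plot correlation of those plots.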
In the following, we denote γ_b(s_i) + γ_w(s_ij − s_i) the spatial main effects and γ_int(s_i, s_ij − s_i) the spatial interaction effect. For a_int = 0, the VPMM implies the correlation structure

Cor(y(s_ij), y(s_kl)) = a_b Cor_b(s_i, s_k) + a_w 1(i = k) Cor_w(s_ij − s_i, s_kl − s_k) + a_ε 1(i = k, j = l).   (7)

Thus, for observations in the same plot, we always have within-correlation, but if the plots are different this correlation is zero. If both the plots and the locations within an area are different, we still have between-plot correlation.
The spatial interaction effect implies the pointwise correlation structure Cor(γ_int(s_i, s_ij − s_i), γ_int(s_k, s_kl − s_k)) = Cor_b(s_i, s_k) Cor_w(s_ij − s_i, s_kl − s_k). For a_int > 0, this adds the term a_int Cor_b(s_i, s_k) Cor_w(s_ij − s_i, s_kl − s_k) to every case in Eq. (7).

RELATION TO OTHER DESIGNS
In space-time contexts, one can follow a similar method to the one above. For example, in the case of one spatial resolution and one time resolution, one can adapt Eq. (2) to

y(s_i, t_j) = x(s_i, t_j)'β + τ(√a_s γ_s(s_i) + √a_t γ_t(t_j) + √a_int γ_int(s_i, t_j) + √a_ε ε_ij),

where i = 1, …, m indexes the plots, j = 1, …, n is the time index, and (s_i, t_j) ∈ R² × R for all i, j. Moreover, in matrix notation (as introduced in the previous section), the interaction term follows γ_int ∼ N(0, R_s ⊗ R_t), where R_s is a spatial correlation matrix and R_t is a temporal correlation matrix. The novelty in a space-time context is that the computational trick that we introduce in Sect. 3.2 can also be used here to reduce the run-time complexity of factorizing the covariance function of the associated (partly marginalized) likelihood to O(m³ + n³).
We assign a joint Dirichlet prior with parameters α_1, …, α_4 > 0 to the weights a. For notational simplicity, we replace here (a_b, a_w, a_int, a_ε) by (a_1, …, a_4), such that

p(a) = (1/B(α)) ∏_{p=1}^{4} a_p^{α_p − 1},   a ∈ Δ,

where B(•) is the multivariate beta function and Δ is the 3-simplex. If any of the weights is 0 or 1, then the density is 0. We set α_1 = α_2 = α_3 = α_4 = 1, such that the prior is uniform and represents no preference for any of the random effects. Furthermore, as we do not sample a directly but instead sample on the equipotent R³, we need to perform a change of variables, defined element-wise through so-called break proportions c_p in a stick-breaking fashion. For the parameters κ_b, κ_w, we sample the logarithmic counterparts θ_b = log(κ_b) and θ_w = log(κ_w). The densities are changed accordingly. In what follows, we describe the prior structure for κ_b, but the same logic applies to κ_w. We assume a normally distributed prior θ_b ∼ N(μ_κb, σ²_κb). Then, given that for the exponential correlation function the spatial range satisfies ρ_b ≈ 3/κ_b, the properties of the log-normal distribution yield log(ρ_b) ∼ N(log(3) − μ_κb, σ²_κb). The p-quantiles of the log-normal distribution for the correlation range are

ρ_b(p) = exp(log(3) − μ_κb + σ_κb Φ^{−1}(p)),   (8)

where 0 ≤ p ≤ 1 and Φ(•) is the cumulative distribution function of the standard normal distribution (see Ingebrigtsen et al. 2015, for a similar method). To choose priors, we specify two quantiles of the prior for ρ_b. In our case, we focus on the median and the 0.95-quantile and then solve the corresponding two equations. We illustrate the prior's behavior in Fig. 2, which is based on the settings used in the simulation study and part of the real data application.
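The quantile-matching step can be sketched numerically. Assuming the relation ρ ≈ 3/κ and a normal prior on θ = log(κ) as above, the following illustrative code (not the paper's implementation) solves for μ and σ from a target median and 0.95-quantile of the range:

```python
import numpy as np
from scipy.stats import norm

def match_range_prior(rho_median, rho_q95):
    """Solve for the N(mu, sigma^2) prior on theta = log(kappa) such that
    the implied log-normal prior on the range rho = 3 / kappa has the
    given median and 0.95-quantile (sketch of the quantile-matching idea)."""
    # log(rho) = log(3) - theta ~ N(log(3) - mu, sigma^2)
    z95 = norm.ppf(0.95)
    sigma = (np.log(rho_q95) - np.log(rho_median)) / z95
    mu = np.log(3.0) - np.log(rho_median)
    return mu, sigma

# Quantile targets used in the simulation study: rho(0.5) = 0.5, rho(0.95) = 1
mu, sigma = match_range_prior(0.5, 1.0)

# Check: the implied quantiles of rho reproduce the targets.
log_rho_mean = np.log(3.0) - mu
assert np.isclose(np.exp(log_rho_mean), 0.5)
assert np.isclose(np.exp(log_rho_mean + sigma * norm.ppf(0.95)), 1.0)
```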

EFFICIENT INFERENCE
This section introduces the technique of Stegle et al. (2011) in the context of our model in order to reduce computational complexity. Consider the marginalized likelihood following Eq. (3),

y | ϑ ∼ N(Xβ, Σ), with Σ = τ²(a_b Z R_b Z' + (a_w I_m + a_int R_b) ⊗ R_w + a_ε I_mn).   (9)

By integrating out the GRF in a spatial regression model, we typically achieve faster convergence in MCMC samplers (Finley et al. 2015). However, the cost of factorizing the covariance in Eq. (9) is cubic in mn.
By instead considering the likelihood with the between main effect γ_b unmarginalized, we can exploit the structure of a_ε I_mn + (a_w I_m + a_int R_b) ⊗ R_w to reduce computational complexity using a technique introduced in Stegle et al. (2011). With γ_b not marginalized, we obtain

y | ϑ, γ_b ∼ N(Xβ + τ√a_b Z γ_b, τ²(a_ε I_mn + (a_w I_m + a_int R_b) ⊗ R_w)).   (10)

The evaluation of this multivariate normal distribution requires the calculation of the determinant and inverse of the covariance, which is an mn × mn matrix, at a cost of O(m³n³). These tasks can be accomplished more efficiently by further exploiting the properties of the Kronecker product.
Consider Y ∈ R^{n×m} with n rows and m columns. We define vec(Y) = y to be the vector obtained by concatenating the columns of Y. Writing K_b = a_w I_m + a_int R_b with eigendecompositions K_b = U_b S_b U_b' and R_w = U_w S_w U_w', a Kronecker product plus a constant diagonal term can be rewritten as

a_ε I_mn + K_b ⊗ R_w = (U_b ⊗ U_w)(a_ε I_mn + S_b ⊗ S_w)(U_b ⊗ U_w)',

where S_b and S_w are the diagonal matrices of eigenvalues. Hence, we can re-formulate the likelihood L in Eq. (10) in terms of the rotated residuals vec(U_w' Y U_b), where vec(Y) = y − Xβ − τ√a_b Z γ_b. This can now be interpreted as a multivariate normal distribution with diagonal covariance matrix τ²(a_ε I_mn + S_b ⊗ S_w) and rotated data vec(U_w' Y U_b) (Stegle et al. 2011).
The factorization of the diagonal covariance matrix implies a lower run-time complexity than that of its dense counterpart. Moreover, although we need to calculate two eigenvalue decompositions, these are performed on the smaller matrices, reducing costs to O(m³) and O(n³), respectively. These two operations can additionally be parallelized. Ultimately, without parallelization, this reformulation has a computational complexity of O(n³ + m³), rather than the O(n³m³) of a global spatial model; in the scenarios we are interested in, i.e., n ≥ 2 and m ≥ 2, n³ + m³ < n³m³ is guaranteed.
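A small numerical check of the rotation trick (an illustrative sketch with arbitrary positive-definite factors A and B standing in for the correlation matrices, not the paper's implementation) confirms that the log-determinant and quadratic form of C = σ²I + A ⊗ B can be obtained from the eigendecompositions of the small factors alone:

```python
import numpy as np

# Sketch of the rotation trick of Stegle et al. (2011): factorize
# C = s2 * I + A ⊗ B using only the eigendecompositions of the small
# factors A (m x m) and B (n x n), instead of the dense (mn x mn) matrix.
rng = np.random.default_rng(seed=2)
m, n, s2 = 4, 3, 0.5

def random_spd(k):
    M = rng.standard_normal((k, k))
    return M @ M.T + k * np.eye(k)   # symmetric positive definite

A, B = random_spd(m), random_spd(n)
y = rng.standard_normal(m * n)

# Eigendecompositions of the small factors: O(m^3 + n^3)
sa, Ua = np.linalg.eigh(A)
sb, Ub = np.linalg.eigh(B)

# Diagonal of the rotated covariance: s2 + (sa ⊗ sb)
d = s2 + np.kron(sa, sb)

# Rotated data: (Ua ⊗ Ub)' y = vec(Ub' Y Ua), with y = vec(Y), Y of size n x m
Y = y.reshape(m, n).T                        # column-stacking convention
y_rot = (Ub.T @ Y @ Ua).T.reshape(-1)

logdet = np.sum(np.log(d))                   # log |C|
quad = np.sum(y_rot ** 2 / d)                # y' C^{-1} y

# Check against the dense computation, cost O(m^3 n^3)
C = s2 * np.eye(m * n) + np.kron(A, B)
assert np.isclose(logdet, np.linalg.slogdet(C)[1])
assert np.isclose(quad, y @ np.linalg.solve(C, y))
```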

SAMPLING
In the partially marginalized formulation of the VPMM introduced in the previous section, we update γ_b and ϑ with an alternating scheme. Update of ϑ: For efficient sampling, we use proposals based on Hamiltonian dynamics with a subsequent Metropolis-Hastings correction, known as Hamiltonian Monte Carlo (HMC, Neal 2011). In each case, the step size and the mass vector are learned during warm-up. We find that in some data settings the gradient of the unnormalized log-posterior with respect to log(κ_b) and log(κ_w) is numerically unstable, and better results are obtained by removing those parameters from the HMC step and instead sampling them with the Metropolis-Hastings algorithm using random-walk proposals. Similar to the HMC-based sampler, the step size of the random-walk proposals is tuned during warm-up.
Update of γ_b: Here, we use Gibbs sampling and draw γ_b from the full conditional (see Supplement 1).

SOFTWARE
The model is implemented in Python using the novel Liesel framework for Bayesian computation (Riebl et al. 2022). In particular, we use Goose, the MCMC library of Liesel. Goose provides a set of efficiently implemented and well-tested samplers capable of learning some tuning parameters, such as the step size, during warm-up. Different samplers can be associated with different parts of the parameter vector, allowing us to implement the sampling procedure described in Sect. 3.3 with minimal effort. Liesel facilitates using gradient-based samplers (e.g., HMC and NUTS) by taking advantage of automatic differentiation, which allows us to implement only the unnormalized log-posterior. However, using Liesel, we can, where necessary, integrate dedicated implementations incorporating the computational tricks discussed.

PREDICTIONS AT NEW LOCATIONS WITHIN THE OBSERVED PLOTS
These types of predictions are particularly valuable, as foresters could thin out their data collection within each plot or compensate for missing values within a plot. Consider observations y_ij available in each plot i = 1, …, m at the same locations indexed by j = 1, …, n, and predictions at t ∈ N new locations in each plot, indexed by j = n + 1, …, n + t. For notational clarity, we write y_i,j instead of y(s_i,j) in the remainder of this section.
To predict a random mt × 1 vector y_0 = (y_1,n+1, …, y_1,n+t, …, y_m,n+1, …, y_m,n+t)' associated with an mt × p matrix of predictors X_0, we start with the joint distribution of y = (y_1,n+1, …, y_1,n+t, y_1,1, …, y_1,n, …, y_m,n+1, …, y_m,n+t, y_m,1, …, y_m,n)', i.e., within each plot the new locations are listed before the observed ones. Moreover, we have y_1 = (y_1,1, …, y_1,n, …, y_m,1, …, y_m,n)'. The matrices X̃, Z̃, and R̃_w denote the design matrix, projection matrix, and within-correlation matrix analogous to X, Z, R_w, but augmented such that they include the new values associated with y_0. Now, the joint distribution of y given the model parameters ϑ and the between-area effect γ_b is

y | ϑ, γ_b ∼ N(X̃β + τ√a_b Z̃ γ_b, τ²(a_ε I_m(n+t) + (a_w I_m + a_int R_b) ⊗ R̃_w)).   (11)

The (n + t) × (n + t) correlation matrix R̃_w can be expressed as a block-matrix with the correlation matrices describing the within-correlation of the new observations and of the old observations on the diagonal and the correlation matrix between them on the off-diagonal. The conditional distribution of the predictions is given by

y_0 | y_1, ϑ, γ_b ∼ N(μ_0 + Σ_01 Σ_11^{−1}(y_1 − μ_1), Σ_00 − Σ_01 Σ_11^{−1} Σ_10).   (12)

Here, μ_0 and μ_1 refer to the components of the mean vector in Eq. (11) corresponding to y_0 and y_1, respectively. Similarly, the blocks Σ_kl, k, l = 0, 1, arise from the covariance matrix in Eq. (11) and refer to the covariance of y_k and y_l. Note that Bayesian prediction proceeds by sampling from the posterior predictive distribution p(y_0 | y_1) = ∫ p(y_0 | y_1, ϑ, γ_b) p(ϑ, γ_b | y_1) dϑ dγ_b. For each posterior sample of (ϑ', (γ_b)')', we draw y_0 from the corresponding distribution in Eq. (12).
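The conditional distribution in Eq. (12) is an instance of the standard conditional-normal formula, which can be sketched generically as follows (a didactic sketch that inverts Σ_11 directly; the actual implementation instead exploits the Kronecker structure):

```python
import numpy as np

# Generic conditional-normal computation underlying Eq. (12): given a
# jointly Gaussian (y0, y1) with mean blocks mu0, mu1 and covariance
# blocks S00, S01, S11, the distribution of y0 | y1 is again Gaussian.
def conditional_normal(mu0, mu1, S00, S01, S11, y1):
    """Mean and covariance of y0 | y1 for a jointly Gaussian (y0, y1)."""
    K = S01 @ np.linalg.inv(S11)        # kriging-type weights S01 S11^{-1}
    mean = mu0 + K @ (y1 - mu1)
    cov = S00 - K @ S01.T
    return mean, cov

# Toy one-dimensional example with correlation 0.5 and observed y1 = 2
mean, cov = conditional_normal(
    np.zeros(1), np.zeros(1),
    np.array([[1.0]]), np.array([[0.5]]), np.array([[1.0]]),
    np.array([2.0]),
)
assert np.isclose(mean[0], 1.0)     # 0 + 0.5 * (2 - 0)
assert np.isclose(cov[0, 0], 0.75)  # 1 - 0.5^2
```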

PREDICTIONS AT NEW PLOTS
Predictions can also be constructed for new plots. Suppose we want to predict t ∈ N new plots. To predict a random tn × 1 vector y_0 = (y_m+1,1, …, y_m+1,n, …, y_m+t,1, …, y_m+t,n)' associated with a tn × p matrix of predictors X_0, we start with the joint distribution of y = (y_0', y_1')', where y_1 = (y_1,1, …, y_1,n, …, y_m,1, …, y_m,n)'. The matrices X̃, Z̃, R̃_b, and the vector γ̃_b denote the design matrix, projection matrix, between-correlation matrix, and between effect analogous to X, Z, R_b, γ_b, but augmented such that they include the new values associated with y_0. We can factorize the predictive density into a term for y_0 given the augmented between effect γ̃_b and a term for the between effect at the new plots given γ_b. Assuming t < m, both predictive and model density can be evaluated in O(m³ + n³) using Stegle's method. Both factors of the predictive distribution relate to a conditional normal distribution. The second term conditions on an m-dimensional vector, thus requiring the factorization of an m × m matrix, which can be done in O(m³). The first term is more involved. Consider the joint distribution of y given the model parameters ϑ and γ̃_b,

y | ϑ, γ̃_b ∼ N(X̃β + τ√a_b Z̃ γ̃_b, τ²(a_ε I_(m+t)n + (a_w I_m+t + a_int R̃_b) ⊗ R_w)).   (13)

The (m + t) × (m + t) correlation matrix R̃_b can be expressed as a block-matrix with the between-correlation of the new plots and of the observed plots on the diagonal and the correlation between them on the off-diagonal. The conditional distribution of the predictions follows similarly to the previous section, with the blocks Σ_kl, k, l = 0, 1, forming the covariance matrix from Eq. (13). Once again, all terms involving the inversion of Σ_11 can be efficiently computed using Stegle's method, thus reducing the computational complexity from O(m³n³) to O(m³ + n³).
In the absence of an interaction effect, the predictive distribution of y_0 reveals that the expected value of y_0 is constant within each plot. Moreover, the predictive distribution suggests a potential reduction in the uncertainty of the predicted values in the presence of an interaction effect. Therefore, an improvement in the predictions compared to a model focusing only on between-plot effects is expected if a considerable share of the total variance is attributed to the interaction.

SIMULATION
In this section, we present a simulation study that evaluates the performance of the VPMM in terms of the bias of all estimated parameters, as well as the corresponding number of effective MCMC samples (Geyer 2011). We assume that the Data Generation Model (DGM) and the Data Analysis Model (DAM) are identical and follow the VPMM. We evaluate the performance of the VPMM for: (1) increasing sample sizes m and n; (2) different true weight vectors a; and (3) increasing variance τ². Objective (1) is to find thresholds for m and n at which the parameters of the model can be accurately estimated. Objective (2) aims to identify potential identification problems between the multiple spatial effects, or any tendency of the spatial effects to degenerate into i.i.d. processes, even if the priors on the range parameters avoid small values. Finally, in (3) we investigate how different variances affect the estimated parameters.

DGM
We expect the models to perform well even for n relatively smaller than m, since the within-plot observations have m replicates. Given this, we consider m ∈ {30, 40} and n ∈ {10, 25}. These are also close to the sample sizes in Sect. 6 (see also Supplement 6). We consider τ² ∈ {1, 2}, and the partitionings of the variance a = (a_b, a_w, a_int, a_ε)' are such that a ∈ {(0.35, 0.35, 0.2, 0.1)', (0.25, 0.55, 0.05, 0.15)', (0.70, 0.05, 0.05, 0.20)'}. The first vector of weights represents a well-behaved scenario that we expect to be easy to estimate for any reasonable sample size. The second vector of weights sets the interaction weight close to zero, a scenario that is realistic for data structures that do not lead to stronger correlation for the same locations in different plots. The scenario with a_int = a_w ≈ 0 represents a standard model used in forest sciences for inventory data, where one simply has a random intercept for the plots, although this spatial effect is typically not spatially correlated. This scenario also aims at identifying any potential identification issues between spatial effects or tendencies to degenerate, e.g., the within effect degenerating to white noise by having low values for the spatial range instead of being assigned a weight of zero. Some parameters are kept fixed: κ_b = 3/0.5, κ_w = 3/0.7, β_1 = 1, and β_2 = 0.5. Moreover, x(s_ij) ∼ N(0, 1). We consider 50 replicates.

DAM
The prior hierarchy follows Sect. 3.1. Since we resize every S and S_i such that S ⊂ [0, 1] × [0, 1] and S_i ⊂ [0, 1] × [0, 1] for all i, we set ρ_b(0.95) and ρ_w(0.95) in Eq. (8) to the maximum diameter of the corresponding space, i.e., ρ_b(0.95) = ρ_w(0.95) = 1, and ρ_b(0.5) = ρ_w(0.5) = 0.5. We run two MCMC chains, each with 2000 MCMC samples and a warm-up of 2000 samples. Convergence is confirmed by verifying that the R-hat (Gelman and Rubin 1992) is smaller than 1.1, as well as by checking the smallest effective sample size out of all the model's parameters, based on the median effective sample size over all MCMC samples (Geyer 2011; Gelman et al. 2013).

RESULTS
Results are summarized in Figs. 3 and 4. The main conclusions are the following. Sample size and weights: Scenarios with a = (0.35, 0.35, 0.2, 0.1)' lead to unbiased estimates of all parameters for all sample sizes. The same is true for a = (0.25, 0.55, 0.05, 0.15)', except for n = 10, where there is a slight tendency for the within weight to be underestimated and the interaction weight to be overestimated. In the same direction, for a = (0.70, 0.05, 0.05, 0.20)' the within-plot range is underestimated for n = 10, although it remains far from zero. This underestimation ultimately leads to slightly biased within and nugget weights, suggesting some tendency for the within effect to behave similarly to the nugget in situations in which it has a low weight and n is small. However, the priors used for the range prevent the degeneration of the within effect to white noise. Given n, both values of m behave similarly well, indicating that m = 30 is already large enough to recover all true model parameters. All in all, for n = 10, some parameters might be slightly biased in less well-behaved scenarios (some weights close to zero), but a sample size of n = 25 is sufficient to recover unbiased estimates of all parameters.

Variance:
The two values of the variance τ² lead to nearly identical results for the distribution of the bias of all parameters, except for the dispersion of β_2, which is larger for larger τ². Thus, we refrain from presenting these results in the main text (see Supplement 5).

Convergence:
The smallest median effective sample size is far above 100 for all scenarios. We follow the argumentation of Gelman et al. (2013, p. 267), considering this enough for "reasonable posterior summaries" and, in particular, for posterior mean estimates. The R-hat value is also below 1.1 for all the results presented. Note that no thinning was used.

APPLICATION
We consider a German forest inventory dataset from the BIOKLIM Project. We model forest cover of Vaccinium myrtillus, also known as European blueberry. The data were collected in the Bavarian Forest National Park in m = 30 plots of 200 m². In each plot, there are n = 8 observations distributed equally spaced on a circle. The structure of the data within and between plots is shown in Fig. 1.
The plots are distributed along four straight transects following the altitude gradient, such that the inclination should be roughly similar for the same location in different plots (as implied by the spatial effect γ_int(•) in Eq. (2)). The existence of distribution patterns along altitudinal gradients at large spatial scales remains disputed, partly because most models to date ignore potential spatial dependencies (Bässler et al. 2010). However, data collected along a transect with neighboring sampling points are likely to be spatially correlated. Thus, it makes sense to account for spatial dependence at both larger and smaller scales.

GENERAL SETTING
For the application, the VPMM follows the structure

f(y(s_ij)) = β_1 + β_2 elev(s_i) + τ(√a_b γ_b(s_i) + √a_w γ_w(s_ij − s_i) + √a_int γ_int(s_i, s_ij − s_i) + √a_ε ε_ij),

where y(s_ij) is the forest cover, which is subject to a transformation f(•). Particularly, f(y(s_ij)) = (h ∘ g)(y(s_ij)), such that g(y(s_ij)) = log(y(s_ij) + 1) and h(•) additionally standardizes g(y(s_ij)). We include standardized elevation (elev) as a covariate in the model. Elevation is only available at the plot level and is thus indexed by s_i.
We run two MCMC chains, each with 5000 MCMC samples, including a warm-up of 2000 samples. Convergence is confirmed by verifying that the R-hat (Gelman and Rubin 1992) is smaller than 1.1 and by checking the smallest number of effective samples out of all parameters.

EVALUATION CRITERIA
To assess the quality of the predictions for new locations and plots, we consider the mean squared error (MSE) in a leave-t-out cross-validation (CV) setting. Additionally, we also consider the logarithmic score. Consider the case of new locations within a plot; the case of new plots follows similarly. To obtain the CV-MSE, the data are divided into training and test data by randomly selecting t of the n available within-plot locations for the test data. The remaining locations are used for training. This is repeated until there are fewer than t observations available that were not previously used for testing. The quality of the predictions is assessed using the posterior mean of the MSE with respect to the conditional mean (CV mean) and the posterior predictions (CV sample) (see Sect. 4). We choose t = 1 for within-plot predictions, which implies that roughly 12.5% of the data is used for testing. For predictions on new plots, we use t = 3, which corresponds to 10% of the plots. Additionally, we consider cases of the VPMM with a_int = 0 and with a_int = a_w = 0, with an adjusted prior for a. The full-sample logarithmic score (log score) follows

log( (1/S) Σ_{s=1}^{S} p(y | ϑ^{(s)}) ),

where s = 1, …, S indexes the MCMC samples and ϑ^{(s)} denotes the s-th MCMC sample of ϑ. Compared to Eq. (10), we also marginalize γ_b(•). The between effect is also marginalized in the model from Eq. (15). The full-sample log score omits the leave-one-out idea, as it has been shown that the full-sample option can have better small-sample model discrimination ability than the cross-validated one (Krnjajić and Draper 2014), and it is computationally cheaper than doing CV.
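On the log scale, the Monte Carlo average in the log score is conveniently computed with the log-sum-exp trick; the following sketch uses hypothetical log-likelihood values:

```python
import numpy as np
from scipy.special import logsumexp

# Full-sample log score: average the likelihood over MCMC draws on the
# log scale for numerical stability (log-likelihood values hypothetical).
log_p = np.array([-105.2, -103.8, -104.5, -104.1])   # log p(y | theta^(s))
log_score = logsumexp(log_p) - np.log(len(log_p))

# Equivalent to the log of the direct average, but robust to underflow
# when the log-likelihoods are very negative.
assert np.isclose(log_score, np.log(np.mean(np.exp(log_p))))
```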

RESULTS
Recall that we rescale S and every S_i, i ∈ {1, …, m} (see Sect. 6.1 and Fig. 5). The results are shown in Tables 1 and 2. In the VPMM, the posterior mean of τ² is 1.15. The results indicate that approximately 15% of the variance is attributed to the between effect, 35% to the within effect, 15% to the interaction effect, and 34% to the nugget. The spatial range is 0.52 for the within effect. Figure 6 confirms that spatial dependence within plots is mostly present for direct neighbors. Large-scale dependence is also present: the model yields a spatial range of 0.30 (approximately one-third of the edge length of the unit square) for the between effect, which covers most of each respective transect. Moreover, as expected, the interaction effect plays a relevant role. Since the plots are located along altitude gradients, the same locations on different plots are thought to have similar inclinations, thus inducing spatial correlation that can be explained by the space-space interaction.
Concerning the non-spatial multi-level model, the mean variance of the random intercept is 0.09 and thus rather small, given that the response is standardized. The remaining variance is attributed to the nugget. The credible interval (C.I.) for elevation includes zero in both models. The non-spatial multi-level model therefore seems rather inappropriate for these data, since most of the behavior is explained by the nugget.
The evaluation criteria also point toward a better performance of the VPMM. Indeed, the log score is higher for the VPMM, and all CV-MSEs are lower (or equal in one instance) for the VPMM compared to the non-spatial model. In general, while the VPMM often does not outperform the three competitors in terms of the mean CV-MSE, clear differences are visible for the sample version. This might be because the within and interaction effects are marginalized in our model, such that the sample version shows the differences between the two models more clearly. As speculated in Sect. 4.2, the interaction effect is particularly helpful in predictions for new plots.

DISCUSSION
In this paper, we develop a computationally feasible multi-level spatial model that accounts for dependence at multiple spatial scales: the VPMM. The model includes a data-driven approach to determine which (spatial) effects are relevant for a specific dataset. The results of the simulation study show that we can recover all true parameters of the VPMM, given a sufficiently large within-plot sample size (shown for n ≥ 25). In the applications, we also demonstrate how the VPMM fulfills its purpose of improving the interpretability of irregular spatial data by providing separate range parameters for the different scales.
Future work should consider additional extensions of the VPMM. First and foremost, the current version of the model assumes the same set of locations within each plot. This assumption should be relaxed to flexibly deal with any sampling design in continuous space, for example, by using basis function approaches within plots as in Lindgren et al. (2011) and Morris et al. (2019). Such an extension would also make predictions at new plots or at different locations within each plot much more flexible. We suggest first steps in Supplement 4.
Second, forest inventory data are often collected coarsely over time. Therefore, an extension of the VPMM toward space-time that further exploits the method in Stegle et al. (2011) could be investigated. A first tentative outline is presented in Supplements 2 and 3. Indeed, the technique used to reduce the computational complexity of the model by reformulating the normal likelihood could, in general, also be used in a space-time context.
Concerning the prior structure, it would make sense to extend the joint prior for the random effects to the fixed effects. There is, however, a need to rethink the concept of total variance, since the amount of variance explained by the fixed effects is determined by their coefficients, not their variances. It is worth noting that, although we present a forestry example, the resulting methods can be applied to many other areas of research where data of a similar structure are collected (e.g., agriculture).

Figure 1 .
Figure 1. Locations of 30 identically sized plots distributed over a large spatial range (left panel) with an identical distribution of observation locations within the plots (the right panel magnifies the area marked by the red circle in the left panel). The plots are located along straight transects following the altitude gradient. The data collected included more plots, but we restricted the dataset to plots with European blueberries.

Figure 2 .
Figure 2. Density of the prior for the spatial ranges. We consider ρ_b(0.95) = 1 and test different values for the range ρ_b(0.5), as shown in the legend. It follows similarly for ρ_w.

Figure 3 .
Figure 3. Boxplots of the estimated posterior means of ρ_w and ρ_b calculated over 50 replicates for scenarios with τ² = 2. On the x-axis, we show the different sample sizes (m, n). The columns show three different scenarios with different true weights a = (a_b, a_w, a_int, a_ε), and the rows show the estimated values for each range ρ. The dashed lines show the true values.

Figure 4 .
Figure 4. Boxplots of the estimated posterior means of a = (a_b, a_w, a_int, a_ε) calculated over 50 replicates for scenarios with τ² = 2. On the x-axis, we show the different sample sizes (m, n). The columns show three different scenarios with different true weights a, and the rows show the estimated values for each weight. The dashed lines show the true values.

Figure 5 .
Figure 5. We rescaled the domain S to the unit square for interpretability purposes.

Figure 6 .
Figure 6. Data within each of the 30 plots in the application. From each observation, we removed the mean of the corresponding plot.

Table 1 .
Posterior mean estimates and equal-tailed 90% credible intervals for the VPMM and the non-spatial multi-level model

Table 2 .
Mean and sample-based CV criteria for the models, where a int = 0 and a w = a int = 0 correspond to the VPMM with these weights set to zero