Pivotal discrepancy measures for Bayesian modelling of spatio-temporal data

Within the field of geostatistics, Gaussian processes are a staple for modelling spatial and spatio-temporal data. The statistical literature is rich with estimation methods for the mean and covariance of such processes, in both frequentist and Bayesian contexts. Considerably less attention has been paid to developing goodness-of-fit tests for assessing model adequacy. Jun et al. (Environmetrics 25(8):584–595, 2014) introduced a statistical test that uses pivotal discrepancy measures to assess goodness-of-fit in the Bayesian context. We present a modification and generalization of their test. The original method partitions the data spatially, then evaluates a pivotal discrepancy measure at each posterior draw to obtain a posterior distribution of pivotal statistics; order statistics from this distribution are used to obtain approximate p-values. Jun et al. (Environmetrics 25(8):584–595, 2014) use arbitrary partitions of equal size based on pre-existing spatial boundaries. Our contribution is two-fold: we use K-means clustering to create the spatial partitions, and we generalise Jun et al.'s approach to incorporate unequal partition sizes. Observations from a spatial or spatio-temporal process are partitioned into subsets (not necessarily of the same size) using an appropriate feature vector that incorporates their geographic locations. The method's viability is illustrated in a simulation study, and in an application to hoki (Macruronus novaezelandiae) catch data from a survey of the sub-Antarctic region.


Introduction
The literature that surrounds the subject of spatial and spatio-temporal statistics is predominantly concerned with parametric inference for the covariance structure. Within geostatistics (where data are observed at specific locations in space and time), a great deal of attention has been paid to proposing and describing new spatial and spatio-temporal models, studying their characteristics, and developing estimation methods from within both frequentist and Bayesian frameworks.
A variety of examples can be found in the literature. For instance, parametric spatial and spatio-temporal covariance models that assume stationarity have been developed and estimated in Cressie and Huang (1999), Gneiting (2002), and Stein (2005). Cressie and Huang (1999) derived theoretical results on positive definiteness of the stationary spatio-temporal covariance function. This led to the proposal of a class of non-separable parametric covariance functions, which they applied to the problem of mapping the east-west component of wind speed over a region in the tropical western Pacific Ocean. Gneiting (2002) extended the work of Cressie and Huang (1999) further and provided a more general class of covariance functions that do not depend on Fourier transform pairs. Their work was illustrated through an application to Irish wind data. Stein (2005) extended these ideas further to consider spherical functions, applied to the same Irish wind dataset used in Gneiting (2002). Non-stationary approaches have been proposed and investigated in Paciorek (2013), Ecker et al. (2013), and Fouedjio (2017). A Bayesian hierarchical approach to modeling spatial and spatio-temporal covariance functions was taken in Cameletti et al. (2011), Cameletti et al. (2013), Sahu and Bakar (2012), and Banerjee et al. (2014). Each of these papers makes use of the flexible Matérn class of covariance functions. White and Ghosh (2009) extended the conditional autoregressive model for areal data proposed in Besag (1974) to geostatistical data, and introduced the stochastic neighbourhood conditional autoregressive model. An overview of the state of Bayesian hierarchical spatio-temporal models is given in Gelfand and Banerjee (2017).
Markedly less attention has been paid to developing goodness-of-fit tests that identify model mis-specification or allow for selection of a "best" model. Model mis-specification in the context of parametric covariance models for spatio-temporal processes refers to models that have an incorrect mean function or covariance structure. At present, there is no generalized formal theory for assessing goodness-of-fit for spatio-temporal models that are defined using parametric covariance functions. Instead, there is a range of criteria and tests that have been used when fitting a spatiotemporal covariance model to data.
The literature has seen use of the Akaike information criterion, AIC (Akaike 1973), and the Bayesian information criterion, BIC (Schwarz 1978), which are popular model selection tools for a wide range of frequentist analyses. Huang et al. (2007) proposed model comparison for space-time models using these criteria, and investigated their usefulness through simulation and an application to surface shortwave radiation budget analysis. Another criterion, the deviance information criterion, DIC (Spiegelhalter et al. 2002), was used by Pollice (2011) to compare multivariate receptor models for identifying the spatial locations of major PM 10 pollution sources. To compare predictive capabilities, use of mean square and root mean square prediction errors at fixed times was illustrated in Huang et al. (2007). Further, Sahu and Bakar (2012) applied the predictive model choice criterion (PMCC), which included a term for model complexity. In more recent times, we have seen the proposal and use of the widely applicable information criterion, WAIC (Watanabe 2010). The WAIC is an information criterion constructed in the same vein as DIC, but is fully Bayesian as it uses the entire posterior distribution, unlike the DIC, which is based on a point estimate. Vehtari and Gelman (2014) and Vehtari et al. (2017) adopted WAIC as a method for approximating leave-one-out cross validation for model goodness-of-fit.
These model selection/goodness-of-fit criteria are inappropriate for some spatiotemporal models. The WAIC relies on a partition of a data set into individual data points, and its computation involves a sum of the log-posterior evaluated over all data points and posterior draws. For spatial or spatio-temporal data, partition into individual data points is not valid due to the inherent structure in the data. In addition, Jun et al. (2014) illustrated that global goodness-of-fit tests applied to the entire dataset provide limited power to detect model mis-specification when applied to spatial data. Bastos and O'Hagan (2009) proposed numerical and graphical diagnostic tools for Bayesian model checking in the context of Gaussian processes. A Bayesian approach for assessing goodness-of-fit for Gaussian random fields (GRFs) based on pivotal discrepancy measures was introduced by Johnson (2007) and further discussed in Yuan and Johnson (2012) and Jun et al. (2014). The approach can be used for GRFs with stationary and nonstationary covariances, and for data observed at regularly or irregularly spaced locations. More recently, Lobo and Fonseca (2020) used a cross-validation approach to assess goodness-of-fit of spatial models. These approaches, however, consider only spatial data. In this paper, we extend the approach described in Jun et al. (2014) to assess goodness-of-fit of Bayesian spatio-temporal models using pivotal discrepancy measures. Jun et al. (2014) proposed the method of partitioning the data to increase the power to detect model mis-specification, but used equal partition sizes. The primary innovations in this article are to extend the Jun et al. (2014) approach to partitions of unequal sizes and the use of K-means partitioning as a method for inducing homogeneity within partitions, when there are no pre-set spatial boundaries.
The paper is divided into the following sections. We introduce and define the general spatio-temporal model in Sect. 2. In Sect. 3, we present the pivotal discrepancy measure for a spatio-temporal model evaluated at a sample from the posterior distribution. Further, we present the pivotal discrepancy measure for subsets of data of unequal size. Section 4 is dedicated to evaluation of the method using a simulation study. This is followed in Sect. 5 by an application and assessment of spatio-temporal models to hoki catch weight data. We discuss the findings of the paper in Sect. 6.

Spatio-temporal Gaussian process model

We are interested in modelling the covariance structure of an observed univariate spatio-temporal process {y(s, t) : (s, t) ∈ R^d × R}. The Gaussian process model is commonly used due to its flexibility in modelling the effect of relevant covariates as well as dependence over time and space, and is defined in several articles. Following the notation of Cameletti et al. (2011) and Sahu and Bakar (2012), we assume that an observation y(s_i, t), measured at location s_i, where i = 1, ..., n, and time t = 1, ..., T, can be modelled by a Gaussian process with measurement equation

y(s_i, t) = x(s_i, t)'β + Z(s_i, t) + ε(s_i, t),

where x(s_i, t) = (1, x_1(s_i, t), ..., x_p(s_i, t))' denotes the (p+1)-dimensional vector of covariates for location s_i at time t, β = (β_0, β_1, ..., β_p)' is the coefficient vector, and n and T are the numbers of spatial locations and time points respectively. The residual is partitioned into two components, one spatial (Z(s_i, t)) and one non-spatial (ε(s_i, t)). The measurement error (nugget effect), ε(s_i, t), is modelled independently as a white noise process, N(0, σ²_ε). Lastly, Z(s_i, t) is a realization of a latent spatio-temporal process modelled by a Gaussian process that changes in time with first-order autoregressive dynamics and coefficient ρ as follows,

Z(s_i, t) = ρ Z(s_i, t − 1) + ω(s_i, t).

Furthermore, ω(s_i, t) is modelled by a zero-mean Gaussian distribution, in which we assume temporal independence.
It is characterized fully by the spatio-temporal covariance function

Cov(ω(s_i, t), ω(s_j, t*)) = 0 for t ≠ t*, and Cov(ω(s_i, t), ω(s_j, t)) = σ²_ω R(s_i, s_j; φ) for i ≠ j,   (1)

where t and t* are two different time points. The parameter σ²_ω denotes the spatial variance and R(·) is a correlation function that depends on the parameter vector φ, such that the resulting correlation matrix, R, is positive definite. Note that R is an n × n matrix with elements R(s_i, s_j; φ). We implicitly make the assumption that the overall spatial covariance structure of the data is constant over time. Also note that we make no assumption of spatial stationarity or isotropy, as evidenced by the spatial covariance function in Eq. 1.
By collecting all the observations measured at time t into a vector denoted by y_t = (y(s_1, t), ..., y(s_n, t))', we can write

y_t = X_t β + Z_t + ε_t,  ε_t ~ N(0, σ²_ε I_n),  t = 1, ..., T,   (2)

where X_t = (x(s_1, t), ..., x(s_n, t))' is the n × (p+1) covariate matrix, σ²_ε is the variance of the nugget effect, and I_n is the identity matrix with dimension n. As before, the spatio-temporal process is decomposed into spatial and temporal terms,

Z_t = ρ Z_{t−1} + ω_t,  ω_t ~ N(0, σ²_ω R),   (3)

for t = 2, ..., T and positive definite correlation matrix R. Also let θ = (β, σ²_ε, ρ, σ²_ω, φ) denote the vector of parameters. Z_1 comes from the stationary distribution of the AR(1) process,

Z_1 ~ N(0, (σ²_ω / (1 − ρ²)) R).   (4)

From Eqs. 2–4 we can then write the marginal distribution of y_t (given the parameters) as

y_t | θ ~ N(X_t β, σ²_ε I_n + (σ²_ω / (1 − ρ²)) R).   (5)

We now present a pivotal discrepancy measure for spatio-temporal data y_t based on a posterior sample of the parameter vector θ.
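The generative model above is straightforward to simulate. The paper's computations were done in R with NIMBLE; the following is a minimal Python/NumPy sketch (not the authors' code) that draws y_t from Eqs. 2–4 with an exponential spatial correlation, using parameter values chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

n, T, p = 20, 5, 1                       # locations, time points, covariates
beta = np.array([0.5, 1.0])              # (p+1)-dimensional coefficient vector
rho, sig2_w, sig2_e, phi = 0.7, 1.0, 0.01, 0.2   # illustrative values

s = rng.uniform(size=(n, 2))                              # spatial locations
d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=2)  # distance matrix
R = np.exp(-d / phi)                                      # exponential correlation

# one covariate matrix X_t per time point (intercept + one covariate)
X = [np.column_stack([np.ones(n), rng.uniform(size=n)]) for _ in range(T)]

# Z_1 from the stationary AR(1) distribution (Eq. 4), then Eq. 3 forward in time
L = np.linalg.cholesky(sig2_w * R)
Z = [np.sqrt(1.0 / (1 - rho**2)) * (L @ rng.standard_normal(n))]
for t in range(1, T):
    Z.append(rho * Z[-1] + L @ rng.standard_normal(n))

# measurement equation (Eq. 2): add covariate effect and nugget noise
y = [X[t] @ beta + Z[t] + np.sqrt(sig2_e) * rng.standard_normal(n)
     for t in range(T)]
```

The same draws could be produced from the marginal form (Eq. 5) directly; simulating through the AR(1) recursion makes the temporal dynamics explicit.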

Pivotal discrepancy measures
Let θ^(m) represent the mth draw of the parameter vector θ from the posterior distribution π(θ | y), where y = {y_1, ..., y_T} and m = 1, ..., M, with M the total number of posterior draws. We can construct the pivotal quantity

S(y_t, θ^(m)) = (y_t − X_t β^(m))' Σ(θ^(m))^{−1} (y_t − X_t β^(m)),   (6)

where Σ(θ) = σ²_ε I_n + (σ²_ω / (1 − ρ²)) R is the marginal covariance matrix from Eq. 5.
Then S(y_t, θ^(m)) is χ²-distributed on n degrees of freedom (Johnson 2007). Jun et al. (2014) highlighted two complications that arise when Eq. 6 is used in a Bayesian goodness-of-fit test for spatial models, and these also arise in the spatio-temporal case. The first complication is that a test based on the statistic in Eq. 6 typically provides little power to detect model mis-specification when it is applied globally to the entire data vector. Jun et al. (2014) illustrated this through an example in which a simple Bayesian linear regression was fitted to a fictional dataset that exhibited larger variability at the extreme values of a covariate and smaller variability around the mean of the covariate. Their test, based on the spatial equivalent of the statistic in Eq. 6, was unable to detect the departure of the data from the simple linear model. This was due to the cancellation of the large and small contributions from the residuals when the statistic S(y_t, θ^(m)) was applied to the entire data set. Jun et al. (2014) proposed a partitioning strategy, in which the chi-squared diagnostic was constructed using residuals from distinct regions of the spatial domain. With this strategy, lack of fit of the model to the data in each partition was correctly detected and the goodness-of-fit test failed. Partitioning of the data was further motivated by a simulation test and by applications to Colorado precipitation data and total column ozone data. We propose an extension of their strategy in Sect. 3.1. The second complication is how to combine the pivotal discrepancy measures based on many posterior draws and a partitioned dataset when conducting a goodness-of-fit test. A single posterior draw, θ^(m), from the posterior distribution based on a non-partitioned dataset gives the statistic S(y_t, θ^(m)) ~ χ²_n. Each posterior draw gives a different value of the test statistic, and these values will be correlated. Jun et al. (2014) proposed diagnostics based on bounds on the distribution of order statistics (Caraux and Gascuel 1992; Rychlik 1992) to carry out a goodness-of-fit test that makes use of multiple correlated statistics obtained from the posterior draws. We adopt that approach in this article.
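To make the pivotal quantity concrete, the following Python sketch (an illustration, not the authors' implementation) evaluates S(y_t, θ) as the squared Mahalanobis distance under the marginal covariance of Eq. 5, and checks empirically that, at the true parameter values, its average over simulated data vectors is close to n, as the nominal χ²_n distribution implies.

```python
import numpy as np

def pivotal_stat(y_t, X_t, beta, sig2_e, sig2_w, rho, R):
    """S(y_t, theta): squared Mahalanobis distance under the marginal of Eq. 5."""
    Sigma = sig2_e * np.eye(len(y_t)) + (sig2_w / (1 - rho**2)) * R
    r = y_t - X_t @ beta
    return r @ np.linalg.solve(Sigma, r)

# Empirical check with illustrative values: at the true parameters the
# statistic is exactly chi^2_n, so its sample mean should be close to n.
rng = np.random.default_rng(2)
n, rho, sig2_e, sig2_w = 15, 0.5, 0.1, 1.0
s = rng.uniform(size=(n, 2))
d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=2)
R = np.exp(-d / 0.3)                              # exponential correlation
X = np.column_stack([np.ones(n), rng.uniform(size=n)])
beta = np.array([1.0, -0.5])
Sigma = sig2_e * np.eye(n) + (sig2_w / (1 - rho**2)) * R
L = np.linalg.cholesky(Sigma)
S = [pivotal_stat(X @ beta + L @ rng.standard_normal(n),
                  X, beta, sig2_e, sig2_w, rho, R) for _ in range(2000)]
```

In the actual test, the same function would be evaluated at each posterior draw θ^(m) rather than at the truth, giving the correlated collection of statistics discussed above.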

Partitioning the observed locations into C subsets (not necessarily of equal size)
Jun et al. (2014) proposed partitioning the set of observed locations into C subsets of equal size w, and showed that partitioning the observation vector into regions of high and low variability allowed the test to detect model mis-specification. They suggest partitioning based either on prior knowledge regarding regions of likely homogeneity, or according to well-defined spatial boundaries. We extend their approach to partitions of spatio-temporal data, with partitions of the spatial domain of unequal sizes, w_j, j = 1, ..., C. Applying Eq. 6 to the partitioned spatio-temporal data gives

S_j(y_tj, θ^(m)) = (y_tj − μ_tj)' Σ_j(θ^(m))^{−1} (y_tj − μ_tj),   (7)

where Σ_j(θ) = σ²_ε I_{w_j} + (σ²_ω / (1 − ρ²)) R_j, for t = 1, ..., T, j = 1, ..., C, and m = 1, ..., M, and where y_tj, μ_tj, and R_j denote the parts of Eq. 6 corresponding to subset j, and w_j is the number of observed locations in subset j.
In addition to allowing the subsets to vary in size, we propose use of the K-means clustering algorithm (Alsabti et al. 1997) to partition the spatial domain in cases where there are no a priori well-defined spatial boundaries. We design the algorithm (see Algorithm 1 below) to minimise intra-cluster distances from the cluster centroid, thus creating subsets of locations that are likely to exhibit more homogeneity than the entire domain as a whole. Other clustering algorithms can be used, and we consider these alternatives briefly in the Discussion section.
Algorithm 1 K-means clustering
1: Choose an appropriate number of subsets, C, to partition the data into.
2: Randomly select the initial cluster centroids, γ_1, ..., γ_C ∈ R^d, where γ_j is a d-dimensional location vector that will result in subsets that each exhibit more homogeneity than the full dataset. For example, γ_j = s̄_j, where s̄_j is the location vector for the centroid of the jth subset.
3: Repeat until convergence:
   (a) Assign each observation to the cluster with the nearest centroid, c_i = argmin_j ||u_i − γ_j||², where u_i is the location vector for observation i.
   (b) Update each centroid to the mean of the locations assigned to its cluster, γ_j = mean{u_i : c_i = j}.
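A dependency-light sketch of the algorithm above (Lloyd's iteration on spatial coordinates) might look as follows; the cluster centres and data below are illustrative only.

```python
import numpy as np

def kmeans(locs, C, iters=100, seed=0):
    """K-means (Lloyd's algorithm) on spatial coordinates; returns labels, centroids."""
    rng = np.random.default_rng(seed)
    centroids = locs[rng.choice(len(locs), C, replace=False)]  # random init from data
    for _ in range(iters):
        # assignment step: nearest centroid by Euclidean distance
        dists = np.linalg.norm(locs[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: centroid = mean of assigned locations (keep old if empty)
        new = np.array([locs[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(C)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# three loose clusters of locations in the plane
rng = np.random.default_rng(3)
locs = np.vstack([rng.normal([0, 0], 0.1, (10, 2)),
                  rng.normal([1, 1], 0.1, (12, 2)),
                  rng.normal([0, 1], 0.1, (8, 2))])
labels, cents = kmeans(locs, C=3)
```

Note that K-means can converge to a local optimum depending on the initial centroids; in practice, library implementations rerun the algorithm from several random starts.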

Nominal distribution of the ordered pivotal statistics
The screening diagnostics we use are based on bounds of order statistics given in Proposition 3 in Caraux and Gascuel (1992). These bounds are applied to non-identically distributed dependent variables, thereby allowing us to generalise the diagnostics proposed by Jun et al. (2014).
Let X_1, ..., X_N denote a sample from a dependent set of N random variables with non-identical distribution functions, F_{X_1}, ..., F_{X_N}. We define a set of order statistics X_(1), ..., X_(N) for these random variables. Also, let F_{x_{r:N}} denote the distribution function for the rth-order statistic out of a sample of N dependent draws from F_{X_1}, ..., F_{X_N}. Then,

max(0, (Σ_{i=1}^N F_{X_i}(x) − r + 1) / (N − r + 1)) ≤ F_{x_{r:N}}(x) ≤ min(1, (Σ_{i=1}^N F_{X_i}(x)) / r).   (8)

We partition the spatial domain into C groups of potentially unequal sizes w_1, ..., w_C, and collect the N = CTM pivotal statistics {S_j(y_tj, θ^(m)) : j = 1, ..., C, t = 1, ..., T, m = 1, ..., M}. We denote the rth-order statistic from this set by S_(r), where r = 1, ..., CTM, and let F_j denote the distribution function of the χ²_{w_j} distribution. It follows from above that

max(0, (TM Σ_{j=1}^C F_j(x) − r + 1) / (CTM − r + 1)) ≤ F_{S_(r)}(x)   (9)

and

F_{S_(r)}(x) ≤ min(1, (TM Σ_{j=1}^C F_j(x)) / r).   (10)
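The bounds can be computed directly from the χ² distribution functions. The Python sketch below is an illustration; to keep it free of special-function libraries it uses the closed-form χ² cdf available for even degrees of freedom (an assumption made only for this example — in general the w_j may be odd, in which case a numerical cdf would be used).

```python
import math

def chi2_cdf_even(x, df):
    """Chi-square cdf for even df (closed form via the Erlang distribution)."""
    k = df // 2
    assert df == 2 * k, "closed form requires even degrees of freedom"
    s = sum((x / 2) ** i / math.factorial(i) for i in range(k))
    return 1.0 - math.exp(-x / 2) * s

def order_stat_bounds(x, r, sizes, T, M):
    """Caraux-Gascuel bounds on the cdf of the r-th order statistic of the
    N = C*T*M dependent pivotal statistics; each subset j contributes T*M
    values that are nominally chi^2 with w_j degrees of freedom."""
    N = len(sizes) * T * M
    tot = T * M * sum(chi2_cdf_even(x, w) for w in sizes)  # sum of the N cdfs
    lower = max(0.0, (tot - r + 1) / (N - r + 1))
    upper = min(1.0, tot / r)
    return lower, upper
```

In the goodness-of-fit test, these bound functions are inverted numerically (e.g. by bisection over x) to obtain the nominal percentiles against which the observed order statistics are compared.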

Pivotal discrepancy measure goodness-of-fit test for Bayesian inference
We propose the following procedure for testing goodness-of-fit for Gaussian spatio-temporal models:
1. Partition the set of observed locations into C subsets, Q_j of size w_j, where j = 1, ..., C, using K-means clustering. For each t = 1, ..., T, let y_tj, X_tj, and R_j denote the parts of Eq. 5 that correspond to subset Q_j.
2. Generate posterior samples θ^(1), ..., θ^(M) for θ, based on the entire observed dataset (y_1, ..., y_T).
3. For every sampled parameter vector θ^(i) and each data subset y_tj, for every t, calculate the pivotal statistic in Eq. 7.
4. Collect all CTM statistics in an ordered set {S_j(y_tj, θ^(i)) : j = 1, ..., C, t = 1, ..., T, i = 1, ..., M}, and denote the kth-order statistic from this set by S*_(k).
5. Perform a two-sided goodness-of-fit test at significance level α by first specifying integers l and u such that 1 ≤ l < u ≤ CTM. Then determine t_l such that the upper bound in Eq. 10 evaluated at t_l is as close as possible to α/2, and t_u such that the lower bound in Eq. 9 evaluated at t_u is as close as possible to 1 − α/2. If either S*_(l) < t_l or S*_(u) > t_u, then the assumed model can be rejected in a two-sided test of size α. Jun et al. (2014) recommend that l and u be selected such that l = r_l CTM and u = r_u CTM, where 0 < r_l < r_u < 1, for example r_l = 0.1 and r_u = 0.9.

Simulation
A simulation experiment was performed to assess the ability of the goodness-of-fit test to detect mis-specification of the covariance structure of a model. A total of 30 pairs of longitude and latitude values, s_i = (s_i1, s_i2), i = 1, ..., n = 30, were sampled randomly from one of three subsets within the unit square. Within the first subset, Q_1, 5 locations were generated uniformly from the lower-left [0, 0.2] × [0, 0.2] portion of the unit square. In the second subset, Q_2, 10 locations were uniformly sampled from the lower-right [0.8, 1] × [0, 0.2] portion of the unit square. Finally, in subset Q_3, 15 locations were uniformly sampled from the entire unit square. The motivation, as noted in Jun et al. (2014), is that the fit of a spatial model is best tested by comparing its fit in distinct regions (where its local smoothness properties can be evaluated) with its fit to points distributed throughout the domain (where its global features can be evaluated). Subsets Q_1 and Q_2 provided clusters of locations that allow for the assessment of local model fit, whereas subset Q_3 provided locations spread across the whole domain for assessing global fit. Figure 1 shows the simulated locations and the corresponding subsets.
Three datasets were simulated using the spatio-temporal process defined in Sect. 2. The mean process, μ t , was set to zero, to allow for detection of model misspecification through the covariance structure only. Observed data { y t } were simulated for t = 1, . . . , 5 time points.
Three variants of the general Matérn correlation function with closed-form expressions were used to construct the covariance matrix σ²_ω R. The general Matérn correlation function is

R(s_i, s_j; ν, φ) = (1 / (2^{ν−1} Γ(ν))) (||s_i − s_j|| / φ)^ν K_ν(||s_i − s_j|| / φ),   (11)

where ν > 0 controls the smoothness of the realised random field, φ is a spatial scale parameter, K_ν is a modified Bessel function of order ν, and ||s_i − s_j|| is the Euclidean distance between the locations (Banerjee et al. 2014).
The first variant,

R(s_i, s_j; φ) = exp(−||s_i − s_j|| / φ),   (12)

is the closed form of the Matérn correlation function with smoothness parameter ν = 0.5, and is also known as the exponential correlation function. The second variant,

R(s_i, s_j; φ) = exp(−(||s_i − s_j|| / φ)²),   (13)

is the limiting form of the Matérn correlation function as ν → ∞, and is known as the Gaussian correlation function. The third variant (Eq. 14) is a non-stationary form of the exponential correlation function given by Eq. 12 that allows the correlation between observations separated by distance ||s_i − s_j|| to scale by their latitudes, s_2.
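The two stationary variants are straightforward to implement. The following Python sketch (illustrative, with an arbitrary scale value) builds both correlation matrices for a random set of locations and verifies numerically that each is positive semi-definite to within floating-point tolerance.

```python
import numpy as np

def exp_corr(d, phi):
    """Exponential correlation (Matérn with nu = 0.5, Eq. 12)."""
    return np.exp(-d / phi)

def gauss_corr(d, phi):
    """Gaussian correlation (Matérn limit as nu -> infinity, Eq. 13)."""
    return np.exp(-(d / phi) ** 2)

# distance matrix for a random set of locations in the unit square
rng = np.random.default_rng(4)
s = rng.uniform(size=(25, 2))
d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=2)

for corr in (exp_corr, gauss_corr):
    R = corr(d, 0.2)
    # eigenvalues should be non-negative up to rounding error
    assert np.all(np.linalg.eigvalsh(R) > -1e-8)
```

The Gaussian correlation matrix can be numerically near-singular for closely spaced locations, which is one practical reason a small nugget term σ²_ε is added to the marginal covariance.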
The following parameters were chosen to simulate the data y_t. The measurement variance (nugget variance) was set to σ²_ε = 0.0001; a small value was chosen to focus on identifying an incorrect spatio-temporal covariance structure. Further, we set ρ = 0.7 and σ²_ω = 1. Finally, we set φ = 0.2 in Eqs. 12 and 14, and φ = 0.8 in Eq. 13.
Markov chain Monte Carlo (MCMC) was used to fit the model to the data; this was done in R using the package NIMBLE (NIMBLE Development Team 2017). Two chains of 100,000 iterations each were generated for the parameter vector θ = (β, ρ, φ, σ²_ω, σ²_ε)' for each dataset. The first 90,000 iterations from each chain were discarded as warm-up, and the remaining draws were combined, resulting in a posterior sample of size M = 20,000. Convergence of the Markov chains was assessed using traceplots (not provided) and potential scale reduction factor (R̂) values; we took a value of R̂ > 1.1 to indicate lack of convergence. For each fitted model, pivotal quantities for every posterior sample were calculated in line with Eq. 7. We considered three cases of partitioning to assess the impact partitioning has on testing goodness-of-fit. In the first case, the locations were not partitioned into subsets: pivotal quantities S(y_t, θ^(m)) for t = 1, ..., 5 and m = 1, ..., 20,000 were calculated for each fitted model, then combined and ordered. In the second case, the locations were partitioned into C = 3 subsets of equal size w = 10, and pivotal quantities S_j(y_tj, θ^(m)) for t = 1, ..., 5, j = 1, 2, 3, and m = 1, ..., 20,000 were calculated, combined and ordered. Finally, the locations were partitioned into the subsets Q_1, Q_2, and Q_3 that were used to simulate the locations, and the pivotal quantities were calculated, combined and ordered in the same way. Table 1 displays the true parameter values used in the simulation of the data, as well as the summary statistics obtained from the posterior draws when the model was applied to the three data sets; R̂ values for assessing convergence are also shown. The results show that the summary statistics for the model applied to data set 1 are the closest to the true values.
The 95% credible intervals include the true values for β, φ, and σ 2 ω . Table 2 gives the 10th and 90th percentiles of the aggregated (over time and subset) ordered pivotal discrepancy measures for the model applied to each of the three data sets, in each of the three cases of subsetting. In order to confirm that the model provided a good fit, the quantities need to be within the interval given by the critical values that are calculated from the nominal χ 2 distributions assumed when they were calculated.
In the first case, when the locations were not partitioned for calculation of the pivotal quantities, the 10th and 90th percentiles of the ordered pivotal discrepancy quantities were within the corresponding nominal percentiles of (12.76, 56.33). This suggests that the model provided a good fit to each of the three simulated data sets. In the second case, when the locations were partitioned into three equal subsets, the 10th and 90th percentiles of the aggregated ordered pivotal discrepancy quantities were within the corresponding nominal percentiles of (1.827, 27.11) when the model was applied to data set 1, but outside the nominal percentiles when the model was applied to data sets 2 and 3. This suggests that the model provides a good fit only to data set 1. A similar result was observed for the final case, with the locations partitioned into three unequal subsets: the 10th and 90th percentiles of the aggregated ordered pivotal discrepancy quantities were within the corresponding nominal percentiles of (0.4894, 31.71) when the model was applied to data set 1, but outside the nominal percentiles for data sets 2 and 3. Again, this suggests that the model provides a good fit only to data set 1.
We expected the model to provide a good fit only to data set 1, because the model used to generate that data set is the same as the one being fitted. This is what was observed in the two partitioned cases. In the first (unpartitioned) case, the model appeared to provide a good fit to every data set, because the lack of partitioning reduced the power to detect the differences. This is highlighted in Figs. 2, 3 and 4, in which the pivotal discrepancy quantities from the model applied to each data set, in each case of partitioning, are plotted as densities and overlaid with the nominal densities. For each data set, when no partitioning occurs there is sufficient overlap between the observed pivotal quantities and the nominal densities to suggest the model provides a good fit. This is also the case in the partitioning scenarios for data set 1, but not for data sets 2 and 3.
For comparison, we calculated WAIC values when fitting the model with an exponential correlation function to all three datasets. The results in Table 2 show the smallest WAIC value for Dataset 3. This result is counterintuitive: compared to the other two datasets, Dataset 3 has the worst model mis-specification, and we would therefore expect it to have the highest WAIC value.

Hoki catch data from New Zealand sub-Antarctic survey
Research trawl surveys of the New Zealand sub-Antarctic region were carried out by the National Institute of Water and Atmospheric Research (NIWA) for the Ministry for Primary Industries, New Zealand (MPI). The survey design used was a stratified two-phase adaptive design optimised to reduce variance in biomass estimates (Francis 1984). The primary focus of the survey design was abundance estimation of a particular fish species, Macruronus novaezelandiae, commonly known as hoki. The hoki survey has been run on an annual or biennial basis since 1991 (Bagley et al. 2013; Fisheries New Zealand 2019). For our application, we focused on the trawls that occurred from 2000 to 2008 in order to have a continuous annual time series. Catch weights of all species caught during the survey were recorded. As the survey design was based on hoki, we focus our application on modelling spatio-temporal hoki catch weights. Figure 5 illustrates the stratification used and shows that the largest catch weights for hoki were recorded for trawls near Puysegur Bank. To obtain repeated measurements at the same locations annually, catch weight locations within a stratum were gridded. The strata were gridded in such a way that within each grid there was at least one catch weight observation per year. The median longitude and latitude of all observations within a grid were taken as the grid centroid, g. The 38 grid centroids are shown in Fig. 5. It should be noted that not all strata were used in the grid construction. For strata 25-28, there were years during which trawls did not occur, and those strata were excluded from the final dataset. To obtain a single observation per grid per year, a weighted mean of all observations within a grid in a given year was calculated. Observations were weighted by distance from the grid centre, with observations closer to the grid centre given more weight.
In addition, the weighted mean depth of each trawl within a grid was assigned as the depth for the entire grid, in the same fashion. Depth was used as a covariate for a selection of the models (shown below).
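The gridding step specifies only that observations closer to the grid centre receive more weight; the exact weight function is not stated. The sketch below therefore assumes inverse-distance weights as one concrete possibility, with the grid centroid taken as the median longitude and latitude of the observations in the cell, as in the text.

```python
import numpy as np

def grid_weighted_mean(obs, locs, centroid, eps=1e-6):
    """Weighted mean of observations in a grid cell, with weights inversely
    proportional to distance from the grid centroid. The inverse-distance
    form is an assumed scheme; the survey analysis only states that closer
    observations receive more weight."""
    d = np.linalg.norm(locs - centroid, axis=1)
    w = 1.0 / (d + eps)              # eps guards against a zero distance
    return np.sum(w * obs) / np.sum(w)

# three hypothetical observations in one grid cell (lon, lat pairs)
locs = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]])
centroid = np.median(locs, axis=0)   # grid centroid = median lon/lat
m = grid_weighted_mean(np.array([10.0, 12.0, 30.0]), locs, centroid)
```

Under this scheme the observation lying at the centroid dominates the cell mean, while the distant observation contributes very little.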
Three models of the form given in Eqs. 2–4 were fitted to the gridded hoki catch weight data. We let y_t = (y(g_1, t), ..., y(g_38, t))', where y(g_i, t) denotes the log-transformed weighted mean catch weight of hoki in grid g_i for year t, and n = 38. The marginal distribution of y_t given the parameters follows Eq. 5, with mean μ_t = 1_38 β for each model, where 1_38 is a 38-vector of ones. The three models were distinguished by the correlation structure assumed for R. For model M1 we used the exponential correlation function, given by Eq. 12. For model M2 we used the Gaussian correlation function, given by Eq. 13. For model M3, we used the general Matérn correlation function given in Eq. 11. For each model, the parameters were assumed a priori independent and were assigned the non-informative prior distributions β ~ N(0, 100), σ²_ε ~ IG(2, 1), σ²_ω ~ IG(2, 1), φ ~ U(0.01, 1000), ρ ~ U(−1, 1). Further, for model M3, the smoothness parameter ν was assumed a priori independent of the other parameters and was assigned the non-informative prior distribution ν ~ U(0.01, 10). MCMC sampling of the joint posterior density was used to fit the three models to the gridded hoki data; this was done in R using the package NIMBLE (NIMBLE Development Team 2017). Two chains, each with 1,000,000 iterations, were generated for the parameter vector θ = (β, ρ, φ, σ²_ω, σ²_ε)' for models M1 and M2, and θ = (β, ρ, φ, σ²_ω, σ²_ε, ν)' for model M3. The first 900,000 iterations from each chain were discarded as warm-up, and the remaining draws were combined, resulting in a posterior sample of size M = 200,000. Convergence of the Markov chains was assessed using traceplots (not provided) and potential scale reduction factor (R̂) values. Table 3 gives summary statistics for the posterior distributions and R̂ values computed for each model. There appears to be agreement between the three models on the values of most of the parameters.
For model M3, the smoothness parameter ν needed for the Matérn correlation structure has a posterior median of 0.5156, which is close to 0.5; the fitted correlation structure is therefore close to exponential. For each fitted model, pivotal quantities for every posterior sample were calculated using Eq. 7. The grid locations were partitioned into five subsets using the K-means clustering algorithm; the optimal number of subsets was found using the elbow method (Kodinariya and Makwana 2013). Pivotal quantities S_j(y_tj, θ^(m)) for t = 1, ..., 5, j = 1, ..., 5, and m = 1, ..., 200,000 were calculated for each fitted model, then combined and ordered. The 10th and 90th percentiles of the nominal chi-square distribution were calculated according to Eqs. 9 and 10, and were found to be 0.133 and 28.6 respectively. The 10th and 90th percentiles of the ordered pivotal discrepancy measures for models M1, M2 and M3 are given in Table 4. For each model, the 10th percentile was higher than that of the nominal distribution and the 90th percentile was lower than that of the nominal distribution. As a result, it can be said that each model provided a good fit to the gridded hoki data. This is reflected in Fig. 6, which shows the posterior densities of each parameter for the three models, as well as the density of the ordered pivotal discrepancy measures, the latter overlaid with the nominal distribution density. The posterior densities for β, ρ, σ²_ε, and σ²_ω are similar in shape and overlap between models, with model M2 differing the most. For the parameter φ, there is a difference between models M1 and M3 on the one hand, and M2 on the other.
We conclude that models M1 and M3 are very similar, and that all three models provide a similar fit to the data, with the only differences being due to the correlation structure parameters. The densities of the observed pivotal discrepancy measures for each model support the conclusion that each model provides a good fit: there is sufficient overlap between the pivotal discrepancy measure densities and their nominal densities that the test cannot detect a difference. As was done for the simulated data, we provide WAIC values for the three models for comparison. The overall conclusions based on WAIC are similar to those reached using the pivotal discrepancy measures, in that models M1 and M3 provide a similar fit, while model M2 differs from the other two.

Discussion
From the simulation study, it is clear that partitioning is necessary, and furthermore, that the number of observations within a partition need not be constant. We found that the goodness-of-fit test based on pivotal discrepancy measures was unable to identify model mis-specification when there was no partitioning. Further, when the number of observations within a partition was not constant, the distribution of the ordered pivotal discrepancy measures was wider, making it nominally more difficult to reject the null hypothesis that the model provided a good fit; nevertheless, the test still correctly detected the mis-specified models in this case.
The choice of how to partition the data should be considered carefully. In the simulation study in this paper, the subsets were chosen sensibly, in that we partitioned the observations according to the subsets that were used to generate the data. In practice, this will not necessarily be known, and an objective method should be developed. In the case study, we showed that using the K-means clustering algorithm is a suitable objective approach to partitioning. Jun et al. (2014) suggested that partitions of the dataset need neither be disjoint nor represent a complete partition. Thus other suitable clustering methods could be used. Density-based or fuzzy clustering methods are two of the alternatives to K-means clustering. Density-based algorithms create clusters based on the density of data points, with regions that have larger numbers of points considered as clusters. This would partition the spatial domain into regions of higher and lower spatial autocorrelation. Fuzzy clustering creates clusters that overlap, thus allowing points to belong to more than one cluster. Such an approach may be suitable where there is an even spread of points across the spatial domain. We made the assumption that the covariance matrix is separable over time and space, that is the overall spatial covariance structure is constant over time. Future research could investigate use of different clustering algorithms and use of our approach for a covariance matrix that is non-separable in space and time.
A final consideration is how to select the best model from competing models fitted to the same data. The goodness-of-fit test based on pivotal discrepancy measures currently offers no way to select a best model, providing only a decision-based test. Jun et al. (2014) and Johnson (2007) briefly discuss calculating bounds on Bayesian p-values, which may offer an appropriate route to model selection.
In conclusion, we have developed a general goodness-of-fit test for Bayesian spatiotemporal models using partitioning and pivotal discrepancy measures. The test was successful in simulation as well as application to New Zealand hoki data.