We propose a novel two-stage approach for jointly estimating the parameters from (1)–(3), denoted by \(\varvec{\varTheta }=(\varvec{\beta },\varvec{\delta }, \sigma ^2, \alpha , \varvec{\phi }_1,\ldots ,\varvec{\phi }_N, \rho _1,\ldots ,\rho _N, \tau _1^{2},\ldots ,\tau _N^{2})\), and an appropriate neighbourhood matrix (or matrices) for the data, which extends the current approach of using \({\mathbf {W}}\) constructed from the border sharing rule. We propose approaches for estimating both static (\({\mathbf {W}}_E\)) and time-varying (\({\mathbf {W}}_{E_{t}}\)) neighbourhood matrices: for the former a single \({\mathbf {W}}_E\) is used in (3) for all time periods t, while for the latter a separate \({\mathbf {W}}_{E_{t}}\) is used in (3) for each time period t. In stage 1 we estimate \({\mathbf {W}}_{E}\) or \(({\mathbf {W}}_{E_1},\ldots ,{\mathbf {W}}_{E_N})\) using a graph-based optimisation algorithm, and in stage 2 we estimate the posterior distribution
\(f(\varvec{\varTheta } |{\mathbf {W}}_{E}, {\mathbf {Y}})\) or \(f(\varvec{\varTheta } |{\mathbf {W}}_{E_1},\ldots ,{\mathbf {W}}_{E_N}, {\mathbf {Y}})\) conditional on the estimated neighbourhood matrices. Our methodology thus brings areal unit modelling into line with standard practice in geostatistical modelling, which is to first estimate a trend model and then identify an appropriate autocorrelation structure via residual analysis.
Stage 1: Estimating \({\mathbf {W}}_{E}\) or \(({\mathbf {W}}_{E_1},\ldots ,{\mathbf {W}}_{E_N})\)
Estimating the residual spatial structure
The random effects \(\varvec{\phi }_{t}\) model the residual variation in the data at time t after the effects of the covariates have been removed, so the first step to estimating \({\mathbf {W}}_{E}\) or \(({\mathbf {W}}_{E_1},\ldots ,{\mathbf {W}}_{E_N})\) is to estimate this residual structure in the data. The count data model (1) has expectation \({\mathbb {E}}[Y_{kt}]=e_{kt}\exp ({\mathbf {x}}_{kt}^{\top }\varvec{\beta } + \phi _{kt} + \delta _t)\), which can be re-arranged to give
$$\begin{aligned} \hat{\phi }_{kt}=\ln \left( \frac{{\mathbb {E}}[Y_{kt}]}{e_{kt}}\right) -{\mathbf {x}}_{kt}^{\top }\varvec{\beta } - \delta _t ~\approx ~\ln \left( \frac{Y_{kt}}{e_{kt}}\right) -{\mathbf {x}}_{kt}^{\top }\hat{\varvec{\beta }}. \end{aligned}$$
(5)
The latter approximation replaces the unknown \({\mathbb {E}}[Y_{kt}]\) with the observed data \(Y_{kt}\), and the temporal random effects \(\{\delta _t\}\) are removed because they are constant over space and hence do not affect the estimation of the spatial correlation structure. The regression parameters \(\varvec{\beta }\) are estimated in this initial stage from a simpler model with no random effects (i.e. \(\{\phi _{kt}, \delta _t\}\) are removed from (1)) using maximum likelihood estimation, and are denoted by \(\hat{\varvec{\beta }}\). We consider two cases for how to use these residuals \(\{\hat{\phi }_{kt}\}\) in our graph-based optimisation algorithm.
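As an illustration, the following Python sketch computes these approximate residuals. It is a minimal sketch under stated assumptions: the array names Y, E and X are ours (not from the paper), X is assumed to already include an intercept column, and zero counts, for which \(\ln (Y_{kt}/e_{kt})\) is undefined, are left to the user to handle.

```python
import numpy as np
import statsmodels.api as sm

def stage1_residuals(Y, E, X):
    """Approximate residuals phi_hat from (5).

    Y, E: (K, N) arrays of observed and expected counts;
    X: (K, N, p) array of covariates (assumed to include an intercept).
    All argument names are illustrative.
    """
    K, N = Y.shape
    p = X.shape[2]
    # Fit the covariate-only Poisson model (no random effects) by maximum
    # likelihood, pooling all K*N observations and using log(e_kt) as an offset.
    fit = sm.GLM(Y.reshape(-1), X.reshape(-1, p),
                 family=sm.families.Poisson(),
                 offset=np.log(E).reshape(-1)).fit()
    beta_hat = fit.params
    # phi_hat_kt = ln(Y_kt / e_kt) - x_kt' beta_hat; zero counts make the log
    # undefined and would need a continuity correction in practice.
    return np.log(Y / E) - X @ beta_hat
```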
Case A: Static \({\mathbf {W}}_{E}\)
If the residual spatial surfaces are similar over time, then estimating a single \({\mathbf {W}}_{E}\) common to all time periods is appropriate. In this case we estimate a single residual spatial surface by averaging the residuals over the N time periods, that is
$$\begin{aligned} {\tilde{\phi }}_k=(1/N)\sum _{t=1}^{N}\hat{\phi }_{kt}\quad k=1,\ldots ,K, \end{aligned}$$
(6)
and use this single residual spatial surface in our graph-based optimisation algorithm.
Case B: Temporally varying \(({\mathbf {W}}_{E_1},\ldots ,{\mathbf {W}}_{E_N})\)
If the residual spatial surface evolves significantly over time, then an appropriate neighbourhood structure will also evolve over time. The simplest approach is to apply the graph-based optimisation algorithm to the residuals \((\hat{\phi }_{1t},\ldots , \hat{\phi }_{Kt})\) from (5) separately for each time period t, yielding a separate matrix \({\mathbf {W}}_{E_t}\) for each time period. However, as we show in the simulation study (Sect. 4.2), multiple realisations of the residual spatial surface are required to estimate \({\mathbf {W}}_{E_t}\) well, which makes this simple approach inappropriate. Therefore we instead estimate the residual spatial surface for time t using a \(2q+1\) time period moving average of the residuals from (5), that is
$$\begin{aligned} {\tilde{\phi }}_{k}=\frac{1}{2q+1}\sum _{r=t-q}^{t+q}\hat{\phi }_{kr}\quad k=1,\ldots ,K, \end{aligned}$$
(7)
with appropriate adjustments for the end time periods. For example, if \(q=1\) then for \(t=1\), \({\tilde{\phi }}_{k}=(1/3)\sum _{r=1}^{3}\hat{\phi }_{kr}\) and for \(t=N\), \({\tilde{\phi }}_{k}=(1/3)\sum _{r=N-2}^{N}\hat{\phi }_{kr}\). Thus in the static case A we estimate \({\mathbf {W}}_E\) from a set of spatial residuals \(\tilde{\varvec{\phi }}=({\tilde{\phi }}_1,\ldots , {\tilde{\phi }}_K)\) computed from all the data, while in the time-varying case B we estimate \({\mathbf {W}}_{E_t}\) separately for each time period t using a set of spatial residuals \(\tilde{\varvec{\phi }}=({\tilde{\phi }}_{1},\ldots , {\tilde{\phi }}_{K})\) that is computed separately for each year t.
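Continuing the sketch above (phi_hat is the K by N residual matrix returned there, and indices are 0-based in the code), Cases A and B reduce to simple averages of columns of the residual matrix; the end-period adjustment below shifts the window inwards, matching the \(q=1\) examples just given.

```python
def static_surface(phi_hat):
    """Case A: average the residuals over all N time periods, as in (6)."""
    return phi_hat.mean(axis=1)

def moving_average_surface(phi_hat, t, q=1):
    """Case B: (2q+1)-period moving average centred at 0-based time t, as in
    (7), with the window shifted to stay inside the observed periods."""
    K, N = phi_hat.shape
    start = max(0, min(t - q, N - (2 * q + 1)))
    return phi_hat[:, start:start + 2 * q + 1].mean(axis=1)
```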
Deriving an objective function to optimise
The CAR model (3) can be represented by a graph G whose vertex-set V(G) is the set of K areal units, and whose edge-set \(E(G)=\{(k,j)|w_{kj}=1\}\) is a subset of unordered pairs of elements of V(G). In graph theoretic terms G is the simple graph with adjacency matrix \({\mathbf {W}}=(w_{kj})\) defined by the border sharing rule, and the graph for the Scotland respiratory disease study presented in Sect. 5 is shown in panel (a) of Fig. 1. Given \(\tilde{\varvec{\phi }}\) we estimate \({\mathbf {W}}_{E}\) or \({\mathbf {W}}_{E_t}\) by searching for a suitable subgraph of G that maximises the value of an objective function \(J(\tilde{\varvec{\phi }})\), and the subgraph corresponding to \({\mathbf {W}}_{E}\) that was estimated for the Scotland study is presented in panel (b) of Fig. 1. This estimated graph has 47% fewer edges than the border sharing graph, and 90% of the vertices have had at least one edge removed.
We base the objective function on the natural log of the product of full conditional distributions \(f({\tilde{\phi }}_{k}|\tilde{\varvec{\phi }}_{-k})\) from (3) over all spatial units \(k=1,\ldots ,K\). We additionally enforce the restriction that \(\rho _t=1\), because from (4) this ensures strong spatial autocorrelation globally that can be altered locally by removing edges when estimating \({\mathbf {W}}_E\) or \({\mathbf {W}}_{E_t}\). These removed edges correspond to boundaries in the random effects surface, because if \(w_{E_{kj}}=0\) the corresponding random effects are conditionally independent. Thus, dropping the subscript t for notational simplicity (as we are dealing with a purely spatial objective function) and fixing \(\rho =1\) in (3) as described above, we obtain the following objective function:
$$\begin{aligned} J(\tilde{\varvec{\phi }})= & {} \ln \left[ \prod _{k=1}^{K}f({\tilde{\phi }}_{k}|\tilde{\varvec{\phi }}_{-k})\right] \nonumber \\\propto & {} -\frac{K}{2}\ln \left( \tau ^{2}\right) + \frac{1}{2}\sum _{k=1}^{K}\ln \left( \sum _{j=1}^{K}w_{kj}\right) \nonumber \\- & {} \frac{1}{2\tau ^{2}}\sum _{k=1}^{K}\left( \sum _{j=1}^{K}w_{kj}\right) \left( {\tilde{\phi }}_k - \frac{\sum _{r=1}^{K}w_{kr}{\tilde{\phi }}_{r}}{\sum _{r=1}^{K}w_{kr}}\right) ^{2}. \end{aligned}$$
(8)
We base the objective function on a product of full conditional distributions rather than the joint distribution for \(\tilde{\varvec{\phi }}\), because when \(\rho =1\) the latter is not a proper distribution as its precision matrix is singular. One could use the joint probability density function up to a proportionality constant, but this leads to all edges being removed by the algorithm; details are given in Sect. 2 of the supplementary information. As (8) depends on the unknown variance parameter \(\tau ^2\), we profile it out by maximising \(J(\tilde{\varvec{\phi }})\) with respect to \(\tau ^2\); setting \(\partial J(\tilde{\varvec{\phi }})/\partial \tau ^{2}=0\) and solving gives \(\hat{\tau }^2= \sum _{k=1}^{K}\left( \sum _{j=1}^{K}w_{kj}\right) \left( {\tilde{\phi }}_k - \frac{\sum _{r=1}^{K}w_{kr}{\tilde{\phi }}_{r}}{\sum _{r=1}^{K}w_{kr}}\right) ^{2} \Bigg / K\). This estimator is then plugged into (8) to yield the final objective function
$$\begin{aligned}&J(\tilde{\varvec{\phi }})\propto \frac{1}{2}\sum _{k=1}^{K}\ln \left( \sum _{j=1}^{K}w_{kj}\right) \nonumber \\&\quad -\frac{K}{2}\ln \left[ \sum _{k=1}^{K}\left( \sum _{j=1}^{K}w_{kj}\right) \left( {\tilde{\phi }}_k - \frac{\sum _{r=1}^{K}w_{kr}{\tilde{\phi }}_{r}}{\sum _{r=1}^{K}w_{kr}}\right) ^{2}\right] . \end{aligned}$$
(9)
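For concreteness, (9) can be evaluated directly from an adjacency matrix, as in the following sketch; the function and argument names are ours, and the matrix is assumed to satisfy the positive row sum constraint discussed below.

```python
import numpy as np

def objective(W, phi):
    """Objective (9), up to an additive constant.

    W: (K, K) symmetric 0/1 adjacency matrix with every row sum positive;
    phi: length-K vector of spatial residuals (phi-tilde in the text)."""
    K = len(phi)
    d = W.sum(axis=1)                       # degrees  sum_j w_kj
    nbr_mean = (W @ phi) / d                # neighbourhood means
    S = np.sum(d * (phi - nbr_mean) ** 2)   # weighted discrepancy sum
    return 0.5 * np.sum(np.log(d)) - 0.5 * K * np.log(S)
```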
Graph-based optimisation
Let H be generic notation for any graph; we then use the following graph theoretic terminology in this section: (i) we write uv for the edge \(\{u,v\}\) with endpoints u and v; (ii) an edge \(e \in E(H)\) is said to be incident with a vertex \(v \in V(H)\) if v is an endpoint of e; (iii) the number of edges in H incident with any single vertex v, written \(d_H(v)\), is called the degree of v in H; (iv) we write \(N_H(v)\) for the set \(\{u \in V(H) \setminus \{v\}: uv \in E(H)\}\) of neighbours of v in H; (v) a graph \(H'\) is a subgraph of H if \(V(H') \subseteq V(H)\) and \(E(H') \subseteq E(H)\); and (vi) if \(H'\) is a subgraph of H and these two graphs also have the same vertex set, we say that \(H'\) is a spanning subgraph of H.
The graph G based on \({\mathbf {W}}\) has vertex-set V(G) and edge-set E(G), and we assume that edges \(e \in E(G)\) can be removed from the graph but that new edges cannot be added. This means that one can estimate \(w_{E_{kj}}\in \{0,1\}\) if \(w_{kj}=1\), but if \(w_{kj}=0\) then \(w_{E_{kj}}\) remains fixed at zero. Additionally, we assume that each area (vertex) must retain at least one edge in the graph, which corresponds to the constraint \(\sum _{j=1}^{K}w_{E_{kj}}>0\) for all k. This ensures that we do not divide by 0 in (9). Let \(f(H,\tilde{\varvec{\phi }})\) denote the value of \(J(\tilde{\varvec{\phi }})\) corresponding to \({\mathbf {W}}_H\), the adjacency matrix of the subgraph H of G. Then the goal of our optimisation problem can be phrased as finding a spanning subgraph \({\tilde{G}}\) of G, with minimum degree at least one, which maximises \(f({\tilde{G}}, \tilde{\varvec{\phi }})\).
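The search sketches later in this section work with a neighbour-set view of the graph rather than the matrix itself. A minimal, hypothetical converter, which also checks the minimum-degree constraint just described, might look as follows:

```python
def W_to_nbrs(W):
    """Convert a 0/1 adjacency matrix into {vertex: set of neighbours},
    enforcing the constraint that every area retains at least one edge."""
    K = W.shape[0]
    nbrs = {v: {u for u in range(K) if u != v and W[v, u] == 1}
            for v in range(K)}
    if any(not s for s in nbrs.values()):
        raise ValueError("every area must retain at least one neighbour")
    return nbrs
```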
This graph optimisation problem is known to be NP-hard (Enright et al. 2021), and so is extremely unlikely to admit an exact algorithm which will terminate in polynomial time on all possible inputs. Moreover, this intractability result holds even if we assume that the input graph G is planar; our input graph is necessarily planar because it is derived from the adjacencies of non-overlapping regions in the plane. In this work we therefore adopt a heuristic local search approach, which we emphasise is not guaranteed to find the global optimal solution. We leave a more in-depth study of the existence or otherwise of algorithms with provable performance guarantees for future work.
A brute force optimisation strategy would consider all possible subsets of edges to delete (a number exponential in the number of edges in the original graph), and choose the one which maximises the objective function. However, such a running time is already infeasible for our relatively small simulation study example, which has 671 edges. To avoid this, we instead obtain an improved matrix \({\mathbf {W}}_{E}\) by carrying out a much faster sequence of local optimisation operations.
While many heuristic graph searching algorithms exist, we were unable to identify any existing approaches which can be applied directly to this optimisation problem. The objective function (9) considered here has the unusual feature that it contains both a log-of-sums over vertices and a sum-of-logs: the way in which these interact makes it trivial to find examples where an optimal subgraph is no longer optimal once a disconnected and isolated edge is added elsewhere in the graph. This subtlety prevents existing exact and local search methods from being applied off the shelf. Additionally, the nature of the objective function rules out any heuristic that relies on a matrix representation of the objective function, as well as other common techniques from operational research.
The starting point for our bespoke heuristic local optimisation is the following fairly standard approach. We consider the vertices of the graph in some fixed order, and attempt to optimise the set of edges incident with each vertex in turn. Since the effect of deleting the edge uv depends on the set of edges incident at both u and v, we have to decide whether or not to retain each edge incident with v without knowing precisely what effect this will have (as the neighbourhood of any neighbour u of v may not yet be fixed). We therefore decide whether or not to delete an edge by considering the difference between the contribution to the objective function from u (respectively v) from the best possible set of incident edges at u (respectively v) that does include the edge uv, and the best possible set that does not include this edge.
In order to apply this strategy, we need to express the objective function as a sum of contributions associated with each vertex of the graph, so that we can assess the impact of making local changes associated with an individual vertex; the main novelty of our approach lies in this derivation of a suitable bound on the contribution from each vertex that can be computed locally. As a first step, we reformulate (9) in more graph theoretic notation. To do this, we set \(V = V(G)\) (observing that we use the same vertex set throughout), and note that \(|V| = K\). For the vertex v corresponding to region k in the matrix, we set \({\tilde{\phi }}_v = {\tilde{\phi }}_k\). This gives
$$\begin{aligned} f(H,\tilde{\varvec{\phi }})\propto & {} \frac{1}{2}\sum _{v \in V} \ln \left( d_H(v)\right) \nonumber \\- & {} \frac{K}{2}\ln \left[ \sum _{v \in V} d_H(v) \left( {\tilde{\phi }}_v - \frac{\sum _{u \in N_H(v)} {\tilde{\phi }}_u}{d_H(v)}\right) ^{2}\right] .\nonumber \\ \end{aligned}$$
(10)
To simplify notation, we will write \({{\,\mathrm{ND}\,}}_H(v,\tilde{{\phi }})\) for the neighbourhood discrepancy defined as
$$\begin{aligned} \left( {\tilde{\phi }}_v - \frac{\sum _{u \in N_H(v)} {\tilde{\phi }}_u}{d_H(v)}\right) ^{2}. \end{aligned}$$
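In code, the neighbourhood discrepancy is a one-liner over the neighbour-set representation introduced earlier; this is a sketch, and `nd` is our name rather than notation from the paper.

```python
def nd(phi, nbrs, v):
    """Neighbourhood discrepancy ND_H(v, phi) of vertex v in the graph H
    encoded by the neighbour sets nbrs (as built by W_to_nbrs above)."""
    nbr_mean = sum(phi[u] for u in nbrs[v]) / len(nbrs[v])
    return (phi[v] - nbr_mean) ** 2
```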
It is now clear that, to maximise the right-hand side of (10), on the one hand we would like to retain as many edges as possible to maximise the first term, while on the other hand we would like to delete edges to decrease the neighbourhood discrepancy at each vertex and hence minimise the second term. We can now associate with a given vertex v the following contribution, \({{\,\mathrm{cont}\,}}(v,H,\tilde{\varvec{\phi }})\), to the right-hand side of (10):
$$\begin{aligned}&{{\,\mathrm{cont}\,}}(v,H,\tilde{\varvec{\phi }}) \\&\quad := \frac{\ln (d_H(v))}{2} - \frac{K}{2} \ln \left[ \sum _{w \in V} d_H(w){{\,\mathrm{ND}\,}}_H(w,\tilde{{\phi }})\right] \\&\qquad + \frac{K}{2}\ln \left[ \sum _{w \in V \setminus \{v\}} d_H(w) {{\,\mathrm{ND}\,}}_H(w,\tilde{{\phi }})\right] \\&\quad = \frac{\ln (d_H(v))}{2}\\&\qquad - \frac{K}{2} \ln \left[ \sum _{w \in V \setminus \{v\}} d_H(w){{\,\mathrm{ND}\,}}_H(w,\tilde{{\phi }}) + d_H(v){{\,\mathrm{ND}\,}}_H(v,\tilde{{\phi }})\right] \\&\qquad + \frac{K}{2} \ln \left[ \sum _{w \in V \setminus \{v\}} d_H(w){{\,\mathrm{ND}\,}}_H(w,\tilde{{\phi }})\right] \\&\quad = \frac{\ln (d_H(v))}{2} - \frac{K}{2} \ln \left[ 1 + \frac{d_H(v){{\,\mathrm{ND}\,}}_H(v,\tilde{{\phi }})}{\sum _{w \in V \setminus \{v\}} d_H(w){{\,\mathrm{ND}\,}}_H(w,\tilde{{\phi }})}\right] . \end{aligned}$$
We then have that \(f(H,\tilde{\varvec{\phi }}) \propto \sum _{v \in V} {{\,\mathrm{cont}\,}}(v,H,\tilde{\varvec{\phi }})\). The remaining barrier to using this expression to carry out locally optimal modifications is that the value of \(\sum _{w \in V \setminus \{v\}} d_H(w) {{\,\mathrm{ND}\,}}_H(w,\tilde{{\phi }})\) depends on the entire graph, not just the edges incident with v, so we cannot compute the value of \({{\,\mathrm{cont}\,}}(v,H,\tilde{\varvec{\phi }})\) knowing only the neighbours of v in H. To deal with this, we define the adjusted contribution of v in H, with respect to a second graph \(H'\):
$$\begin{aligned}&{{{\,\mathrm{adjcont}\,}}_{H'}}(v,H,\tilde{\varvec{\phi }}) := \frac{\ln (d_H(v))}{2}\\&\qquad - \frac{K}{2} \ln \left[ 1 + \frac{d_H(v){{\,\mathrm{ND}\,}}_H(v,\tilde{{\phi }})}{\sum _{w \in V} d_{H'}(w){{\,\mathrm{ND}\,}}_{H'}(w,\tilde{{\phi }}) - d_H(v){{\,\mathrm{ND}\,}}_H(v,\tilde{{\phi }})}\right] . \end{aligned}$$
Observe that, if H is a spanning subgraph of \(H'\), we have \(\sum _{v \in V} \ln (d_H(v)) \le \sum _{v \in V} \ln (d_{H'}(v))\) and so, if \(f(H,\tilde{\varvec{\phi }}) > f(H',\tilde{\varvec{\phi }})\), we must have
$$\begin{aligned} \sum _{w \in V \setminus \{v\}}&d_H(w){{\,\mathrm{ND}\,}}_H(w,\tilde{{\phi }})\\ <&\sum _{w \in V} d_{H'}(w){{\,\mathrm{ND}\,}}_{H'}(w,\tilde{{\phi }}) - d_H(v){{\,\mathrm{ND}\,}}_H(v,\tilde{{\phi }}). \end{aligned}$$
This tells us that, if \({{\,\mathrm{adjcont}\,}}_H(v,H \setminus \{e\},\tilde{\varvec{\phi }})\) is strictly greater than \({{\,\mathrm{adjcont}\,}}_H(v,H,\tilde{\varvec{\phi }})\), then the contribution at v is still increased by deleting e even when deletions are also carried out elsewhere in the graph to decrease the weighted sum of neighbourhood discrepancies.
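Once the reference sum \(\sum _{w \in V} d_{H'}(w){{\,\mathrm{ND}\,}}_{H'}(w,\tilde{{\phi }})\) has been precomputed, the adjusted contribution depends only on quantities local to v. A sketch, reusing the `nd` helper above and with our own argument names:

```python
import math

def adjcont(phi, nbrs, v, K, S_ref):
    """Adjusted contribution of v in the graph encoded by nbrs, with respect
    to a reference graph H' whose weighted discrepancy sum
    S_ref = sum_w d_H'(w) * ND_H'(w, phi) has been precomputed."""
    d_v = len(nbrs[v])
    term_v = d_v * nd(phi, nbrs, v)
    return (0.5 * math.log(d_v)
            - 0.5 * K * math.log(1 + term_v / (S_ref - term_v)))
```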
These observations motivate our iterative approach. At the first step we consider the first vertex v and use the original graph G to identify a set of edges incident with v to delete (by considering the best possible adjusted contribution with respect to G that can be achieved at both endpoints of the edges in question). We then delete these edges to obtain a new graph \(G'\) and continue with the next vertex, this time considering the adjusted contribution with respect to \(G'\). We continue in this way, returning to the start of the vertex list when we reach the end, until we complete a pass through all remaining feasible vertices (that is, those which still have more than one neighbour in the modified graph) without identifying any deletions that increase the objective function.
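A deliberately simplified variant of this search is sketched below: it tests single-edge deletions against the full objective (9) rather than enumerating neighbour subsets via the adjusted contributions, so it is not the paper's Algorithm 1, which remains the authoritative statement. It reuses `nd` from above and assumes vertices are orderable (e.g. integers).

```python
import math

def greedy_edge_deletion(phi, nbrs):
    """Repeatedly delete any single edge whose removal increases the
    objective (9), subject to every vertex keeping at least one neighbour."""
    K = len(nbrs)

    def J():
        S = sum(len(nbrs[w]) * nd(phi, nbrs, w) for w in nbrs)
        return (0.5 * sum(math.log(len(nbrs[w])) for w in nbrs)
                - 0.5 * K * math.log(S))

    current, improved = J(), True
    while improved:
        improved = False
        for u, v in [(u, v) for u in nbrs for v in nbrs[u] if u < v]:
            if len(nbrs[u]) > 1 and len(nbrs[v]) > 1:  # min-degree constraint
                nbrs[u].remove(v); nbrs[v].remove(u)   # tentative deletion
                new = J()
                if new > current:
                    current, improved = new, True      # keep the deletion
                else:
                    nbrs[u].add(v); nbrs[v].add(u)     # undo
    return nbrs
```

Testing each deletion against the full objective keeps this sketch short but costs a pass over all vertices per evaluation; the locally computable adjusted contributions above are what allow the actual algorithm to avoid this expense.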
The algorithm is summarised in pseudocode as Algorithm 1 in the appendix. We note that the running time depends exponentially on the maximum degree (as we consider all possible subsets of neighbours to retain at each vertex in order to identify the “best” possible neighbourhood), but only polynomially on the number of edges. It is not unreasonable to expect that the maximum degree will in practice be small compared with the total number of vertices or edges: it is unlikely that any one areal unit will border a very large number of other units (in our simulation study example the maximum degree is 22). Software to implement the optimisation is available in the R spatio-temporal modelling package CARBayesST (Lee et al. 2018). We discuss the performance of this algorithm (and several variations) on our example inputs in the appendix.
Stage 2: Estimating \(\varvec{\varTheta }\) given \({\mathbf {W}}_{E}\) or \(({\mathbf {W}}_{E_1},\ldots ,{\mathbf {W}}_{E_N})\)
We fit model (1)–(3) with \({\mathbf {W}}_{E}\) or \(({\mathbf {W}}_{E_1},\ldots ,{\mathbf {W}}_{E_N})\) replacing \({\mathbf {W}}\) in a Bayesian setting using integrated nested Laplace approximations (INLA, Rue et al. 2009). We use INLA due to its computational speed in fitting the models, but we could have used Markov chain Monte Carlo (MCMC) simulation methods, for example using the CARBayesST package.