Improved inference for areal unit count data using graph-based optimisation

Spatial correlation in areal unit count data is typically modelled by a set of random effects that are assigned a conditional autoregressive (CAR) prior distribution. The spatial correlation structure implied by this model depends on a binary neighbourhood matrix, where two random effects are modelled as partially autocorrelated if their areal units share a common border, and are conditionally independent otherwise. This paper proposes a novel graph-based optimisation algorithm for estimating the neighbourhood matrix from the data, by viewing the areal units as the vertices of a graph and the neighbour relations as its set of edges. The superiority of our methodology compared to using the border sharing rule is comprehensively evidenced by simulation, before the method is applied to a new respiratory disease surveillance study in the Greater Glasgow and Clyde Health Board in Scotland between 2011 and 2017.


Introduction
Spatio-temporal count data relating to a set of K non-overlapping areal units for N consecutive time periods are prevalent in many fields, including epidemiology (Stoner et al., 2019) and social science (Bradley et al., 2016). The spatial correlation in these data is typically modelled by conditional autoregressive (CAR, Besag et al., 1991) models, which are specified as a prior distribution for a set of random effects within a hierarchical model. A large volume of research has extended this class of models to the spatio-temporal domain, including capturing: spatially correlated linear time trends (Bernardinelli et al., 1995); time period specific spatially correlated surfaces (Waller et al., 1997); and a temporally evolving spatial surface (Rushworth et al., 2014).
The spatial autocorrelation structure implied by these spatio-temporal CAR models depends on a K × K neighbourhood matrix W, which specifies which pairs of areal units are close together in space. A binary specification is typically adopted, where w_kj = 1 if areal units (k, j) share a common border (are spatially close), w_kj = 0 otherwise, and w_kk = 0 for all k. CAR models treat data in neighbouring areal units (k, j) (those with w_kj = 1) as partially autocorrelated, while data relating to non-neighbouring areal units (those with w_kj = 0) are assumed to be conditionally independent. Thus while the spatial autocorrelation structure implied by these CAR models depends on W, the appropriateness of the choice of W for the data at hand, and the sensitivity of the results to changing its specification, are rarely acknowledged or assessed in the modelling. This is in sharp contrast to the related field of geostatistics for point-level data, where variogram analysis is routinely used to identify an appropriate spatial correlation structure for the data, such as assessing the validity of isotropy and choosing an appropriate parametric autocovariance model. Furthermore, specifying W based on the simple border sharing rule is unlikely to provide an appropriate correlation structure for the count data under study, because spatial correlation is unlikely to be present universally throughout the study region. Instead, there will be pairs of neighbouring areal units that exhibit large differences between their data values, which can be driven by complex environmental and/or social processes. Examples that illustrate this phenomenon include the fields of spatial clustering (Knorr-Held and Raßer, 2000) and boundary analysis (Lee and Mitchell, 2012), where identifying the locations of these step-changes is of primary interest.
Numerous approaches have been proposed for identifying spatial step-changes in areal unit count data, including specifying piecewise constant mean models (e.g. Knorr-Held and Raßer, 2000), and modelling elements in W that correspond to adjacent areal units as Bernoulli random variables (Ma et al., 2010). The latter approach allows one to estimate the spatial partial autocorrelation structure in the data, but it suffers from parameter identifiability problems because there are many more elements in W to estimate than there are areal units (data points). A partial solution is to propose a simple parametric model for the elements in W based on covariate information as in Lee and Mitchell (2012), but the estimation of W is then restricted by the parametric nature of the model. Therefore this paper proposes a novel graph-based optimisation algorithm for estimating an appropriate neighbourhood matrix W^E for the data, which overcomes the two parameterisation issues highlighted above. The estimation of W^E is based on an initial graph G, where the K areal units comprise the vertex-set V(G), and the edge-set E(G) is defined by W via E(G) = {(k, j) | w_kj = 1} (so W is the adjacency matrix of G). The algorithm estimates whether each edge in the graph should be removed or not, with the mild restriction that every vertex must retain at least one incident edge. Our estimation algorithm has two stages, the first of which estimates W^E from the data after covariate effects have been accounted for, which is akin to using variogram analysis on detrended geostatistical data to estimate an appropriate correlation structure. The second stage fits a Poisson log-linear model with spatio-temporally correlated random effects to the count data based on W^E, with inference in a Bayesian paradigm using integrated nested Laplace approximations (INLA, Rue et al., 2009).
Our general Poisson log-linear count data model with CAR structured random effects is outlined in Section 2, while our graph-based optimisation algorithm is outlined in Section 3. The superiority of our estimated W^E compared with a traditional border sharing based neighbourhood matrix W is thoroughly evidenced by simulation in Section 4, while in Section 5 our approach is applied to a new respiratory disease surveillance study based in Greater Glasgow in Scotland. Finally, Section 6 concludes the paper.
Spatio-temporal areal unit modelling for count data

The study region is partitioned into K non-overlapping areal units such as Census Tracts, and data are available for each of these units for t = 1, . . . , N consecutive time periods.
The outcome variable Y_kt is a spatio-temporally aggregated count of the number of events that occur in areal unit k during time period t, and is often accompanied by a vector of p covariates x_kt and an expected count e_kt. The latter allows for the fact that the areal units have different population sizes and age-sex demographics, which affect the observed count, and e_kt is typically included as an offset term when modelling these data. A general model for these data within a Bayesian inferential setting is given by

    Y_kt ∼ Poisson(e_kt θ_kt)   for k = 1, . . . , K and t = 1, . . . , N,
    ln(θ_kt) = x_kt^⊤ β + ψ_kt,                                              (1)
    β_j ∼ N(0, 100000)   for j = 1, . . . , p.
Here θ_kt denotes the risk or rate of the outcome variable Y_kt relative to the expected count e_kt, and the spatio-temporal variation in this risk (rate) is modelled by covariates {x_kt} and random effects {ψ_kt = φ_kt + δ_t}. The covariate regression parameters β = (β_1, . . . , β_p) are assigned independent weakly informative zero-mean Gaussian prior distributions with a large variance, to ensure the data play the dominant role in estimating their values. An appropriate random effects structure depends on both the residual spatio-temporal structure in the data and the goal of the analysis, and here we utilise the model proposed by Waller et al. (1997), which decomposes this structure into separate spatial surfaces φ_t = (φ_1t, . . . , φ_Kt) for each time period t and an overall temporal trend δ = (δ_1, . . . , δ_N).
We adopt this structure because we believe that while the residual spatial surfaces will be similar each year, they will not be identical; assuming a single spatial structure common to all years, as in Knorr-Held (2000), would be overly restrictive. We model the residual temporal trend by the first order autoregressive process

    δ_t | δ_{t−1} ∼ N(α δ_{t−1}, σ²)   for t = 2, . . . , N.                 (2)

The prior distributions and their parameterisations are chosen to be weakly informative, and are the default specifications suggested by the INLA software (Rue et al., 2009) that we use for inference. We model the residual spatial trend for time period t using the conditional autoregressive prior proposed by Leroux et al. (2000), which is given by

    φ_kt | φ_{−kt} ∼ N( ρ_t Σ_{j=1}^K w_kj φ_jt / (ρ_t Σ_{j=1}^K w_kj + 1 − ρ_t) ,  1 / [τ_t (ρ_t Σ_{j=1}^K w_kj + 1 − ρ_t)] ),      (3)

where φ_{−kt} = (φ_1t, . . . , φ_Kt) \ φ_kt. Spatial autocorrelation is induced by the neighbourhood matrix W, and we adopt the commonly used binary border sharing definition described above. The level of spatial dependence is controlled globally by ρ_t, with ρ_t = 0 corresponding to spatial independence (as (3) simplifies to φ_kt ∼ N(0, 1/τ_t)), while if ρ_t = 1 then (3) becomes the intrinsic CAR model proposed by Besag et al. (1991).
A weakly informative normal prior on the logit scale is specified for the spatial dependence parameter ρ_t, while a weakly informative log-gamma prior is specified for the log of the spatial precision τ_t, again following the defaults suggested by the INLA software. The partial spatial autocorrelation structure implied by this model is given by

    Corr(φ_kt, φ_jt | φ_{−kjt}) = ρ_t w_kj / √[ (ρ_t Σ_{i=1}^K w_ki + 1 − ρ_t)(ρ_t Σ_{i=1}^K w_ji + 1 − ρ_t) ],      (4)

where φ_{−kjt} = φ_t \ {φ_kt, φ_jt}. Thus W controls the partial spatial autocorrelation structure in φ_t, because if w_kj = 1 then (φ_kt, φ_jt) are partially correlated, with the strength of that correlation controlled globally for all pairs of neighbouring areas by ρ_t, whereas if w_kj = 0 then (φ_kt, φ_jt) are conditionally independent. Thus while W is crucial to the model because it determines the spatial correlation structure in the data, its appropriateness for the data, and the sensitivity of the results to changing its specification, are rarely assessed.
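As a concrete check of the conditional specification above, the Leroux et al. (2000) prior can equivalently be written as a zero-mean multivariate Gaussian with precision matrix Q = τ_t[ρ_t(diag(W1) − W) + (1 − ρ_t)I]; this matrix form is a standard result stated here as an assumption rather than taken from the text. The numpy sketch below verifies on a toy graph that the full conditional mean and variance implied by Q match the Leroux formulae:

```python
import numpy as np

# Leroux CAR precision matrix: Q = tau * (rho*(D - W) + (1-rho)*I),
# where D = diag of row sums of the binary neighbourhood matrix W.
def leroux_precision(W, rho, tau):
    D = np.diag(W.sum(axis=1))
    return tau * (rho * (D - W) + (1 - rho) * np.eye(len(W)))

# Toy 4-area example: a path graph 1-2-3-4.
W = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
rho, tau = 0.6, 2.0
Q = leroux_precision(W, rho, tau)

# Full conditional of phi_k given the rest: mean = -(1/Q_kk) * sum_{j!=k} Q_kj phi_j,
# variance = 1/Q_kk.  For k = 1 this should match the Leroux formulae:
phi = np.array([0.3, -0.1, 0.4, 0.2])
k = 1
mask = np.arange(4) != k
cond_mean = -(Q[k, mask] @ phi[mask]) / Q[k, k]
d_k = W[k].sum()
lx_mean = rho * (W[k] @ phi) / (rho * d_k + 1 - rho)   # Leroux conditional mean
lx_var = 1.0 / (tau * (rho * d_k + 1 - rho))           # Leroux conditional variance
assert np.isclose(cond_mean, lx_mean) and np.isclose(1.0 / Q[k, k], lx_var)
```

Setting rho = 0 reduces Q to τI (independence), and rho = 1 gives the singular intrinsic CAR precision τ(D − W), mirroring the two limiting cases discussed above.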
Furthermore, specifying W via border sharing implies that all pairs of geographically adjacent areal units will have correlated random effects, which spatially smooths their values towards each other. However, the residual spatial surface in real data sets often exhibits areas of spatial smoothness separated by step changes, an example of which can be seen in Figure 1. Additionally, the identification of such step changes can be the goal of the analysis, such as in the areas of spatial clustering and boundary analysis highlighted earlier. Therefore in the next section we propose a novel graph-based optimisation algorithm for estimating a more appropriate neighbourhood matrix W^E for the data that leads to improved inference.

Methodology
We propose a novel two-stage approach for estimating the model parameters Θ = (β, δ, σ, α, φ_1, . . . , φ_N, ρ_1, . . . , ρ_N, τ_1, . . . , τ_N) and an appropriate neighbourhood matrix W^E, which extends the currently used approach of naively fixing W based on the border sharing rule. In stage 1 we estimate W^E using a graph-based optimisation algorithm, and in stage 2 we estimate the posterior distribution f(Θ | W^E, Y) conditional on (Y, W^E). Our methodology thus brings areal unit modelling into line with standard practice in geostatistical modelling, which is to first estimate a trend model and then identify an appropriate correlation structure via residual analysis. In what follows W denotes the neighbourhood matrix constructed based on border sharing, while W^E denotes our estimated matrix.

Stage 1 - Estimating W^E
We estimate a single W^E for the data, which requires the residual spatial structure to be similar for all time periods. We do this because we need multiple realisations of the spatial surface to estimate its correlation structure via W^E well, which is evidenced by the simulation study in Section 4. Therefore, we first estimate a single residual spatial surface φ̃ = (φ̃_1, . . . , φ̃_K) for all time periods, which is then used to estimate W^E.

Estimating φ̃
In classical geostatistics with normally distributed data and mean model E[Y_kt] = x_kt^⊤ β, one examines the raw residuals φ̃_kt = Y_kt − x_kt^⊤ β̂ to identify an appropriate correlation structure, where initially β̂ is estimated assuming independent errors. The analogous approach for our count data model (1) is to compute

    φ̃_kt = ln(Y_kt) − ln(e_kt) − x_kt^⊤ β̂,      (5)

which replaces the unknown E[Y_kt] with the observed data Y_kt. The mean model parameters β are again estimated assuming independence, and {δ_t} is removed as it is constant over space and hence does not impact on the estimation of the spatial correlation structure.
Then we estimate a common residual spatial surface by averaging over the N time periods, that is φ̃_k = (1/N) Σ_{t=1}^N φ̃_kt for all k.
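A minimal numpy sketch of this two-step residual calculation, assuming β̂ has already been estimated from a covariate-only model fitted under independence (the array shapes and toy numbers are illustrative):

```python
import numpy as np

# phi_tilde_kt = log(Y_kt) - log(e_kt) - x_kt' beta_hat, then average over
# the N time periods to obtain a single residual surface of length K.
def residual_surface(Y, e, X, beta_hat):
    # Y, e: (K, N) arrays of counts and expected counts;
    # X: (K, N, p) covariate array; beta_hat: (p,) estimates.
    phi_kt = np.log(Y) - np.log(e) - X @ beta_hat   # (K, N) residuals
    return phi_kt.mean(axis=1)                      # temporal average, length K

# Tiny worked example with K = 2 areas, N = 2 periods, p = 1 covariate.
Y = np.array([[20., 22.], [10., 12.]])
e = np.array([[10., 10.], [10., 10.]])
X = np.zeros((2, 2, 1))                             # covariate effect switched off
phi_tilde = residual_surface(Y, e, X, np.zeros(1))  # [~0.74, ~0.09]
```

With the covariate switched off, the residuals reduce to the average log standardised ratio ln(Y_kt / e_kt), which is why area 1 (counts roughly double its expectation) gets a clearly positive residual.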

Deriving an objective function to optimise
The CAR model (3) represents a graph G whose vertex-set V(G) is the set of K areal units, and whose edge-set E(G) = {(k, j) | w_kj = 1} is a subset of un-ordered pairs of elements of V(G). In graph theoretic terms G is the simple graph with adjacency matrix W, where W = (w_kj) is defined by the border sharing rule. Given φ̃ we estimate W^E by searching for a suitable subgraph of G which maximises the value of an objective function J(φ̃). We base the objective function on the special case of (3) where ρ_t = 1, which assumes that all pairs of neighbouring areal units (those with w_kj = 1) have correlated random effects. This allows step changes in the spatial surface to be identified by removing edges from the graph (i.e. setting w^E_kj = 0), which makes the corresponding random effects conditionally independent as illustrated by (4). Therefore fixing ρ_t = 1 in (3) and dropping the time subscript t, as φ̃ is an average over all time periods, we obtain the following objective function from the sum of the log full conditional densities:

    J(φ̃) = Σ_{k=1}^K ln f(φ̃_k | φ̃_{−k}) = Σ_{k=1}^K [ (1/2) ln(τ d_k) − (τ d_k / 2)(φ̃_k − φ̄_k)² ] − (K/2) ln(2π),      (6)

where d_k = Σ_{j=1}^K w_kj is the number of neighbours of area k and φ̄_k = (Σ_{j=1}^K w_kj φ̃_j) / d_k is the mean of φ̃ over those neighbours.
After removing unnecessary constants, J(φ̃) becomes

    J(φ̃) = (K/2) ln(τ) + (1/2) Σ_{k=1}^K ln(d_k) − (τ/2) Σ_{k=1}^K d_k (φ̃_k − φ̄_k)²,      (7)

which depends on the precision parameter τ. Estimating τ by maximising (7) yields the maximum likelihood estimate τ̂ = K / Σ_{k=1}^K d_k (φ̃_k − φ̄_k)², which when plugged into (7) yields the final objective function

    J(φ̃) = (1/2) Σ_{k=1}^K ln(d_k) − (K/2) ln( Σ_{k=1}^K d_k (φ̃_k − φ̄_k)² ) + constant.      (8)

This function only depends on (φ̃, W), where the latter is the only quantity to be optimised, as φ̃ is estimated as described above.
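A numpy sketch of evaluating the profiled objective, assuming it takes the form J = (1/2) Σ_k ln d_k − (K/2) ln(Σ_k d_k (φ̃_k − φ̄_k)²) up to an additive constant, where d_k is the degree of area k and φ̄_k is the mean of φ̃ over its neighbours:

```python
import numpy as np

# Profiled objective for a candidate neighbourhood matrix W and residuals phi.
def objective(W, phi):
    d = W.sum(axis=1)                    # degrees; the constraint keeps all >= 1
    phibar = (W @ phi) / d               # neighbourhood means
    nd = (phi - phibar) ** 2             # neighbourhood discrepancies
    K = len(phi)
    return 0.5 * np.log(d).sum() - (K / 2) * np.log((d * nd).sum())

# For a fixed W, a smooth surface scores higher than one with a step change:
W = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
smooth = np.array([0.0, 0.1, 0.2, 0.3])
step = np.array([0.0, 0.1, 2.0, 2.1])
assert objective(W, smooth) > objective(W, step)
```

The two competing terms are visible in the code: retaining edges increases the log-degree term, while deleting edges across step changes shrinks the weighted discrepancy sum inside the logarithm.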

Graph-based optimisation
Let H be generic notation for any graph; then we use the following graph theoretic terminology in this section: (i) we write uv for the edge {u, v} with endpoints u and v; (ii) an edge uv is said to be incident with the vertices u and v; (iii) the neighbourhood N_H(v) of a vertex v is the set of vertices u such that uv ∈ E(H); (iv) the degree deg_H(v) of a vertex v is |N_H(v)|; (v) a subgraph H′ of H is a graph with V(H′) ⊆ V(H) and E(H′) ⊆ E(H); and (vi) if H and H′ have the same vertex set we say that H′ is a spanning subgraph of H.
The graph G based on W has vertex-set V(G) and edge-set E(G), and we assume that edges e ∈ E(G) can be removed from the graph but that new edges cannot be added. This means that one can estimate w^E_kj ∈ {0, 1} if w_kj = 1, but if w_kj = 0 then w^E_kj remains fixed at zero. Additionally, we assume that each area (vertex) must retain at least one edge in the graph, which corresponds to the constraint Σ_{j=1}^K w^E_kj > 0 for all k. This ensures that we do not divide by 0 in (8). Let f(H, φ̃) denote the value of J(φ̃) corresponding to W_H, the adjacency matrix of the subgraph H of G. Then the goal of our optimisation problem can be phrased as finding a spanning subgraph G̃ of G, with minimum degree at least one, which maximises f(G̃, φ̃).
This graph optimisation problem is known to be NP-hard (Lee and Meeks, 2020), and so is extremely unlikely to admit an exact algorithm that terminates in polynomial time on all possible inputs. Moreover, this intractability result holds even if we assume that the input graph G is planar; our input graph is necessarily planar because it is derived from the adjacencies of non-overlapping regions in the plane. In this work we therefore adopt a heuristic local search approach, which we describe in detail in the rest of this section.
It should be emphasised that this algorithm is not guaranteed to find the global optimal solution; we leave a more in-depth study of the existence or otherwise of algorithms with provable performance guarantees for future work.
A brute force optimisation strategy would consider all possible subsets of edges to delete (the number of which is exponential in the number of edges in the original graph), and choose the one which maximises the objective function. However, such a running time is already infeasible in our relatively small example with 671 edges. To avoid this, we instead obtain an improved matrix W^E by carrying out a sequence of local optimisation operations; this is much faster, but is not guaranteed to result in a globally optimal solution.
For our heuristic local optimisation, we consider the vertices of the graph in some fixed order, and attempt to optimise the set of edges incident with each vertex in turn. The reason that this does not necessarily find a global optimum is that the effect of deleting the edge uv depends on the set of edges incident at both u and v, so we have to choose a set of neighbours to retain for v without necessarily knowing which neighbours u will retain in the final solution. To deal with this, we decide whether or not to delete an edge by considering the difference between the contribution to the objective function from u (respectively v) from the best possible set of incident edges at u (respectively v) that does include the edge uv, and the best possible set that does not include this edge.
In order to apply this strategy, we need to express the objective function as a sum of contributions associated with each vertex of the graph, so that we can assess the impact of making local changes associated with an individual vertex. As a first step, we reformulate equation (8) in more graph theoretic notation. To do this, we set V = V(G) (observing that we use the same vertex set throughout), and note that |V| = K. For a vertex v corresponding to region k in the matrix, we set φ̃_v = φ̃_k. This gives

    J(φ̃) = (1/2) Σ_{v∈V} ln(deg_H(v)) − (K/2) ln( Σ_{v∈V} deg_H(v) [φ̃_v − (1/deg_H(v)) Σ_{u∈N_H(v)} φ̃_u]² ) + constant.      (9)

To simplify notation, we will write ND_H(v, φ̃) for the neighbourhood discrepancy defined by ND_H(v, φ̃) = [φ̃_v − (1/deg_H(v)) Σ_{u∈N_H(v)} φ̃_u]². It is now clear that, to maximise the right-hand side of (9), on the one hand we would like to retain as many edges as possible to maximise the first term, but on the other hand we minimise the second term by deleting edges to decrease the neighbourhood discrepancy at each vertex. We can now associate with a given vertex v the following contribution, cont(v, H, φ̃), to the right-hand side of equation (9):

    cont(v, H, φ̃) = (1/2) ln(deg_H(v)) − (K/2) ln( deg_H(v) ND_H(v, φ̃) + Σ_{w∈V\{v}} deg_H(w) ND_H(w, φ̃) ).
The remaining barrier to using this expression to carry out locally optimal modifications is that the value of Σ_{w∈V\{v}} deg_H(w) ND_H(w, φ̃) depends on the entire graph, not just the edges incident with v, so we cannot compute the value of cont(v, H, φ̃) knowing only the neighbours of v in H. To deal with this, we define the adjusted contribution of v in H, with respect to a second graph H′, as

    cont_{H′}(v, H, φ̃) = (1/2) ln(deg_H(v)) − (K/2) ln( deg_H(v) ND_H(v, φ̃) + Σ_{w∈V\{v}} deg_{H′}(w) ND_{H′}(w, φ̃) ),

so that the global part of the sum is evaluated in the reference graph H′ rather than in H. The key point is that if deleting an edge e incident with v increases the adjusted contribution at v with respect to H′, then the contribution at v is still increased by deleting e even when deletions are also carried out elsewhere in the graph to decrease the weighted sum of neighbourhood discrepancies.
These observations motivate our iterative approach. At the first step we consider the first vertex v and use the original graph G to identify a set of edges incident with v to delete (by considering the adjusted contribution with respect to G at both endpoints of the edges in question). We then delete these edges to obtain a new graph G′ and continue with the next vertex, this time considering the adjusted contribution with respect to G′.
We continue in this way, returning to the start of the vertex list when we reach the end, until we complete a pass through all remaining feasible vertices (that is, those which still have more than one neighbour in the modified graph) without identifying any deletions that increase the objective function.
The algorithm is summarised in pseudocode as Algorithm 1 in the appendix. We note that the running time depends exponentially on the maximum degree, but only linearly on the number of edges. It is not unreasonable to expect that the maximum degree will in practice be small compared with the total number of vertices or edges: it is unlikely that any one areal unit will border a very large number of other units (in our example the maximum degree is 22).
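The flavour of the local search can be conveyed with a deliberately simplified Python sketch: it sweeps through the vertices and tests single-edge deletions against the full objective, rather than optimising whole sets of incident edges via adjusted contributions as Algorithm 1 does, but it respects the minimum-degree-one constraint and the stop-when-no-improvement rule:

```python
import numpy as np

def objective(W, phi):
    # Profiled objective (assumed form): 0.5*sum(log d) - (K/2)*log(sum d*ND).
    d = W.sum(axis=1)
    phibar = (W @ phi) / d
    K = len(phi)
    return 0.5 * np.log(d).sum() - (K / 2) * np.log((d * (phi - phibar) ** 2).sum())

def local_search(W, phi):
    # Repeatedly sweep the vertices, deleting any incident edge whose removal
    # (i) keeps both endpoints at degree >= 1 and (ii) increases the objective,
    # until a full pass makes no deletion.
    W = W.copy()
    improved = True
    while improved:
        improved = False
        for k in range(len(phi)):
            for j in np.flatnonzero(W[k]):
                if W[k].sum() <= 1 or W[j].sum() <= 1:
                    continue                  # preserve minimum degree one
                cur = objective(W, phi)
                W[k, j] = W[j, k] = 0         # tentatively delete edge kj
                if objective(W, phi) > cur:
                    improved = True           # keep the deletion
                else:
                    W[k, j] = W[j, k] = 1     # revert
    return W

# Path graph 0-1-2-3 with a step change between vertices 1 and 2:
W0 = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
phi = np.array([0.0, 0.1, 4.9, 5.0])
WE = local_search(W0, phi)
assert WE[1, 2] == 0 and WE[0, 1] == 1 and WE[2, 3] == 1
```

On this toy input the search removes exactly the edge spanning the step change, splitting the path into two internally smooth components, which is the qualitative behaviour the full algorithm is designed to achieve at scale.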

Simulation study
This section presents a simulation study that compares the performance of model (1)-(3) based on a neighbourhood matrix that is: (i) constructed using the border sharing rule (denoted by W); or (ii) estimated using graph-based optimisation (denoted by W^E).

Data generation
The study region is the K = 257 Intermediate Zones (IZs) that make up the Greater Glasgow and Clyde Health Board (GGCHB) in Scotland, which is the setting for the motivating case study presented in Section 5. Count data are generated for this region from model (1)-(3), and in addition to varying the number of time periods N and the disease prevalence (via the expected counts e_kt), we also vary the sizes of the step changes we generate in the residual surface φ_t.
Each simulated data set includes an independent (x_1) and a spatially autocorrelated (x_2) covariate, and the corresponding regression parameters are fixed at β_1 = β_2 = 0.05.
Both covariates are generated from zero-mean multivariate normal distributions with a standard deviation of 0.5 separately for each time period, with the independent covariate x_1 having the identity correlation matrix. The correlation matrix for x_2 is the spatial exponential correlation matrix Σ = exp(−ξD), where D is a K × K distance matrix between the centroids of the K IZs. The spatial range parameter ξ was chosen to ensure the covariate was visually spatially smooth, which was achieved by fixing ξ so that the mean correlation across all pairs of IZs was 0.25.
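A numpy sketch of this covariate generation step, using hypothetical centroids and a simple bisection on ξ (one of several reasonable ways to hit the target mean correlation of 0.25):

```python
import numpy as np

# Hypothetical centroids standing in for the K IZ centroids.
rng = np.random.default_rng(1)
coords = rng.uniform(0, 10, size=(50, 2))
D = np.linalg.norm(coords[:, None] - coords[None], axis=2)   # distance matrix

def mean_offdiag_corr(xi):
    # Mean pairwise correlation implied by Sigma = exp(-xi * D).
    S = np.exp(-xi * D)
    return S[np.triu_indices(len(D), k=1)].mean()

# Bisection: the mean correlation is decreasing in xi, so bracket and halve.
lo, hi = 1e-6, 100.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (lo, mid) if mean_offdiag_corr(mid) < 0.25 else (mid, hi)
xi = 0.5 * (lo + hi)

# Draw the spatially smooth covariate with sd 0.5, as in the design.
Sigma = np.exp(-xi * D)
L = np.linalg.cholesky(Sigma + 1e-10 * np.eye(len(D)))        # jitter for safety
x2 = 0.5 * L @ rng.standard_normal(len(D))
```

The same machinery, with the target mean correlation changed to 0.15, generates the residual spatial surfaces described in the next paragraph.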
Temporal autocorrelation was induced into each simulated data set by a first order autoregressive process with AR(1) coefficient α = 0.8. Similarly, spatial autocorrelation was induced via a multivariate normal distribution with a spatial exponential correlation matrix Σ = exp(−ξD), where ξ was chosen so that the mean pairwise correlation across all IZs was 0.15. To ensure that each time period had a similar but not identical residual spatial surface, φ_t was generated as the sum φ_t = φ + φ*_t, with a common spatial surface φ for all time periods and time period specific deviations φ*_t with a lower variance. The mean of φ_t is denoted by µ, and this is the mechanism by which step changes are induced into φ_t. Specifically, µ is piecewise constant with levels (−λ, 0, λ), where λ determines the size of the step changes. Here we consider values of λ = 0, 0.25, 0.5 in our simulation design, where λ = 0 corresponds to no step changes while λ = 0.5 corresponds to large step changes. These mean values (−λ, 0, λ) are assigned to the IZs to match the structure of the case study data as closely as possible, with, for example, IZs that exhibit comparatively high rates {θ_kt} being assigned a mean value of λ. Example realisations of φ_t for all 3 values of λ are presented in Section 1 of the supplementary material accompanying this paper.

Results
One hundred data sets are generated under each of 18 scenarios, which comprise all combinations of the number of time periods N, the step change size λ ∈ {0, 0.25, 0.5}, and the disease prevalence determined by e_kt. The results are summarised in Table 1, which displays the root mean square errors (RMSE) of the estimated risks as well as the coverage probabilities and average widths of the associated 95% credible intervals.
The table shows 3 main findings, the first of which is that for purely spatial data (N = 1), estimating W^E leads to worse results than using the simple border sharing matrix W. This worse performance is highlighted by slightly larger RMSEs and coverage probabilities below the nominal 95% level, and occurs because the estimate of φ̃ used in the objective function (8) is based on only one set of spatial residuals from (5), and thus does not provide a good enough estimate of the unknown residual structure in the data. Secondly, if N > 1 but the risk surface does not exhibit step changes (λ = 0), then the RMSEs are broadly similar between the two methods. However, the uncertainty quantification is better when using W^E, with coverage probabilities closer to 95% when the disease is rare (e_kt ∈ [10, 30]) and narrower intervals with similar coverage probabilities when the disease is common (e_kt ∈ [150, 250]).
Finally, if one has spatio-temporal data (N > 1) that contain step changes (λ > 0), then using W^E always produces better risk (rate) estimation compared with using W. This improved estimation includes RMSEs reduced by between 11.8% and 25.9%, and similar coverage probabilities obtained from credible intervals that are narrower by between 10.3% and 22.1%. Both these improvements occur because W^E better represents the residual spatial structure in the data than W, for example by allowing for the locations of step changes by setting the appropriate w^E_kj = 0. As the data contain multiple time periods, the replication in the spatial surface leads to better estimates of φ̃ compared to when N = 1, which drives the improvements in inference. The reduced widths of the 95% credible intervals when using W^E arise because this matrix does not enforce correlation between neighbouring areas that exhibit a step change between them, which means that the variance 1/τ is not inflated to account for the spatial smoothing that would otherwise be enforced between areal units with very different data values.

Motivating study -respiratory ill health in Glasgow
Health care in Scotland is managed locally by 14 regional health boards, and here we focus on the Greater Glasgow and Clyde Health Board (GGCHB) because it exhibits some of the poorest health and widest health inequalities in western Europe (Walsh et al., 2016). Specifically, the health board is interested in: (i) identifying areas that exhibit elevated risks of ill health, allowing the appropriate targeting of health interventions; and (ii) quantifying whether inequalities in risk between rich and poor communities are widening or narrowing over time. We address these questions in the context of respiratory disease because it is one of the leading causes of death in Scotland (https://www.nrscotland.gov.uk/statistics-and-data). The data are yearly counts of respiratory hospitalisations and corresponding expected counts for each of the K = 257 IZs between 2011 and 2017. Figure 1 displays the standardised morbidity ratio (SMR) for each IZ over the 7-year period (i.e. SMR_k = Σ_{t=1}^7 Y_kt / Σ_{t=1}^7 e_kt), which shows substantial variation with SMRs ranging between 0.57 and 2.33.
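The SMR is a simple ratio of pooled observed to pooled expected counts; a toy numpy illustration (the counts below are made up):

```python
import numpy as np

# SMR_k = sum_t Y_kt / sum_t e_kt, pooling counts over the study years.
Y = np.array([[30., 35., 33.], [12., 10., 11.]])   # toy (K=2, N=3) observed counts
e = np.array([[20., 20., 20.], [15., 15., 15.]])   # toy expected counts
smr = Y.sum(axis=1) / e.sum(axis=1)
# area 0: 98/60 (elevated risk); area 1: 33/45 (below average)
```

Values above 1 indicate more events than expected given the area's population size and demographics, which is why the observed range of 0.57 to 2.33 signals substantial spatial variation.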
We have access to a number of covariates to explain this spatial pattern in disease risk, the most important of which is the Scottish Index of Multiple Deprivation (SIMD, http://www.gov.scot/Topics/Statistics/SIMD). Deprivation or poverty is a key driving factor in spatial studies of population level ill health (NHS Health Scotland, 2016), in part because of its links to smoking. The SIMD is not computed each year using the same methodology, so here we use the index for 2016 as a purely spatial covariate. The SIMD is a composite index comprising indicators relating to access to services, crime, education, employment, health, housing, and income, and we consider each of these as a possible covariate except for health, as our outcome variable is health related. Furthermore, the crime indicator has one very large outlier (a city centre IZ containing many bars), so it is replaced by the average value from its neighbouring IZs. Finally, the income, employment, and education domains are all collinear, having pairwise correlations between 0.87 and 0.98.
We also consider a measure of fine particulate matter air pollution called PM2.5, because existing studies have shown that it is associated with respiratory ill health in Scotland (Lee et al., 2019). In common with that study we utilise modelled concentrations from the Pollution Climate Mapping (PCM) model (https://uk-air.defra.gov.uk/data/pcm-data), because measured data are not available at the small area IZ scale.
The model produces annual average concentrations on a 1 km² grid across the United Kingdom, which we spatially realign to our IZ scale by averaging.
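A minimal sketch of the realignment step, assuming each grid cell has already been assigned to a single IZ (the cell-to-IZ mapping and concentrations below are hypothetical; area-weighted overlays would be a more refined alternative):

```python
import numpy as np

# Average the grid-cell concentrations within each IZ.
cell_iz = np.array([0, 0, 1, 1, 1, 2])                  # IZ index of each grid cell
cell_pm25 = np.array([8.0, 10.0, 6.0, 7.0, 8.0, 12.0])  # cell concentrations
iz_pm25 = (np.bincount(cell_iz, weights=cell_pm25)
           / np.bincount(cell_iz))                      # per-IZ mean concentration
```

Here `np.bincount` with `weights` sums concentrations within each IZ, and dividing by the unweighted counts gives the per-IZ averages.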

Stage 1 - Estimating W^E
We first fit a simple mean model (model (1) with no random effects) to estimate the residual spatial structure in the data via (5). Initially, we included the three collinear SIMD indicators (education, employment and income) in separate models, and the model with education had the lowest AIC and was thus retained. The remaining covariates crime, housing, access and PM2.5 were then added to the model, and with the exception of housing they all exhibited significant effects at the 5% level and were used in the final mean model. This covariate-only model exhibits substantial overdispersion with respect to the Poisson assumption (Var(Y_kt) = 3.20 × E[Y_kt]), and the residuals from (5) exhibit substantial spatial autocorrelation. We then estimated W^E as described in Section 3, based on the temporally averaged residuals and W constructed using the border sharing rule. This initial W splits the K IZs into two disconnected sub-graphs north and south of the river Clyde, while the estimated graph structure W^E consists of one main sub-graph each side of the river with 4 further smaller disconnected sub-graphs. The graph based on W contains 671 edges compared to 332 for W^E, a 50.5% reduction in the number of edges. The locations of the edges that have been removed are displayed as blue dots in Figure 2, which also displays the temporally averaged residuals across all 7 years. The figure shows that visually the residuals do not exhibit a spatially smooth surface, and that the removed edges (blue dots) mainly correspond to locations where there appear to be step changes. Note, however, that removing an edge makes the corresponding random effects conditionally, and not marginally, independent.

Stage 2 -Modelling the data
Model (1)-(3) is then fitted to the data separately using W and W^E, and inference is based on integrated nested Laplace approximations using the full Laplace approximation.
A summary of the overall fit of each model via the deviance information criterion (DIC) and the effective number of independent parameters (p.d) is presented in Table 2, together with other key model parameters. The table shows that the model using W^E fits the data better than that using W, with reductions in the DIC of around 213 and in the p.d of around 160. The latter suggests that W^E provides a more parsimonious description of the data, which is due to an increase in the precisions (τ_1, . . . , τ_N) (summarised by the range of the posterior medians in Table 2) when using W^E. These increased precisions occur because, unlike W, W^E does not include edges between pairs of geographically adjacent IZs that exhibit large differences in their residuals, which reduces the amount of variation between φ_kt and its spatially weighted mean from (3). This also increases the amount of spatial dependence in each spatial surface, which can be seen by the large increases in (ρ_1, . . . , ρ_N) when using W^E.

Covariate effects
Estimated relative risks (posterior medians) and 95% credible intervals for the covariates are also displayed in Table 2, where each relative risk relates to a one standard deviation increase in the covariate's value. The table shows a significant relative risk of 1.02 for PM2.5 when using W, but a much smaller insignificant association when using W^E. As the simulation study showed that using W^E provides better covariate effect estimates, this is likely to be the more reliable result. The composite education indicator quantifies populations with little or no education (including a standardised ratio of the number of working age people with no qualifications), and increasing it by one standard deviation leads to around a 38% increase in the risk of respiratory hospitalisation. The crime indicator shows that areas with higher crime rates exhibit slightly lower risks, while areas that have to travel further to access amenities (the access variable) also exhibit slightly lower risks.

Disease surveillance
Our main aim is to use the modelling for disease surveillance, identifying areas that are most in need of an intervention to improve their health. Such areas of concern exhibit elevated risks and/or an increasing risk trend, and numerous metrics have been proposed for identifying such areas (see for example Kavanagh et al. (2012)). The most popular metrics for identifying high-risk areas are posterior exceedance probabilities (PEPs), computed as π_kt = P(θ_kt > C | Y), the posterior probability that the risk θ_kt exceeds a certain threshold risk level C. The specification of C is somewhat arbitrary and is chosen following discussions with public health experts; here we choose C = 1.5, which represents a 50% elevated risk compared to the Scottish average. The PEPs for 2017 are displayed in the top panel of Figure 3, which shows that most IZs exhibit either a very high (dark red) or a very low (dark blue) probability of exceeding this threshold risk level. The map highlights two types of exceedances: clusters of geographically adjacent IZs exhibiting elevated risks, and individual IZs that have much higher risks than their neighbours. The east end of Glasgow in the east of the health board is the largest and most well known high-risk cluster, and is in part caused by a cycle of multi-generational poverty (NHS Health Scotland, 2016). In contrast, the single IZ in the north east of the health board near Kirkintilloch exhibits an elevated risk unlike its geographical neighbours, and would warrant further investigation by the health board into why it exhibits a very high PEP.
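Given posterior samples of the risks (e.g. drawn from the fitted model), the exceedance probabilities can be sketched as the proportion of draws above the threshold:

```python
import numpy as np

# pi_kt = P(theta_kt > C | Y), estimated as the proportion of posterior
# draws of theta_kt exceeding the threshold C = 1.5.
def exceedance_prob(theta_samples, C=1.5):
    # theta_samples: (S, K, N) array of S posterior draws of theta_kt
    return (theta_samples > C).mean(axis=0)          # (K, N) probabilities

# Toy check with one area and year: half the draws exceed C, so pi = 0.5.
draws = np.array([1.0, 2.0, 1.2, 1.8]).reshape(4, 1, 1)
```

The Monte Carlo estimate converges to the exact posterior probability as the number of draws S grows, so in practice a few thousand draws per risk are typically used.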
The other type of area of concern for the health board is IZs that exhibit increasing risk trends, and panel (B) of Figure 3 displays the temporal changes in the posterior median risks, {θ_kN − θ_k1}, for each IZ k. The figure highlights that most IZs exhibit some level of increase in respiratory hospitalisation risk over the 7-year study period, which agrees with the exploratory analysis of the SMR in Figure 1. However, a few areas exhibit decreasing risk trends, such as Dalmarnock in the east of the health board (just above Rutherglen on the map), which in this case is due to the regeneration of the area following its use as the athletes' village for the 2014 Glasgow Commonwealth Games.
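The trend metric can be computed from the same posterior output: take the posterior median risk for each area and year, then difference the final and first years. A sketch, again with simulated draws in place of real posterior samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior samples of the risks, shape (n_samples, K, N).
n_samples, K, N = 1000, 6, 7
theta_samples = rng.lognormal(0.0, 0.4, size=(n_samples, K, N))

# Posterior median risk for each area k and year t.
theta_median = np.median(theta_samples, axis=0)  # shape (K, N)

# Change in posterior median risk over the study period: theta_kN - theta_k1.
risk_change = theta_median[:, -1] - theta_median[:, 0]

# Areas with positive changes exhibit increasing risk trends.
increasing = np.flatnonzero(risk_change > 0)
```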

Health inequalities
Health inequalities measure differences in disease risk between population sub-groups, and the World Health Organisation (World Health Organisation, 2013) defines total inequality as the overall variation in disease risk, and social inequality as the variation in risk between different social groups. Here we quantify the size of these inequalities and how they are changing over time. For total inequality, we present the standard deviation, interquartile range and range of the estimated risk surfaces {θ_kt} separately for each year in Table 3. The table illustrates that the total inequalities described above are almost completely driven by socio-economic deprivation, because the lowest decile (most highly educated) has average risks ranging between 0.61 and 0.66, whereas the highest decile (least highly educated) has average risks ranging between 1.86 and 2.08. These social inequalities have changed little over the 7-year study period, with almost no change in the mean risks in the 1st, 5th and 10th deciles of the education covariate over time.
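Given an estimated risk surface, the total inequality summaries are simple dispersion statistics computed year by year, and the social inequality summaries are mean risks within covariate deciles. A sketch using simulated risks and a simulated education covariate (all values illustrative, not the study data):

```python
import numpy as np

rng = np.random.default_rng(1)

K, N = 200, 7
theta = rng.lognormal(0.0, 0.4, size=(K, N))  # posterior median risks (simulated)
education = rng.normal(size=K)                # education covariate (simulated)

# Total inequality: dispersion of the risk surface within each year.
for t in range(N):
    risks = theta[:, t]
    sd = risks.std(ddof=1)
    q1, q3 = np.percentile(risks, [25, 75])
    print(f"year {t + 1}: sd={sd:.2f}, IQR={q3 - q1:.2f}, "
          f"range={risks.max() - risks.min():.2f}")

# Social inequality: mean risk within each decile of the covariate.
deciles = np.digitize(education, np.quantile(education, np.arange(0.1, 1.0, 0.1)))
decile_means = np.array([theta[deciles == d].mean() for d in range(10)])
```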

Discussion
This paper has presented a novel graph-based optimisation algorithm for estimating the neighbourhood matrix when modelling spatio-temporal areal unit count data, and has provided software to allow others to utilise our methods. Our approach thus specifies an appropriate spatial correlation structure for the data at hand via W_E, rather than naively specifying W using a simple geographical rule such as border sharing. The simulation study showed conclusive evidence that using W_E rather than W delivers improved inference in terms of both risk (rate) and covariate effect estimation for spatio-temporal data when the number of time periods is at least N = 5. Our approach estimates the residual spatial autocorrelation structure in the data using the residuals from a covariate-only model, which is akin to applying variogram analysis to detrended geostatistical data to identify an appropriate spatial correlation model. Thus we recommend that, as in geostatistics, standard practice in spatio-temporal areal unit modelling should involve estimating both the mean model and the residual spatial dependence structure, rather than specifying the latter using a convenient rule such as border sharing with little assessment of its suitability, which is currently the norm in the field (e.g. Quick et al., 2017; Lee et al., 2019). The superiority of our approach was comprehensively illustrated for spatio-temporal data with and without step changes in the residual surface, although unsurprisingly the biggest improvements occur when such step changes are present. In contrast, our approach does not work well for purely spatial data (N = 1), because the residual spatial surface φ̂ is not well estimated due to the random noise in the residuals (5) that stems from {Y_kt}.
However, as N increases this random noise is reduced by averaging the residuals over time, leading to improved performance. Thus, to apply this approach to purely spatial data, we suggest estimating φ̂ from multiple sets of external data that have a similar residual spatial structure to the study data. Possible candidates in this regard are the same data for earlier time periods, or data with a related response variable, such as a different disease with a similar aetiology.
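The noise-reduction argument can be illustrated numerically: averaging N independent noisy copies of a fixed spatial surface shrinks the noise standard deviation by a factor of roughly 1/√N. A toy sketch of this effect (not the authors' estimator; the surface and noise levels are invented):

```python
import numpy as np

rng = np.random.default_rng(2)

K = 500
phi = rng.normal(0.0, 1.0, size=K)  # the true residual spatial surface
noise_sd = 1.0

rmses = {}
for N in (1, 5, 20):
    # Residuals for N time periods: the surface plus independent noise.
    resid = phi[:, None] + rng.normal(0.0, noise_sd, size=(K, N))
    phi_hat = resid.mean(axis=1)  # time-averaged residuals
    rmses[N] = np.sqrt(np.mean((phi_hat - phi) ** 2))
    print(f"N={N:2d}: RMSE of phi_hat = {rmses[N]:.2f}")
```

The RMSE of the time-averaged estimate decreases roughly as 1/√N, which is why purely spatial data (N = 1) give a poorly estimated surface while even moderate N performs well.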
Our motivating case study has illustrated the importance of obtaining improved estimation and uncertainty quantification of disease risk, because it will lead to improved accuracy of surveillance metrics such as PEPs that depend on the full posterior distribution. Our case study also illustrates that substantial and sustained inequalities in population-level disease risk remain in the GGCHB, despite extensive governmental focus in recent years on this key public health issue (e.g. NHS Health Scotland, 2016).
There is a wealth of future research directions for extending this work, the most obvious of which is to widen the class of data and models with which our graph-based optimisation approach can be used. These include extending the methods beyond count data to Gaussian and binomial responses, considering multivariate rather than spatio-temporal data structures, and using spatio-temporal random effects structures different from the one considered here. Additionally, our motivating study has shown that similar levels of disease risk are more commonly observed between areas with similar levels of socio-economic deprivation than between those that happen to be geographically close.
This suggests that one might want to additionally allow for correlation between areas with similar levels of socio-economic deprivation, perhaps via the introduction of a second neighbourhood matrix based on socio-economic rather than physical adjacency. This results in the data having a correlation structure based on a multilayer graph, and our optimisation approach would need to be extended to allow for this multilayer scenario.
Finally, there is significant scope to improve the performance of the graph-based optimisation algorithm used to estimate W_E, as the current implementation uses a local search method that is not guaranteed to find the best possible matrix W_E with respect to the objective function. The fact that the optimisation problem is NP-hard in general means that we are very unlikely to find an algorithm that is guaranteed to perform the optimisation exactly within a reasonable length of time for all possible inputs. Nevertheless, it may be possible to obtain an efficient approximation algorithm that achieves a guaranteed performance ratio (for example, computing a matrix for which the objective function is at most 5% worse than the best possible), or parameterised algorithms that have exponential running time in the worst case but are guaranteed to run much faster on inputs with specific structural properties. Further work is needed to establish the feasibility or otherwise of both approaches.
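As a concrete illustration of the local search idea (not the authors' algorithm, whose objective function is not reproduced here), the following sketch greedily toggles single edges of a symmetric binary neighbourhood matrix, accepting only moves that improve a user-supplied objective; like any hill climber, it can terminate at a local optimum:

```python
import numpy as np

def local_search(W, objective, rng, max_iter=2000):
    """Greedy local search over symmetric binary neighbourhood matrices.

    Proposes toggling a single edge (k, j) at each step and keeps the
    move only if it lowers the objective. The objective is a placeholder
    for whichever model-based criterion is being optimised.
    """
    W = W.copy()
    best = objective(W)
    K = W.shape[0]
    for _ in range(max_iter):
        k, j = rng.choice(K, size=2, replace=False)  # k != j, diagonal untouched
        W[k, j] = W[j, k] = 1 - W[k, j]       # toggle the edge, keep W symmetric
        value = objective(W)
        if value < best:
            best = value                      # accept the improving move
        else:
            W[k, j] = W[j, k] = 1 - W[k, j]   # revert the move
    return W, best

# Toy usage: recover a known target matrix from an empty starting graph,
# using edge-wise disagreement as a stand-in objective.
rng = np.random.default_rng(3)
K = 8
W_target = np.triu(rng.integers(0, 2, size=(K, K)), 1)
W_target = W_target + W_target.T
W0 = np.zeros((K, K), dtype=int)
W_est, value = local_search(W0, lambda W: np.abs(W - W_target).sum(), rng)
```

An approximation or parameterised algorithm would replace this accept-if-better loop with a procedure carrying a provable guarantee on the final objective value or the running time.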

Introduction
This supplementary material contains the following sections. Section 1 presents example realisations of the random effects surfaces generated in the simulation study, while Section 2 presents additional simulation results relating to the estimation of covariate effects.
1 Example realisations of the simulated random effects surfaces
Example realisations of φ_t for all 3 values of λ (which controls the size of the step changes) are presented in Figure 1, to illustrate the spatial pattern in, and the magnitude of, these step changes. The figure shows that when λ = 0 the residual spatial surface appears visually smooth, whilst when λ = 0.5 large step changes are apparent. As expected, setting λ = 0.25 results in an intermediate map with some small step changes.
2 Additional simulation study results - covariate effects
The simulation study results for the covariate effects are summarised in Tables 1 (independent covariate) and 2 (spatially correlated covariate), which have the same format as Table 1 in the main paper. As before, root mean square errors (RMSE) are presented, along with coverage probabilities and average widths of the associated 95% credible intervals. The tables show that covariate effect estimation exhibits the same 3 main findings as the risk (rate) estimation presented in the main paper. Firstly, if one has spatial rather than spatio-temporal data (N = 1), then using the estimated W_E leads to worse covariate effect estimation than using the simpler geographical specification based on border sharing. This is true for both independent and spatially correlated covariates, with both exhibiting higher RMSE values and poorer coverage when using W_E. However, if one has spatio-temporal data (N > 1) with step changes (λ > 0), then using W_E results in lower RMSE values and narrower credible intervals with more appropriate coverage probabilities (closer to 95%) compared to using W, making it the method of choice in such situations. In the absence of step changes (λ = 0) the two approaches perform similarly, with similar RMSE values for both covariate types and slightly better coverage probabilities when using W_E to estimate the effects of spatially correlated covariates.