Estimates of rate of spread
The dependent variable for the analysis, the rate of spread of endemic bTB (in km per year), was calculated from the estimated location of the endemic front in successive years (the methodology used to generate these data is described in Brunton et al. 2015). A grid of 6.25 km² hexagonal cells was applied to England and Wales, and a rate of spread was obtained for every hexagon through which the endemic front was calculated to have spread between September 2001 and August 2012 (n = 2148).
Variable selection
An extensive dataset of 193 variables was compiled. Variables were selected if there was evidence in the published literature that they were associated with bTB, and if data were available to describe them at the geographical level required for the analysis. The large number of potential covariates was rationalised by reviewing summary statistics and performing bivariable least-squares linear regression against the dependent variable, fitting predicted values and visually assessing the residuals. Analyses were performed in Stata 12 (Stata Corporation, College Station, TX, USA), and a significance level of p < 0.05 was used throughout.
For many of the variables the residuals were not normally distributed, so transformation of the data was explored using Box-Cox regression. Many of the variables may have been acting as proxies for other factors, and thus be correlated with each other. To reduce the risk of multicollinearity, pairwise Pearson product-moment correlations of all variables were produced and strong correlations identified (|r| > 0.8). Where two or more variables were highly correlated, the one with the highest correlation with the dependent variable and/or the greater biological plausibility was retained. This resulted in a reduced list of 75 independent variables. A list of these variables, including the sources of the data, can be found in Table S1 in the Supplementary Information. These variables were grouped under six themes: animal-level factors, farm-level factors, bTB history and testing, landscape characteristics, wildlife and climate. Two variables that were considered a priori confounders and were not grouped under the six themes were the time period in which spread occurred (TpS), and the number of different genotypes of M. bovis present within the hexagon or its six neighbouring hexagons during the time period of spread. TpS, a categorical variable, was coded as an indicator variable. Where appropriate, continuous variables were rescaled by centring around the mean, by subtracting the lowest observable value from each observation where an intercept of zero was not meaningful, or by dividing by a suitable constant (e.g. 100) so that coefficients represented a meaningful unit change. Missing data were examined to determine whether they occurred at random or whether the missingness was related to the unobserved values themselves.
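The screening steps described above can be illustrated with a minimal R sketch (the original screening was performed in Stata 12; the data frame dat, the response spread_rate, the variable candidate_var and the vector covariate_names are hypothetical placeholders):

```r
library(MASS)   # boxcox()

# Bi-variable screening of one candidate covariate against the rate of spread
screen_one <- function(var, dat) {
  fit <- lm(reformulate(var, response = "spread_rate"), data = dat)
  plot(fitted(fit), resid(fit), main = var)   # visual residual assessment
  summary(fit)$coefficients[2, 4]             # p value for the covariate
}

# Box-Cox profile to suggest a transformation where residuals are non-normal
# (requires a strictly positive response)
bc <- boxcox(spread_rate ~ candidate_var, data = dat, plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]

# Pairwise Pearson correlations; flag pairs with |r| > 0.8
r <- cor(dat[, covariate_names], use = "pairwise.complete.obs")
high <- which(abs(r) > 0.8 & upper.tri(r), arr.ind = TRUE)
```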
Linear regression with spatial autocorrelation term
To account for spatial autocorrelation (SAC), an autocorrelation term, calculated from neighbouring rates of spread using a kernel with a bandwidth of 10 km, was included as an independent variable. This approach, in which the term is calculated from the dependent variable, was preferred to the residual autocorrelation (RAC) method, in which the term is derived from the model residuals and therefore depends on the combination of predictor variables, as these changed with each of the multiple models that were developed. Both methods are described and compared by Crase et al. (2012).
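As a rough sketch of how such a term can be constructed, the following R code computes a kernel-weighted mean of neighbouring hexagons' rates of spread with a 10 km bandwidth; the Gaussian kernel form and the use of centroid coordinates x and y (in km) are assumptions for illustration only.

```r
sac_term <- function(x, y, spread_rate, bw = 10) {
  d <- as.matrix(dist(cbind(x, y)))          # pairwise centroid distances (km)
  w <- exp(-0.5 * (d / bw)^2)                # Gaussian kernel weights
  diag(w) <- 0                               # exclude the focal hexagon
  as.vector(w %*% spread_rate) / rowSums(w)  # kernel-weighted mean of neighbours
}

dat$sac <- sac_term(dat$x, dat$y, dat$spread_rate)
```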
Because of the large number of variables available for inclusion in the model, a hierarchical stepwise approach was taken using six thematic models (covariates grouped by theme are described in the Supplementary Information (S2)). The a priori confounders were not included in the thematic models but were forced into the final model. Principal components analysis was used to identify the components that contributed the most variance to the data within each thematic variable set. This was used to guide the selection of variables for inclusion in the modelling, rather than to create new variables from the components, ensuring that the model parameters could be easily interpreted. The variables with the strongest loading in each key component (preferentially those in component 1) were systematically added to a multivariable linear regression model with robust standard errors to allow for the presence of heteroscedasticity. The variance inflation factor for each variable in the thematic model was calculated using the “estat vif” command in Stata to assess whether collinearity was present in the model, and highly collinear variables (VIF > 10) were considered for exclusion. Beginning with this initial thematic model, a backward stepwise approach based on Akaike’s Information Criterion (AIC) was used to select the best-fitting thematic model, with the least important variables (based on p values) being removed first (as recommended by Burnham and Anderson 2002). Following the approach taken by Pioz et al. (2012), models differing by less than two AIC points were considered to receive identical support from the data. In these instances the more parsimonious model was selected, unless there was a good a priori reason for retaining a specific variable. Transformed variables were used where they improved the fit of the model.
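The same thematic workflow can be sketched in R (the original modelling was performed in Stata; theme_vars and the covariates var_a, var_b and var_c are hypothetical names, and HC1 robust standard errors are used as an approximation to Stata's robust option):

```r
library(car)       # vif()
library(sandwich)  # vcovHC()
library(lmtest)    # coeftest()
library(MASS)      # stepAIC()

# PCA loadings used only to guide which variables to offer to the model
pca <- prcomp(dat[, theme_vars], scale. = TRUE)
round(pca$rotation[, 1:3], 2)

# Multivariable thematic model with heteroscedasticity-robust standard errors
fit <- lm(spread_rate ~ var_a + var_b + var_c, data = dat)
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))
vif(fit)                                      # flag VIF > 10 for possible exclusion

best <- stepAIC(fit, direction = "backward")  # backward selection on AIC
```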
The six thematic models were then sequentially added into one overall model, starting with the model with the smallest root mean squared error (RMSE). The F test was used to determine whether each thematic set of variables contributed significantly to the overall model; if the F test gave a p value greater than 0.05, all variables in that group were removed. Finally, using the same backward stepwise approach based on AIC as applied to the thematic models, the overall model was developed using the remaining variables from the thematic models and the a priori confounders. Variables significant at p ≤ 0.05 were retained in the model. Variables that had been removed were added back into the model one at a time and reconsidered for inclusion if they generated a p value of less than 0.05. The likelihood ratio test was then used to determine whether the model including the previously dropped variables gave a better fit to the data than the model excluding them.
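These nested-model comparisons correspond, in R terms, to an F test and a likelihood ratio test between models with and without the block of variables in question; the covariate names below (sac, tps, n_genotypes, herd_size, badger_density) are illustrative placeholders.

```r
library(lmtest)  # lrtest()

base_model <- lm(spread_rate ~ sac + tps + n_genotypes, data = dat)
with_theme <- update(base_model, . ~ . + herd_size + badger_density)

anova(base_model, with_theme)   # F test: drop the block if p > 0.05
lrtest(base_model, with_theme)  # likelihood ratio test when re-adding dropped terms
```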
The residuals of the model were assessed using the Breusch-Pagan/Cook-Weisberg test for heteroscedasticity (Breusch and Pagan 1979). This generated a p value of less than 0.001, indicating sufficient evidence to reject the null hypothesis of constant residual variance (homoscedasticity), and thus it was appropriate to use robust standard errors.
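In R, an equivalent check and correction (a sketch only; final_model is a placeholder for the fitted overall model) would be:

```r
library(lmtest)    # bptest(), coeftest()
library(sandwich)  # vcovHC()

bptest(final_model)  # Breusch-Pagan test; a small p value indicates heteroscedastic residuals
coeftest(final_model, vcov = vcovHC(final_model, type = "HC1"))  # robust standard errors
```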
GWR
GWR analyses were performed in R version 3.0.1 (2013-05-16) using the GWmodel package for all geographically weighted analyses. The R packages ‘RColorBrewer’ and ‘foreign’ were also used to display and export the analysis outputs. A statistical significance level of p ≤ 0.05 was used in all analyses. The methodology loosely followed a workflow for the GWmodel package outlined by Gollini et al. (2013) and can be split into three steps: geographically weighted (GW) summary statistics, GW principal components analysis and GW regression analysis.
For this work we used geographic weighting in its simplest form, applying a moving subset of records to the analysis. For each of the 2148 hexagons, the closest 215 hexagons (i.e. 10 % of the total dataset) were selected to run a localised model. The size of this subset is termed the bandwidth of the GWR analysis. A bandwidth of 215 was shown to fit natural regions within our irregularly shaped study area, and produced no significant change in outputs during the GW PCA when compared with the automatically calculated bandwidth of 348 generated by the bw.gwpca function of GWmodel.
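A minimal sketch of this bandwidth comparison with GWmodel is given below; sp_dat is assumed to be a SpatialPointsDataFrame of hexagon centroids carrying the covariates, and gw_vars a hypothetical vector of their names.

```r
library(GWmodel)

bw_auto <- bw.gwpca(sp_dat, vars = gw_vars, k = 2, adaptive = TRUE)  # automatic calibration (348 here)
bw_used <- 215                                                       # fixed choice: 10 % of the 2148 hexagons
```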
GW summary statistics, such as regionalised standard deviations and GW inter-quartile ranges, were plotted to highlight areas of high variability for each variable and to identify where the application of GW analysis might warrant close scrutiny. GW principal components analysis (PCA) identified the variables that accounted for the greatest variation within the 10 % subset at each location, applying PCA in a similar way to its use in the linear regression analysis.
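These two steps can be sketched with GWmodel as follows, using the adaptive bandwidth of 215 nearest hexagons and the same assumed objects as above:

```r
library(GWmodel)

# Local summary statistics (quantile = TRUE also returns medians and IQRs)
gw_ss <- gwss(sp_dat, vars = gw_vars, bw = 215, adaptive = TRUE, quantile = TRUE)
# gw_ss$SDF holds the local means, standard deviations and quartiles for mapping

# GW PCA: local loadings show which variables account for most variance
# within each 215-hexagon subset
gw_pca <- gwpca(sp_dat, vars = gw_vars, k = 3, bw = 215, adaptive = TRUE)
```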
The variable selection performed prior to the linear regression determined the variables offered to the GW analysis (as described in Table 2). Further variable selection was conducted to eliminate variables exhibiting substantial regional collinearity, before the final model was selected using a stepwise approach based on AIC.
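Once a set of covariates had been retained, the local regression itself can be fitted with GWmodel's gwr.basic; the covariates below are illustrative placeholders rather than the variables actually retained.

```r
library(GWmodel)

gwr_fit <- gwr.basic(spread_rate ~ sac + herd_size + badger_density,
                     data = sp_dat, bw = 215, adaptive = TRUE)
gwr_fit$SDF  # local coefficients, standard errors and t values for each hexagon
```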
The rationale for using the same variables offered to the linear regression analysis as the starting point for variable selection for the GW analysis was that the original complete covariate dataset collated for the project contained too many variables to model. It included a number of alternative measures of similar environmental or farm characteristics, meaning that strong relationships were found between similar groups of predictors. Additionally, while the GW analysis was intended to assess regional variation in the drivers of the rate of spread of endemic bTB, the ultimate goal of the project was to provide information that could be used to inform national bTB control policies. It therefore made sense to start with the variables used to model the rate of spread at a national scale, since these were likely to be of most importance to policy makers, and to examine how their significance and relationships varied between areas using the GW approach.
A number of criteria were used to deal with multicollinearity: primarily estimates of correlation, complemented by GW PCA to identify which of the variables accounted for the majority of the variance within the predictor database. Where further variable selection was required, decisions were based on biological plausibility and the suitability of variables as targets for policy development for practical interventions.
The variables that were most influential on the rate of spread according to this model were mapped to illustrate the geographical variation in key variables. For each variable, the number of hexagons in which it had the most influence on the rate of spread (as determined by the smallest p value) was calculated.
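One way to tally this from the gwr.basic output is sketched below; the "_TV" column-name convention for local t values and the approximate degrees of freedom used to convert them to p values are assumptions for illustration.

```r
tv_cols <- grep("_TV$", names(gwr_fit$SDF@data), value = TRUE)
tv <- abs(as.matrix(gwr_fit$SDF@data[, tv_cols]))
p_local <- matrix(2 * pt(tv, df = nrow(tv) - length(tv_cols) - 1, lower.tail = FALSE),
                  nrow = nrow(tv), dimnames = dimnames(tv))

most_influential <- tv_cols[max.col(-p_local)]  # smallest local p value per hexagon
table(most_influential)                         # number of hexagons per variable
```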
BRT
Boosted regression trees (BRT) modelling was used to perform a preliminary validation of the GWR outputs and predictor variable selection. This method is now widely used in spatial modelling. It is an iterative machine learning technique based on regression trees that attempts to minimise a loss function (deviance) and does not assume a defined starting distribution (Elith and Graham 2008). As such it is suited to the use of a large number of covariates and a large number of observations, and is particularly effective at accounting for non-linear relationships with the response variable. The models were offered the same covariates as the final GWR model and were implemented using the VECMAP® software suite. Three area-wide models were run, each for a specific region in which GWR showed a different and consistent relationship with the most important predictor covariates, defined as those with the largest number of hexagons in which they were the most important variable according to the size of the p value.
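An analogous BRT model can be fitted in R with the gbm package (the analysis reported here was implemented in VECMAP®); region_dat, the covariates and the tuning values below are illustrative only.

```r
library(gbm)

brt <- gbm(spread_rate ~ sac + herd_size + badger_density,
           data = region_dat,           # hexagons for one of the three regions
           distribution = "gaussian",   # squared-error (deviance) loss
           n.trees = 2000, interaction.depth = 3,
           shrinkage = 0.01, bag.fraction = 0.5, cv.folds = 5)

summary(brt)  # relative influence of each covariate
```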