Study Site
Northern New England, USA (NNE; Maine, New Hampshire, Vermont), is a diverse socio-ecological system representing a range of landscapes, populations, and lakes. A history of settlement, farming, and timber industry segmented the landscape during the past two centuries. Natural habitat is broadly classified as Eastern Temperate Forest with level 3 ecoregions of Atlantic Maritime Highlands, Northeastern Coastal Zone, and Acadian Plains and Hills (Omernik 1987). The interior of NNE has a humid continental climate (Dfb: Köppen climate classification) with cold winters and seasonal patterns. NNE has a total area of 140,786 km² with 9060 km² of lake surface area and a population of 3,283,562 living in communities ranging from small villages to large cities. Over the past four decades, urban sprawl and impervious surfaces (i.e., conversion of natural land covers to man-made surfaces such as pavement) have increased the most in coastal and interior NNE relative to New England, with much of the development surrounding lakes and lake communities (Torbick and Corbiere 2015a, b). Recent work has shown lake temperatures are increasing at a rate of 0.8 °C/decade in the region (Torbick et al. 2016), which will likely increase the frequency, duration, and magnitude of CHAB events. There are 4117 waterbodies greater than 8 ha, and lake water quality is generally considered “good” in NNE, with 82% categorized as oligotrophic or mesotrophic according to Landsat-derived Trophic Status Index maps (Torbick et al. 2014).
Human Health Case Data
Our team has been building ALS case data in multiple regions, including NNE, for the past 10 years. The database used in this analysis included date of birth, sex, and residential longitude/latitude coordinates for cases collected between January 1999 and October 2009, similar to that used by Caller et al. (2013) and Torbick et al. (2014). Records from Dartmouth Hitchcock Medical Center (DHMC), the Muscular Dystrophy Association of Northern New England, regional clinics, and surveys were searched to identify cases of ALS with diagnosis dates. When possible, we confirmed accuracy of diagnosis, year of diagnosis, and demographic history of the patients identified by review of medical records, the Social Security Death Index, obituaries, and data supplemented from questionnaires. Nine cases only had a town name with no coordinates. These cases were assigned the town centroid as their spatial location using the geocode function in the R package ggmap, which makes use of Google Maps (Kahle and Wickham 2013). This procedure did add spatial uncertainty for distances less than the town aggregation level for these few patients. Furthermore, the spatial extent for this database was restricted to the states of Vermont and New Hampshire and excluded the counties of Bennington (VT) and Cheshire, Hillsborough, Rockingham, and Strafford (NH), giving a total of 347 ALS cases in this selected region. This sub-region of NNE was selected because the ALS dataset used in this analysis is suspected to underestimate the risk for the entire NNE region: it is likely that portions of the NNE population travel to medical centers in other urban areas (e.g., Boston) and thus are not within the clinic/hospital catchments of our dataset (Caller et al. 2015).
In Situ Lake Measurements
A field campaign to collect near-simultaneous (with respect to satellite overpass) in situ measurements of phycocyanin concentration and other parameters across the region was carried out during the summers of 2014, 2015, and 2016 (Fig. 1). The campaign was coordinated with government agencies (EPA, New Hampshire Department of Environmental Services, Maine Department of Environmental Protection, Vermont Department of Environmental Conservation) and university labs to ensure cross calibration and efficiency. A stratified lake sampling approach was executed that considered size, trophic status, depth, access, path row (location), watershed, and practical logistic factors (e.g., safety, drive time). Target satellite overpasses (path row) were coordinated with strategic lakes while considering local weather patterns (clouds, wind, humidity) during overpasses in an attempt to obtain a high number of diverse and robust samples under quality (clear sky) conditions. At each lake, local conditions were assessed and a sample location representative of an approximate 3 × 3 Landsat pixel array was pursued to allow for linkage between the in situ and satellite remote sensing. Medium and larger lakes (>200 ha) with spatial variability had multiple samples from different bays, “open” water, or noteworthy locations (e.g., near a dam). A total of 305 unique observations from 79 different waterbodies across Maine, New Hampshire, and Vermont were obtained during July, August, and September. Conditions ranged from a small, 8.1-ha hypereutrophic pond (Showell Pond) to a large 127,000-ha lake (Champlain) with varying CHAB conditions across bays.
A well-calibrated YSI EXO-1 multi-parameter sonde was used to measure in situ cyanobacteria concentrations along with measurements of chlorophyll a, dissolved oxygen, fluorescent dissolved organic matter (FDOM) as a surrogate for chromophoric (colored) dissolved organic matter (CDOM), and a suite of other parameters (e.g., Secchi depth, temperature, total dissolved solids, conductivity). Instrument measurements focused on 30–50 cm depth or near surface, while periodic epilimnion, integrated tube, and vertical profiles were also collected. The “cyanobacteria” sensor measures phycocyanin pigments using in vivo fluorometry (IVF) in real time, detecting concentrations with a resolution of 1 cell/mL (0.1 relative fluorescence units/RFU). The instruments were calibrated and cross compared against extracted concentrations, standards, and other probes before sampling began each season, within season, and after season to correct for any potential drift. Since PC/cell can be variable depending on culture conditions, we also cross calibrated on PC pigment while assessing against cultured Microcystis. Periodic integrated tube samples, plankton tows, and enzyme-linked immunosorbent assays (ELISA) were executed to gauge vertical profile structure, enumerate taxa, assess toxicity at a subset of lakes, and ensure cross calibration between in situ and probe instruments. The sonde was ported to a handheld integration device to simultaneously record Global Positioning System data and instrument observations. A StellarNet Inc. bluewave® spectroradiometer was used to collect periodic in situ radiometric measurements following best practices (e.g., Torbick and Becker 2009). The handheld hyperspectral device measures the 350–1150 nm range using a 16-bit digitizer and holographic diffraction grating (600 g/mm) CCD with a signal-to-noise ratio of 1000-to-1.
The handheld hyperspectral measurements were used to help gauge conditions, spectral absorption characteristics, and qualitatively support preprocessing decisions.
Satellite Remote Sensing Mapping
A multi-year (see below) collection of in situ measurements was executed across 79 lakes of varying size and conditions, multiple path rows, and multiple time windows targeting Landsat 7 Enhanced Thematic Mapper Plus (ETM+) and Landsat 8 Operational Land Imager (OLI) overpasses. Landsat follows a Sun-synchronous orbit at an altitude of 705 km with a 16-day repeat window; the two satellites (Landsat 7 and 8) are offset to provide 8-day overpass repeats for a given footprint. These platforms capture observations in the visible (vis/0.45–0.69 μm), near-infrared (nir/0.75–0.90 μm), and shortwave-infrared (swir/1.55–1.75, 2.08–2.35 μm) at 30 m spatial resolution. Data were obtained as L1T from Earth Explorer with standard radiometric and geometric terrain corrections.
For inland lake mapping, a tradeoff is required considering number of samples, timing of overpass, atmosphere and weather conditions, and dynamics of CHABs. Longer temporal windows between in situ sampling and overpass provide more samples for modeling; however, longer intervals potentially present greater uncertainty concerning the stability of the conditions. This research followed precedent (i.e., Torbick et al. 2013; Lunetta et al. 2015) and used a ±2-day window. This resulted in 6 unique dates from 2014 (i.e., days of year 237, 238, 245, 246, 251, 260), 4 unique dates from 2015, and 4 from 2016. Preprocessing routines for atmospheric correction built upon previous efforts. In summary, the atmospheric correction routines tested included the use of MODIS Aerosol Optical Depth (Level 3 MOD08_D3) to drive the Second Simulation of the Satellite Signal in the Solar Spectrum (6S) radiative transfer model (Vermote et al. 1997) to generate water-leaving radiance and surface reflectance measurements (Ledapsm). We also compared these outcomes to preprocessing routines that followed Vanhellemont and Ruddick (2014) and Vanhellemont and Ruddick (2015) to generate water-leaving radiance (rhow) and reflectance (rhoam) measurements by using a SWIR-based correction approach to adjust for Rayleigh and aerosol scattering. Correcting for atmosphere, while challenging, has advantages for transferability and more robust mapping models. For clouds and shadows, Function of Mask (Zhu and Woodcock 2012), Automated Cloud Classification Algorithm (Masek et al. 2006), and Band Quality Assessment (BQA) were applied using Ledaps and L8SR, as required for ETM+ and OLI. Any cloud or shadow pixels, along with Scan Line Corrector (SLC) gaps, were treated as no data.
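The ±2-day matching rule described above can be illustrated with a minimal Python sketch. The function name `match_overpasses` and the dates are hypothetical and chosen only for illustration; the study's actual matching procedure is not specified beyond the window width.

```python
from datetime import date

def match_overpasses(sample_dates, overpass_dates, max_days=2):
    """Pair each in situ sample date with its nearest overpass,
    keeping the pair only if it lies within +/- max_days."""
    matches = {}
    for s in sample_dates:
        # Nearest overpass by absolute difference in days
        nearest = min(overpass_dates, key=lambda o: abs((o - s).days))
        if abs((nearest - s).days) <= max_days:
            matches[s] = nearest
    return matches

# Hypothetical dates, for illustration only
overpasses = [date(2014, 8, 25), date(2014, 9, 2)]
samples = [date(2014, 8, 26), date(2014, 8, 29), date(2014, 9, 4)]
matched = match_overpasses(samples, overpasses)
```

In this toy case the sample from 29 August is four days from either overpass and is dropped, which mirrors the tradeoff between sample count and stability of lake conditions.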
A spatiotemporal database was built linking the Landsat overpasses to the in situ measurements. Sampling points were buffered to represent a 3 × 3 Landsat pixel array (visible bands; 3 × 3 pixels = 90 m × 90 m). These sampling units were then intersected with an inward buffered lake vector boundary to ensure no coastline or mixed pixel problems. The mean value for these areas was used as this helps capture potential variability of positioning error in either the georegistration or sample location. Strategic semi-analytical regression models were examined using variables shown to have spectral relationships with inherent optical properties in previous studies. Strategic independent variables (i.e., bands and ratios) were systematically added and removed while examining statistical performance and residuals. Withheld, out-of-sample adjusted R², significance values, root mean square error (RMSE), and Akaike Information Criterion (AIC) were used to assess performance. The result of the satellite mapping was a map of phycocyanin concentration (i.e., CHAB exposure) for all lakes greater than 8 ha in northern New England.
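As a rough illustration of the 3 × 3 pixel averaging step, the sketch below takes the mean of a window around a sample pixel while skipping no-data cells. It assumes, for illustration only, that cloud, shadow, SLC-gap, and shoreline pixels have already been set to NaN in a NumPy array; `window_mean` is a hypothetical helper, not the study's code.

```python
import numpy as np

def window_mean(band, row, col, half=1):
    """Mean of the (2*half+1) x (2*half+1) window centred on (row, col),
    ignoring NaN (no-data) pixels; returns NaN if no valid pixel remains."""
    r0, r1 = max(row - half, 0), min(row + half + 1, band.shape[0])
    c0, c1 = max(col - half, 0), min(col + half + 1, band.shape[1])
    window = band[r0:r1, c0:c1]
    if np.all(np.isnan(window)):
        return float("nan")
    return float(np.nanmean(window))

# Toy 5 x 5 reflectance grid with one masked pixel
band = np.arange(25, dtype=float).reshape(5, 5)
band[1, 1] = np.nan  # e.g., a pixel flagged as cloud or shoreline
value = window_mean(band, 2, 2)
```

Averaging over the window rather than reading a single pixel dampens the effect of small georegistration or GPS positioning errors, as noted above.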
Statistical Modeling
There are several challenges in assessing the relationship between phycocyanin concentrations and ALS risk that go beyond the challenges in mapping it. For one, it is difficult to know exactly to what extent people are exposed to specific lakes. Thus, a common approach is to compare case residence locations to average exposures based on a certain proximity scale (Waller and Gotway 2004; Torbick et al. 2014). In a statistical framework, this comparison takes the form of a generalized linear regression model, which for our study is a Poisson regression that compares exposure levels to case counts adjusted for the density and demographic differences in the background population at risk (Diggle et al. 1998).
Ongoing deliberation within the science community exists on the model specifications that should be used for Poisson regressions of environmental public health data. On the one hand, a correct modeling approach needs to account for spatial dependence in the data; however, on the other hand, the accounting for spatial autocorrelation can be computationally prohibitive depending on the exact form and size of the data (Banerjee et al. 2004), and the addition of spatial random effects can create variance inflation in the exposure effect parameter, meaning significant exposures may appear insignificant (Reich et al. 2006; Hughes and Haran 2013; Hughes 2015). Furthermore, model outcomes depend on the shape and size of modeling units (Wall 2004; Waagepetersen 2004; Li et al. 2012), the choice of the background population dataset (Tatem et al. 2011), and the proximity scale chosen to average exposure estimates (Torbick et al. 2014). To address these challenges, we use Bayesian inference estimated using Integrated Nested Laplace Approximation (INLA) as implemented in the R package INLA (Rue et al. 2009) for 64 different Poisson log-linear models that vary each of the following components: the use of spatial random effects (two choices), the size of the geographic modeling units (two resolutions), the background population dataset (two choices), and the proximity scale for PC concentration exposure (eight scales). This modeling approach addresses all the potential complications mentioned above in a robust and transparent manner.
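The count of 64 models follows from crossing the four components (2 × 2 × 2 × 8). A short Python sketch makes the design explicit; the labels used for each option are placeholders introduced here, and the eight proximity scales are named generically because their actual names are given in Table 1.

```python
from itertools import product

# Placeholder labels for each model component (illustrative assumptions)
spatial_effects = ["none", "bym"]
grid_km = [4, 8]
population = ["SEDAC", "LandScan2000"]
pc_scales = [f"scale_{k}" for k in range(1, 9)]  # eight proximity scales

# One dictionary per model specification
model_grid = [
    {"spatial": s, "grid_km": g, "population": p, "pc_scale": c}
    for s, g, p, c in product(spatial_effects, grid_km, population, pc_scales)
]
n_models = len(model_grid)
```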
Half the models we executed contain no spatial random effects, and the other half include the spatial random effects proposed by Besag et al. (1991), which are a convolution of independent and spatially correlated random effects, where the spatially correlated random effects are an intrinsic conditional autoregression. The Poisson log-linear model with BYM random effects is a special case of a Log-Gaussian Cox Process (LGCP) (Møller et al. 1998) and is commonly used when data come in the form of case counts on areal units defined by administrative areas. One advantage of this specification is the savings in computational costs for Bayesian inference since a numerical approximation known as INLA can be used for very fast estimation of the posterior marginal distributions (Rue et al. 2009). However, since administrative areas are often not related to the disease in question and may vary greatly in size and shape, this model specification has been criticized for poor and/or unexpected results (Wall 2004; Li et al. 2012).
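For reference, the Poisson log-linear model with BYM random effects can be written out as below. This is a sketch under stated assumptions rather than the exact specification used: the symbols $\theta(x)$, $\beta_0$, $\beta_1$, $\mathrm{PC}(x)$, $u(x)$, $v(x)$, $\sigma_u^2$, and $\sigma_v^2$ are notation introduced here for illustration, and a single PC exposure covariate is assumed. $O(x)$ denotes the observed case count in areal unit $x$ and $E(x)$ the expected count that enters as an offset.

$$ O(x)\mid \theta(x) \sim \mathrm{Poisson}\big(E(x)\,\theta(x)\big), \qquad \log \theta(x) = \beta_0 + \beta_1\,\mathrm{PC}(x) + u(x) + v(x), $$

$$ v(x) \sim N\left(0, \sigma_v^2\right), \qquad u(x)\mid \{u(y): y \in \partial x\} \sim N\left(\frac{1}{m_x}\sum_{y\in \partial x} u(y),\ \frac{\sigma_u^2}{m_x}\right), $$

where $\partial x$ is the set of neighbors of unit $x$ and $m_x$ is the number of those neighbors. The independent term $v(x)$ and the intrinsic conditional autoregressive term $u(x)$ together form the convolution; the models without spatial random effects correspond to dropping both terms.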
Since our health data are available as point locations representing the residence of a disease case, we can avoid the negative effects by defining custom areal units similar in size, shape, and related to the disease. A regular lattice is one such set of custom areal units used (Li et al. 2012; Illian et al. 2012; Diggle et al. 2013) for which Waagepetersen (2004) has shown that when pixel sizes tend to zero, the approximate posterior expectations of the LGCP converge to exact posterior expectations. This means that the regular lattice provides an approximate continuous specification of disease intensity, which has the potential to incorporate data on environmental risk factors that are available at high spatial resolutions (Diggle et al. 2013). Despite the approximate continuous specification, a specific discretization is subject to the ecological fallacy (Waller and Gotway 2004; Diggle et al. 2013). Thus, using the regular lattice as our geographic modeling units, we compare two resolutions, a 4-km resolution and an 8-km resolution, which were chosen to balance computational efficiency with disease process spatial variability. Specifically, the 8-km resolution was chosen to account for spatial uncertainty in some case locations where only a town name was available and because the median square area of these towns was 66.3 km2. The 4-km resolution was chosen following a procedure outlined by Diggle et al. (2013) where a preliminary estimate of the disease process spatial variability was obtained via minimum contrast.
Adjusting the case counts in the 4 and 8 km regular grids for the density and demographic differences in the background population at risk is done by calculating two sets of expected counts based on gridded population products representing the region’s population at the year 2000, one provided by the Socioeconomic Data and Applications Center (SEDAC) and the other a product of Oak Ridge National Laboratory (LandScan 2000). Expected counts are then used as a fixed offset parameter in the Poisson log-linear models (see log(E) in Fig. 2). Expected counts were indirectly and internally standardized, which means they represent the number of cases expected in a specific pixel location assuming the population in that pixel location contracts ALS at the same rate as an internal standard population (Waller and Gotway 2004, chapter 2). Rates are age and sex specific following the age/sex classes defined in Noonan et al. (2005), and the internal standard population is the superpopulation containing all pixels within the study spatial extent. The age/sex specific rates were calculated as follows:
$$ {r}_i=\frac{\sum_{\mathrm{all}\ x}{O}_i(x)}{\sum_{\mathrm{all}\ x}{n}_i(x)}, $$
where $i = 1, 2, \dots, 12$ is the identifier for one of the 12 age/sex classes defined by Noonan et al. (2005), $x$ represents a pixel location, $O_i(x)$ is the number of observed ALS cases in the pixel $x$ having age/sex class $i$, and $n_i(x)$ is the population at risk in pixel $x$ having age/sex class $i$. Following the calculation of the age/sex specific rates, the standardized expected counts for each pixel $x$ are calculated as:
$$ E(x)=\sum_{\mathrm{all}\ i}{n}_i(x)\times {r}_i. $$
The $n_i(x)$ used for both $r_i$ and $E(x)$ are the population counts from either SEDAC or LandScan 2000. Each square pixel $x$ has a side length of either 4 or 8 km.
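The two formulas for the class-specific rates and the expected counts can be sketched in a few lines of NumPy. The array layout (classes in rows, pixels in columns), the function name `expected_counts`, and the toy numbers are assumptions made for illustration; the sketch also assumes every age/sex class has a nonzero total population.

```python
import numpy as np

def expected_counts(observed, population):
    """Indirect, internal standardization.
    observed, population: arrays of shape (n_classes, n_pixels)."""
    # r_i = sum over pixels of O_i(x) / sum over pixels of n_i(x)
    rates = observed.sum(axis=1) / population.sum(axis=1)
    # E(x) = sum over classes of n_i(x) * r_i
    expected = (population * rates[:, None]).sum(axis=0)
    return rates, expected

# Toy data: 2 age/sex classes, 3 pixels (hypothetical values)
observed = np.array([[1.0, 0.0, 1.0],
                     [0.0, 2.0, 0.0]])
population = np.array([[100.0, 200.0, 100.0],
                       [50.0, 100.0, 50.0]])
rates, expected = expected_counts(observed, population)
```

Because the standard population is internal, the expected counts sum to the total number of observed cases, which is a convenient sanity check on any implementation.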
For PC proximity scales, we consider a suite of lake area-weighted averages and maximums for a range of distances from the centroid of the modeling unit (4 or 8 km square pixel) and for watershed scales in which the modeling unit centroids fall. Table 1 gives the names and details for each scale used in our multi-scale analysis evaluating PC exposures.
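A distance-based proximity scale of this kind can be sketched as follows. The sketch assumes, for illustration only, that each lake is represented by a single point in projected kilometer coordinates with a surface area and a mapped PC concentration; the function `pc_exposure` and all values are hypothetical, and the study may instead intersect lake polygons with the buffer.

```python
import math

def pc_exposure(centroid, lakes, radius_km):
    """Lake area-weighted mean and maximum PC within radius_km of a centroid.
    lakes: iterable of (x_km, y_km, area_ha, pc_ug_per_l) tuples."""
    near = [
        (area, pc)
        for x, y, area, pc in lakes
        if math.hypot(x - centroid[0], y - centroid[1]) <= radius_km
    ]
    if not near:
        return None, None  # no lake within the search radius
    total_area = sum(area for area, _ in near)
    weighted_mean = sum(area * pc for area, pc in near) / total_area
    return weighted_mean, max(pc for _, pc in near)

# Hypothetical lakes around a modeling-unit centroid at the origin
lakes = [(1.0, 0.0, 10.0, 2.0), (0.0, 3.0, 30.0, 6.0), (20.0, 20.0, 500.0, 50.0)]
mean_pc, max_pc = pc_exposure((0.0, 0.0), lakes, radius_km=5.0)
```

The area weighting lets large lakes dominate the average, while the maximum captures the single worst nearby exposure regardless of lake size.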
Table 1 Proximity scales of phycocyanin concentration (μg/L) used in the ALS modeling study
The outcome of this modeling analysis is thus twofold. First, we seek to quantify the relationship between each of the spatial proximity scales of PC and ALS risk as well as compare differences in the scales. Second, we seek to quantify the impact of the fixed model components, i.e., the choice of background population, the choice of grid size, and the use of spatial random effects, on the estimate of the PC proximity metric’s effect on ALS risk and on the model’s fit as measured by the deviance information criterion (DIC) (Spiegelhalter et al. 2002) to ensure robust statistical analyses and address uncertainty.