Flint Drinking Water Crisis: A First Attempt to Model Geostatistically the Space-Time Distribution of Water Lead Levels

The drinking water contamination crisis in Flint, Michigan has attracted national attention since extreme levels of lead were recorded following a switch in water supply that resulted in water with high chloride and no corrosion inhibitor flowing through the aging Flint water distribution system. Since Flint returned to its original source of drinking water on October 16, 2015, the State has conducted eleven bi-weekly sampling rounds, resulting in the collection of 4,120 water samples at 819 “sentinel” sites. This chapter describes the first geostatistical analysis of these data and illustrates the multiple challenges associated with modeling the space-time distribution of water lead levels across the city. Issues include sampling bias and the large nugget effect and short range of spatial autocorrelation displayed by the semivariogram. Temporal trends were modeled using linear regression with service line material, house age, poverty level, and their interaction with census tracts as independent variables. Residuals were then interpolated using kriging with three types of non-separable space-time covariance models. Cross-validation demonstrated the limited benefit of accounting for secondary information in trend models and the poor quality of predictions at unsampled sites caused by substantial fluctuations over a few hundred meters. The main benefit is to fill gaps in sampled time series for which the generalized product-sum and sum-metric models outperformed the metric model that ignores the greater variation across space relative to time (zonal anisotropy). Future research should incorporate the large database assembled through voluntary sampling as close to 20,000 data, albeit collected under non-uniform conditions, are available at a much greater sampling density.


Introduction
including residential water tests, historical records, and city infrastructure data. Their analysis however ignored the spatial correlation among data and did not include a temporal component. A time trend analysis was conducted by Goovaerts (2017a) who used joinpoint regression to model time series of lead levels collected by the state-controlled and voluntary sampling programs. This analysis carried out at the city and ward levels still ignored the spatial correlation among data and did not provide any tax parcel-based prediction. A space-time analysis of these data should however provide important information to identify residences where high levels of lead are expected. It would also support any assessment of past and current lead exposures among the population at risk, particularly pregnant women and children.
Geostatistical techniques have been routinely used to analyze and map the spatial variability of soil and sediment lead concentrations Cattle et al. 2002;Solt et al. 2015), yet their application to lead in drinking water is far less common and mainly concerns groundwater quality (Siddique et al. 2012). A recent study (Wang et al. 2014) applied geographic information systems (GIS) and a hydraulic model of distribution systems to test the influences of pipe material, pipe age, water age, and other water quality parameters on lead/copper leaching in Raleigh (NC). In Symanski et al. (2004), mixed effect models were used to assess spatial fluctuations, temporal variability, and errors due to sampling and analysis for levels of disinfection by-products in water samples collected in households within the same distribution system. To the author's knowledge, the present study is however the first application of geostatistics to lead in drinking water within a distribution system. This chapter describes a new methodology to predict lead level in tap water, accounting for WLL measurements collected in neighboring houses, housing characteristics (e.g., age of the house or presence of lead pipes), and temporal trends (e.g., decline since return to pre-crisis source of drinking water). Linear regression was used to model temporal trends at sentinel sites, accounting for the composition of service line (SL), construction year, poverty level, and census tracts as covariates. Cross-validation analysis allowed one to assess the benefit of this approach and compare the results obtained using three different types of space-time covariance models. Both the cases of predicting unsampled times at monitored locations (i.e., filling gaps in time series) and making predictions at unsampled locations were investigated.

Datasets
4,150 WLL measurements recorded over the period 2/20/2016-11/20/2016 were downloaded from http://www.michigan.gov/flintwater (residential testing results). Data were then allocated to an individual tax parcel unit on the basis of their postal address. Data with incomplete address (two samples) or duplicates (e.g., samples taken from two different faucets on the same day in the same house) were discarded, leading to a total of 4,120 samples collected at 819 different sites; see Table 14.1. Because of their strongly positively skewed distribution (concentrations range from 0 to 5,986 μg/L) and large proportion of zero values (34.6%), data were transformed using the following formula Log 10 ðz + 1Þ.
Sentinel sites were initially selected from a pool of 1,951 volunteer sites identified during door-to-door water distribution; in particular it included all 156 sites with lead or lead combination service lines according to City records. Other sites were added according to several criteria: (i) spatial distribution to ensure coverage of all nine City wards, (ii) measurements of high blood levels (Hanna-Attisha et al. 2016), and (iii) environmental justice considerations (e.g. presence of houses with lead-based paint, minority population, and lower socio-economic households). This initial set evolved between sampling rounds as some residents stopped participating, while others asked to be included in the network (Goovaerts 2017b), which explains the fluctuation in the number of sampled sites during the first five rounds S1-S5: 607-621 (Table 14.1). Fewer sites (149-178) were then part of the "Extended Sentinel Site Program". Table 14.2 indicates that only 41 sites were sampled in all 11 rounds, while 80% of time series included five observations or less.
Each house selected to be part of the sentinel network was visited by a licensed plumber who classified the material of the service line coming into the home (i.e., customer-side service line) into six categories: lead, galvanized, copper, plastic, other, and unknown. Galvanized refers to iron pipe with a protective "galvanized" surface coating composed of zinc, lead, and cadmium, and therefore can be a long-term source of lead (Clark et al. 2015). The term "unknown" was used whenever the SL material could not be confirmed because, for example, the line was behind a wall or way back in a crawl space.
City records were the only source of service line data available for the majority of 56,039 tax parcels which were not part of the sentinel sampling program. These records are however inaccurate and lead to the over-identification of lead SLs, likely because old records were not updated as these lines were being replaced (Goovaerts 2017c). The same author found that construction year was a good predictor of service line material: galvanized lines were mostly found in pre-1934 houses, while the frequency of lead service lines (LSLs) peaked for houses built around World War II. This information was combined with field inspection data and city records to predict by indicator kriging the likelihood that a home has lead or galvanized SL (Goovaerts 2017c). Besides service lines, lead in drinking water mainly comes from lead-based solder and lead-containing plumbing fixtures (Lee et al. 1989;Cartier et al. 2011). Plumbing material is usually related to the installation year of a plumbing system, which can be approximated by the year of construction. For example, most faucets purchased prior to 1997 were made of brass or chrome-plated brass containing up to 8 percent lead (Rabin 2008). Construction year was retrieved from the 2016 Parcels GIS layer. The attribute "Year_built" was missing for 20,372 parcels and was estimated by ordinary kriging (Goovaerts 1997) with a mean absolute error of prediction of 6.43 years. Based on its relationship to water lead levels (Goovaerts 2017a), construction year was discretized into three classes : pre-1940, 1940-1959, and post-1959. Poor workmanship as well as lack of regular maintenance can also lead to more corrosion and leaching, and the presence of lead particulates, such as disintegrating brass or detaching pieces of old solder (Wang et al. 2014). Socio-economic status was here assessed using 2015 ACS (American Community Survey) 5-year estimates of the percentage of the block group population living in households where the income is less than or equal to twice the federal "poverty level".
There are many other variables known to influence lead in drinking water. For example, longer water age (i.e., water travel time between the treatment plant and home plumbing system) can decrease the effectiveness of corrosion control; increasing leaching and water lead levels (US EPA 2002;Wang et al. 2014). This information was however unavailable for this study.

Space-Time Kriging and Covariance Models
Let z(u α ;t) denote the water lead level recorded on time t at sentinel site α georeferenced by the geographical coordinates u α = (x α ,y α ) of the corresponding tax parcel centroid. Prediction of z-value at unsampled time t 0 and location u 0 was conducted using the following kriging estimator: n(t) is the number of observations recorded at time t, within the time window 2Δt, that were retained for estimation. The weights λ αt are solution of the following space-time (ST) kriging system: The parameter μ is a Lagrange multiplier accounting for the constraint on the weights. The term C u α − u β ; t − t ′ À Á is the ST covariance between any two observations recorded at locations u α and u β at times t and t ′ , respectively. Euclidian distances were used here since most lead in drinking water comes from premise plumbing materials and service lines instead of being transported through water mains (Del Toral et al. 2013;EET Inc. 2015).
One challenge associated with the application of ST kriging is the choice of a ST covariance model within the ever growing class of models (Montero et al. 2015). The following three non-separable ST covariance models were compared in the present study: • The generalized product-sum model (De Iaco et al. 2002): where k 1 , k 2 , and k 3 are non-negative (strictly positive for k 3 ) coefficients estimated from the sills of the spatial, temporal, and spatio-temporal semivariograms (De Cesare et al. 2002). • The metric model (Dimitrakopoulos and Luo 1994): where a normalized space-time distance measure is created by rescaling the spatial and temporal lags, h and τ, by the ranges of the spatial and temporal semivariograms, a s and a t (case of geometric anisotropy). • The sum-metric model (Heuvelink and Griffith 2010): This model combines characteristics of the two previous models: (i) sum of spatial and temporal covariances allowing for the presence of zonal anisotropies (i.e., semivariogram sills are not the same in all directions), and (ii) a metric ST model for the residual variability (geometric anisotropy).
Two other classes of non-separable ST covariance models, Cressie-Huang model (Cressie and Huang 1999) and Gneiting models (Gneiting 2002), were not considered because: (1) the fitting of these models needs a complex iterative parameter optimization technique (De Iaco 2010), whereas the three selected models can be fitted using straightforward techniques similar to those already used for spatial-only and temporal-only semivariograms, and (2) recent studies (Guo et al. 2015) indicate that these two more complex models provide similar fits to experimental ST semivariograms and comparable prediction accuracy as the product-sum model, confirming previous findings (De Iaco 2010).
The main difficulty in the practical implementation of the product-sum and sum-metric models is the inference of the sill of the ST semivariogram model, C st ð0Þ, which is most often estimated visually from the 3D plot of the experimental ST semivariogram γŝ t ðh, τÞ (e.g., De Cesare et al. 2002;Heuvelink and Griffith 2010). In order to make the fitting procedure more user-friendly, the space-time sill C st ð0Þ was here computed as the following weighted average of experimental space-time semivariogram values: where the weight w h, τ is the number of data pairs falling into the class of spatial and temporal lags ðh, τÞ. Only the classes where the ST semivariogram values exceed a critical sill g c , defined as the maximum of the spatial and temporal sills, were used.

Accounting for Secondary Information
Lead service lines are widely considered the main source of lead in drinking water (Lee et al. 1989;Clark et al. 2015). Another culprit is lead fixtures and pipes present within old houses (premises plumbing), and poverty can compound the problems through the lack of maintenance. Goovaerts (2017a) also found that temporal trends can vary greatly across the city. This secondary information was here incorporated in the definition of a stochastic trend model Mðu; tÞ, leading to the following decomposition of the space-time random function (RF) (Kyriakidis and Journel 1999): Zðu; tÞ = Mðu; tÞ + Rðu; tÞ ð 14:7Þ where Mðu; tÞ is a nonstationary spatiotemporal RF modeling the space-time distribution of the mean process, with E Mðu; tÞ ½ = mðu; tÞ and Rðu; tÞ is a zero mean stationary spatiotemporal RF modeling space-time fluctuations around Mðu; tÞ.

ð14:8Þ
This model naturally handles uneven spacing of repeated measurements within each time series, as well as their correlation which was modeled using a spherical variance-covariance structure. Once the trend model was fitted, regression residuals were interpolated using space-time simple kriging and the ST covariance models introduced in Sect. 14.2.2.

Cross-Validation
The accuracy of the predictive models created by the different approaches (e.g., three types of ST covariance models, univariate vs incorporation of secondary information) was assessed by cross-validation whereby each observation or time series (i.e., all data collected at the same site) was removed at a time and re-estimated using data collected at neighboring sentinel sites. The following performance criteria were then computed from n kriging estimates: • the mean error (ME) of prediction as: • the mean absolute error (MAE) of prediction as: • the mean square standardized residual (MSSR) as: where σ 2 K u a ; t ð Þ is the kriging variance.
A mean error close to zero indicates a lack of bias, while the mean absolute error should be as small as possible. If the actual estimation error is equal, on average, to the error predicted by the model, the MSSR statistic should be about one (Wackernagel 1998, p. 91).
One application of the predictive models is to prioritize any further sampling or intervention by ranking tax parcels from highly hazardous to less hazardous on the basis of kriging estimates. The ability of this ranking to identify successfully sites where WLL is greater or equal to the EPA action level of 15 μg/L was assessed using Receiver Operating Characteristics (ROC) curves which plot the probability of false positive versus the probability of detection (Swets 1988;Fawcett 2006;Goovaerts et al. 2016). The accuracy of the classification was quantified using the relative area under the ROC curve (AUC statistic), which ranges from 0 (worst case) to 1 (best case). The AUC is equivalent to the probability that the classifier will rank a randomly chosen positive instance (e.g., z c ≥ 15 μg ̸ L) higher than a randomly chosen negative instance (e.g., z c < 15 μg ̸ L). Figure 14.1a shows the location of all 819 sentinel sites within the nine wards in the city of Flint. Site-specific statistics such as number of observations and average log concentrations recorded for each time series, as well as composition of service line (GSL vs. LSL), were aggregated at the census tract level for better visualization. Geographical clusters of sentinel sites can be distinguished in several census tracts (e.g. border of wards 2 and 6, wards 7 and 9) which tend to be the tracts with the largest WLLs (Fig. 14.1c) and percentages of sampled LSLs (Fig. 14.1d). There is also a clear spatial trend with fewer lead service lines (e.g., none in Ward 1) and shorter time series (Fig. 14.1b) sampled in the Northern part of the city. Ward 5 includes the oldest neighborhood where GSLs are prevalent (Fig. 14.1e), while LSLs appear as small clusters, in particular in wards 6, 7 and 9 (Goovaerts 2017c).

Temporal Trend Modeling
Temporal trends for the three major types of service line were visualized by aggregating observations within non-overlapping 14-day windows, which corresponds to the average time interval between sampling rounds during the first phase (Round S) of the sentinel monitoring program (Table 14.1). Except for LSLs water lead levels do not appear to have declined over the 40-week sampling period; actually they seem to have slightly increased for GSLs (Fig. 14.2a). These results are however a direct artifact of the sampling strategy whereby 80% of sentinel sites were not sampled beyond week 16, while sampling continued at sites where the risk of exceeding the EPA action level of 15 μg/L was the greatest (Table 14.2).
After elimination of all sites where fewer than six observations were collected, the averaged time series display the expected decline (Fig. 14.2b). The impact is minimal for LSLs since most of these sites are considered at risk and were sampled during both the initial and extended sentinel sampling programs (Rounds S and X). The selection bias is stronger for copper and galvanized lines, which explains the larger water lead levels recorded during the first 16 weeks relative to LSLs.
This sampling bias complicated greatly the modeling of temporal trends by regression. Indeed using all the data would underestimate the weekly rate of decline of water lead levels, whereas subsetting the dataset (e.g., using only time series including more than five data points as in Fig. 14.2b) will result in overestimating the concentrations at a majority of sites. In addition, the time series length cannot be used as covariate in the model to allow its application at unmonitored locations. Two modeling strategies were considered in this chapter. First, because of its  (Fig. 14.1) census tract was used as covariate in the regression model (Eq. 14.8). The second more complicated approach was to allow the intercept to fluctuate among sentinel sites, even when located within the same tract; i.e., use a mixed model where the intercept is modeled as a random effect. The trade-off cost for this added flexibility was the need to estimate the intercept at unmonitored locations, which was accomplished using ordinary kriging. Despite providing a better fit than the first alternative, the mixed model did not lead to more accurate kriging estimates, hence only the first option is discussed hereafter.
All six interaction terms in the trend model (Eq. 14.8) were highly significant (α = 0.01). The correlation between predicted and observed WLL is however rather weak (r = 0.47), which illustrates the challenge of predicting spatial and temporal variations in lead for drinking water (Bailey and Russell 1981;Del Toral et al. 2013). While the output of the regression model provides a reasonable fit to the SL-specific time series computed using all the data (Fig. 14.2a), it underestimates water lead levels for LSL and GSL when using only time series including more than five data points (Fig. 14.2b).

Variography
Semivariograms helped quantifying the scale and magnitude of the space-time variability displayed by the maps and time series of Figs. 14.1 and 14.2. The spatial semivariogram (Fig. 14.3a) shows three nested scales of spatial variability: (1) a long range (2.35 km) caused by the neighborhood effect since houses in the same neighborhood tend to be built at the same time (i.e., similar plumbing system) and have similar water age, (2) a short range (200 m) corresponding to variability between adjacent houses, and (3) a nugget effect or discontinuity at the origin which represents the variability among samples taken within the same tax parcel (i.e. different apartments and/or measurement error for samples taken within the same residence). The substantial short-range variability (71% of total sill) likely reflects the heterogeneity in housing conditions (e.g., renovated houses) as well as the lack of uniformity of sampling conducted by homeowners since even with simple instructions it is difficult to ensure strict adherence to any sampling protocol (Del Toral et al. 2013). This interpretation is confirmed by the similar short-range variability displayed by the semivariogram of regression residuals (Fig. 14.3a, lower blue curve) since the regression model (Eq. 14.8) does not account for sampling characteristics. It is noteworthy that the longer range of 2.35 km is still fairly small relative to the size of the city (see legend of Fig. 14.1a), while the average separation distance between each sentinel site and the closest neighbor (293 m) exceeds the shortest range (200 m) that encapsulates 71% of the total spatial variability.
The temporal semivariogram (Fig. 14.3b) also displays three nested scales of variability although the longer range structure (110 days) represents here 53% of the total variability. Another difference with the spatial case is the overlap of temporal semivariograms for WLLs and regression residuals, illustrating the inability of the trend model (Eq. 14.8) to capture purely temporal changes. This result is in agreement with the small magnitude of changes displayed by the time series of predicted values in Fig. 14.2 (dashed line). Comparison of the total sills of spatial and temporal semivariograms (Fig. 14.3a-b) indicates that the variability observed across space is greater than the temporal variability. Such zonal anisotropy is in conflict with the assumption underlying the metric ST covariance model (Eq. 14.4). Figure 14.3c-d show the semivariograms computed using a normalized space-time distance (metric model). Because the spatial and temporal lags were rescaled using different constants for the WLL and residual semivariograms, these two curves are plotted separately. The vertical axis is however comparable and illustrates the smaller variability of residuals (i.e., lower sill for the semivariogram of Fig. 14.3d). Once again, both semivariograms display substantial short-range variability. The last two semivariograms (Fig. 14.3e-f) represent the metric space-time model that captures the residual variability in the sum-metric model (Eq. 14.5).

Cross-Validation Analysis
The semivariogram models of Fig. 14.3 were used to conduct a cross-validation analysis whereby one observation (LOO approach) or one time series (LTO approach) was removed at a time and re-estimated using data collected at neighboring sentinel sites. Based on a sensitivity analysis using ST ordinary kriging and MAE criterion, 48 observations with a maximum of three data points per site were retained for the estimation by univariate and residual ST kriging. Results obtained for predictions by the time trend model were also included as reference in Table 14.3.
The first three rows in Table 14.3 indicate that all algorithms give unbiased predictions (ME close to zero). As expected, the best prediction scores (i.e., lower MAE and higher AUC) are obtained when using data from the same time series (LOO approach) instead of relying solely on non-colocated data (LTO approach). Except for MSSR the product-sum model performs best, with the sum-metric model being a close second. The metric model underperforms the other two models because the combination of both spatial and temporal dimensions through a normalized space-time distance leads one to underestimate the correlation among observations of the same time series. In other words, the assumption underlying the ◀Fig. 14.3 Experimental semivariograms with the model fitted that were used to form the three types of ST covariance models (Eqs. 14.3-14.5) a spatial semivariogram (lower curve is for residuals), b temporal semivariogram, c metric semivariogram for WLLs, d metric semivariogram for regression residuals, e metric residual semivariogram (sum-metric model) for WLLs, f metric residual semivariogram for regression residuals metric model is incompatible with the zonal anisotropy detected on Fig. 14.3. Accounting for secondary information through residual kriging slightly improves the prediction relative to ST ordinary kriging; both kriging algorithms outperformed the trend model. These results however apply only to the narrow situation where exposure to lead in drinking water is reconstructed at the sole sentinel sites. For prediction at sites where no data was collected, LTO results indicate that differences between ST covariance models are much smaller as purely temporal correlations are not used in the kriging system. Nevertheless, the product-sum model still performs best. The LTO approach also emphasizes the benefit of using trend models that account for secondary information (i.e., larger differences between residual kriging and ordinary kriging). Yet, prediction performances actually deteriorate when kriged residuals are added to the trend model: the sole trend model gives better prediction than residual kriging. It is however noteworthy that the trend model was not cross-validated, hence the observation being predicted was used to create the model.  Because of the substantial short-scale spatial variability retaining increasingly distant data is expected to add more and more noise to the kriging estimate. This was investigated by changing the search strategy and selecting only sentinel sites located within a given distance of the site being predicted. If no data was located within the search radius, the kriged residual was zero and the residual kriging estimate was simply the value of the trend model. Figure 14.4 shows results of this sensitivity analysis conducted for the product-sum model over distances ranging from 50 m to 1 km. For the mean error of prediction the little benefit of residual kriging vanishes as soon as data beyond 100 m are used in the estimation (Fig. 14.4a), while this distance is 200 m for the area under the ROC curve ( Fig. 14.4b). Figure 14.4c indicates that 42% of sentinel sites have another sentinel site within 100 m, while this percentage is only 4.6% for tax parcels (Fig. 14.4c). In other words, there is little benefit in applying geostatistics to model the space-time distribution of WLL over the 56,039 tax parcels in Flint using the data collected at sentinel sites.

Conclusions
This chapter presented the first application of space-time geostatistics to lead levels recorded in drinking water of a public distribution system. The methodology was illustrated using 4,120 water samples that were collected at 819 "sentinel" sites over a 40-week period in the city of Flint. Despite a sizable database assembled by the State of Michigan, the geostatistical analysis was hampered by a temporal sampling bias and the existence of substantial variability over a few hundred meters. Unlike other countries such as Canada or France, sampling is not conducted by a trained technician in the US. Instead, homeowners are expected to collect water samples after a minimum of 6 h. of stagnation (e.g., overnight stagnation) following specific instructions (US EPA 2016), which can cause substantial variability among households. Other sources of fluctuation include heterogeneity in the plumbing system (e.g., renovation, installation of a new meter), location of sampled faucets (e.g., bathroom vs. kitchen), or water temperature (e.g., lead solubility increases with water temperature), to name a few.
In the present case-study, space-time kriging proved beneficial only in the situation where observations had been collected at the site being predicted; i.e., to fill the gaps in time series. The generalized product-sum and sum-metric space-time covariance models then outperformed the metric model that ignores the greater variation across space relative to time (zonal anisotropy). Sentinel sites represent however only 1.5% of tax parcels in the city of Flint. At unsampled sites the kriging prediction was no better than the temporal trend estimated by linear regression and it turned out to become less accurate if no data was collected within 100 meters. Although the regression model included site-specific characteristics, such as construction year and composition of service lines, it was unable to explain the short-range variability, leaving 78% of the total variance unaccounted for (R 2 = 22%).
In the future, several approaches will be investigated to tackle the impact of short-range variability on prediction. First, the data analyzed in this chapter represent less than 20% of the water samples available for the city of Flint. The majority of samples were collected by voluntary sampling whereby concerned citizens received a testing kit and conducted sampling on their own (Goovaerts 2017a, b). Despite the lack of periodic sampling in time and existence of temporal bias (e.g., houses with low lead levels were less likely to be tested again) the greater spatial coverage (i.e., more than 18% of tax parcels sampled) will reduce substantially the average distance between a tax parcel and the closest observation. However, spatial heterogeneity will likely still be present over short distances, leading one to question our ability to make prediction at the tax parcel level. More appropriate spatial supports for prediction could be census block groups which are statistical divisions of census tracts and are generally defined to contain between 600 and 3,000 people. The city of Flint includes 132 block groups and 40 census tracts. Such spatial aggregation or upscaling would be a way to filter between-household fluctuations which appears to be mainly noise. As more US cities are facing similar drinking water crisis, reliable techniques for sampling and modeling spatial and temporal changes in water lead levels will be sorely needed.  (2014) Using a GIS and GIS-assisted water quality model to analyze the deterministic factors for lead and copper corrosion in drinking water distribution systems. J Environ Eng 140:A4014004 Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.