
Detecting the impact area of BP deepwater horizon oil discharge: an analysis by time varying coefficient logistic models and boosted trees

Abstract

The Deepwater Horizon oil discharge in the Gulf of Mexico is considered to be one of the worst environmental disasters to date. The spread of the oil and its consequences had various environmental impacts. The National Oceanic and Atmospheric Administration (NOAA), in conjunction with the Environmental Protection Agency (EPA), the US Fish and Wildlife Service, and the American Statistical Association (ASA), has made available a few datasets containing information on the oil spill. In this paper, we analyzed four of these datasets to explore the use of applied statistics and machine learning methods for understanding the spread of the oil spill. In particular, we analyzed the “gliders, floats, boats” and “birds” data. The former contain various measurements on sea water, such as salinity and temperature, together with spatial location, depth and time. The latter contain information on the condition of birds, such as living status, oil condition, location and time. A varying-coefficients logistic regression was fitted to the birds data. The result indicated that the oil was spreading more quickly along the East–West direction. Analysis via boosted trees and logistic regression showed similar results based on the information provided by the above data.




Author information

Corresponding author

Correspondence to Tianxi Li.

Appendices


Model selection by predictive power

Model selection in birds analysis

We selected the final model by comparing the predictive power of the candidates, because the prediction bound was what we mainly relied on in the analysis. For the logistic regression model, we calculated the predicted probability at each validation (test) point \(x\) selected in Sect. 3.4, denoted as

$$\begin{aligned} \hat{p}(x) = \hat{P}_x(y=1) \end{aligned}$$

for all \(x\) in the validation set. To evaluate the predictive power, we used 10-fold cross-validation. In the \(k\)th iteration, let \(T_{(k)}\) denote the 10% validation set. From the predicted probabilities, we computed the log likelihood of \(T_{(k)}\) as

$$\begin{aligned} \log (\hat{P}(T_{(k)})) = \sum \limits _{x \in T_{(k)}}\log (\hat{P}_x(y=y(x))). \end{aligned}$$

Averaging over the 10 iterations, \(\sum _{k=1}^{10}\log (\hat{P}(T_{(k)}))/10\), gave the measure of predictive power for each model.
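The averaged held-out log likelihood can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the random fold assignment, and the placeholder `fit_base_rate` model (which ignores \(x\)) are hypothetical and are not the code or the model used in the paper. Any fitting routine that returns a function \(x \mapsto \hat{P}_x(y=1)\) could be passed in its place.

```python
import math
import random


def heldout_log_likelihood(probs, labels, eps=1e-12):
    """Sum of log P_hat_x(y = y(x)) over one held-out fold."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += math.log(p if y == 1 else 1.0 - p)
    return total


def cv_log_likelihood(xs, ys, fit, k=10, seed=0):
    """Average held-out log likelihood over k folds.

    `fit(train_xs, train_ys)` must return a function x -> P_hat(y = 1 | x).
    """
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)          # assumed: random fold assignment
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held = set(fold)
        train = [i for i in idx if i not in held]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        probs = [predict(xs[i]) for i in fold]
        scores.append(heldout_log_likelihood(probs, [ys[i] for i in fold]))
    return sum(scores) / k


def fit_base_rate(train_xs, train_ys):
    """Hypothetical placeholder model: predicts the training base rate."""
    rate = sum(train_ys) / len(train_ys)
    return lambda x: rate
```

Candidate models would then be compared by calling `cv_log_likelihood` once per model with the same folds (same `seed`) and preferring the larger value.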

We mainly considered four models: (a) the model in which the time-varying coefficients are constant (equivalent to a model that does not use time); (b) the model with \(f\) being a linear function of \(t\) for Longitude’s splines and constant for Latitude’s splines (this was the one we finally used in the analysis); (c) the model with \(f\) being a linear function of \(t\) for Latitude’s splines and constant for Longitude’s splines; and (d) the model with \(f\) being a linear function of \(t\) for both Latitude and Longitude. All four models had significant (or nearly significant) coefficients for all variables and passed the goodness-of-fit test. The average predictive log likelihoods of the models are given in Table 7, which shows that model (b) had the best predictive power, as claimed in the paper.

Table 7 10-fold cross-validation result
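One way to see how the four candidates differ is through their design matrices. A coefficient \(f(t) = a + bt\) applied to a spline basis value \(B\) contributes \(aB + b(tB)\), so making a coefficient linear in time amounts to adding a \(tB\) column. The sketch below illustrates this; the function name, the intercept column, and the basis values passed in are assumptions for illustration only, not the parameterisation used in the paper.

```python
def design_row(lon_basis, lat_basis, t, vary_lon, vary_lat):
    """Build one design-matrix row for a logistic model whose spline
    coefficients are either constant or linear in time t.

    lon_basis, lat_basis: spline basis values at one observation
    (assumed to be computed elsewhere).
    """
    row = [1.0]                                # intercept (assumed)
    row += list(lon_basis)                     # constant part, Longitude
    if vary_lon:
        row += [t * b for b in lon_basis]      # time-linear part, Longitude
    row += list(lat_basis)                     # constant part, Latitude
    if vary_lat:
        row += [t * b for b in lat_basis]      # time-linear part, Latitude
    return row


# (vary_lon, vary_lat) flags for candidate models (a)-(d)
MODELS = {
    "a": (False, False),
    "b": (True, False),
    "c": (False, True),
    "d": (True, True),
}
```

Under this reading, model (a) is nested in (b) and (c), and both are nested in (d); the models differ only in which \(tB\) columns are present.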

Model selection of boosted trees

We used 5-fold cross-validation to select the final boosting model. Figure 5 shows the CV curve for this process. The red curve shows the average absolute error on the held-out fifth of the data at each iteration, and the black curve shows the training error. We chose the model with the smallest CV error for the analysis.

Fig. 5

5-fold cross validation curve of the model fitting
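The selection rule can be sketched as follows, assuming the quantity being chosen is the number of boosting iterations and that a held-out error curve is available for each fold. Both are assumptions: the appendix does not say which tuning quantity the curve is indexed by, and `select_iterations` is a hypothetical helper, not part of any boosting package.

```python
def select_iterations(fold_errors):
    """Pick the number of boosting iterations with the smallest CV error.

    fold_errors[f][m] is the held-out mean absolute error of fold f
    after m + 1 boosting iterations.
    Returns (best number of iterations, fold-averaged CV curve).
    """
    n_folds = len(fold_errors)
    n_iter = len(fold_errors[0])
    cv_curve = [
        sum(fold[m] for fold in fold_errors) / n_folds
        for m in range(n_iter)
    ]
    best = min(range(n_iter), key=cv_curve.__getitem__)
    return best + 1, cv_curve
```

The training error usually keeps decreasing with more iterations, which is why the held-out curve rather than the training curve is used for the choice.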

Regression coefficients used in boosted trees

We used 1901 (location, time) points from the physical measurements in Sect. 3.3. Figure 6 shows histograms of the slope coefficients of all the corresponding regression models, \(\beta _{ij}\) for Temperature and \(\alpha _{ij}\) for Salinity.

Fig. 6

Histograms of all the slope coefficients \(\beta _{ij}\) and \(\alpha _{ij}\) of the Temperature (left) and Salinity (right) regressions, respectively
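For illustration, one slope per (location, time) point could be obtained with a simple least-squares fit, as sketched below. This appendix does not state the regressor used in Sect. 3.3; depth is a plausible candidate but is only an assumption here, and the function names and data layout are hypothetical.

```python
def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs (with an intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def slopes_by_point(profiles):
    """Compute one slope per (location, time) key.

    profiles maps a key to (regressor values, responses), for example
    (depths, temperatures); the regressor is an assumption of this sketch.
    """
    return {key: ols_slope(x, y) for key, (x, y) in profiles.items()}
```

The resulting values, one per key, are what a histogram such as the one in Fig. 6 would summarise.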

Choosing the validation radius in \(B(x,r,t)\)

Validation requires choosing the radius \(r\) in \(B(x,r,t)\). Because we used a permutation test for the hypothesis, there was no direct way to do a power analysis, so we chose \(r\) in an ad hoc way.

The choice of \(r\) involves a trade-off. A larger \(r\) yields more validation points, but includes too many points that are not close enough to the bird samples and thus gives a poor point estimate via local averaging. A smaller \(r\) tends to give a less biased point estimate, but may leave too few points for testing. Figure 7 shows the number of validation points as \(r\) increases. For small \(r\), the number of validation points is very small and increases slowly: in this range the neighborhood \(B(x,r,t)\) is too small for most bird samples to include any estimation points. As \(r\) becomes larger, the number increases more rapidly, as more and more remote points are included. There are two kinks in the curve. The first is around 0.3, which would give 12 validation points, far too few for testing. We therefore chose the second kink, ranging from 0.8 to 1.0, and took these values as the candidates for the validation radius.

Fig. 7

Number of validation points for different \(r\)
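A curve like the one in Fig. 7 can be produced by counting, for each candidate \(r\), the bird samples whose neighborhood contains at least one estimation point. The sketch below assumes that \(B(x,r,t)\) consists of the estimation points recorded at the same time \(t\) and within Euclidean distance \(r\) of \(x\) in longitude/latitude units. That definition, the function names, and the tuple layout are assumptions for illustration; the precise definition is given in the main text of the paper, not in this appendix.

```python
import math


def count_validation_points(birds, estimates, r):
    """Count bird samples whose neighborhood B(x, r, t) is non-empty.

    birds, estimates: lists of (lon, lat, t) tuples.
    Assumed: a point belongs to B(x, r, t) if it has the same t and
    lies within Euclidean distance r of x.
    """
    count = 0
    for bx, by, bt in birds:
        if any(
            et == bt and math.hypot(ex - bx, ey - by) <= r
            for ex, ey, et in estimates
        ):
            count += 1
    return count


def count_curve(birds, estimates, radii):
    """Number of validation points for each candidate radius."""
    return [(r, count_validation_points(birds, estimates, r)) for r in radii]
```

Plotting the output of `count_curve` against `radii` would show where the count starts to grow quickly, which is how the kinks discussed above could be located.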


About this article

Cite this article

Li, T., Gao, C., Xu, M. et al. Detecting the impact area of BP deepwater horizon oil discharge: an analysis by time varying coefficient logistic models and boosted trees. Comput Stat 29, 141–157 (2014). https://doi.org/10.1007/s00180-013-0449-y


Keywords

  • Varying-coefficients model
  • Logistic regression
  • Boosting trees
  • Oil impacts