Effect of uneven sampling along an environmental gradient on transfer-function performance
- First Online:
- Cite this article as:
- Telford, R.J. & Birks, H.J.B. J Paleolimnol (2011) 46: 99. doi:10.1007/s10933-011-9523-z
- 770 Downloads
We investigate the effect that uneven sampling of the environmental gradient has on transfer-function performance using simulated community data. We find that cross-validated estimates of the root mean squared error of prediction can be strongly biased if the observations are very unevenly distributed along the environmental gradient. This biased occurs because species optima are more precisely known (and more analogues are available) in the part of the gradient with most observations, hence estimates are most precise here, and compensate for the less precise estimates in the less well sampled parts of the gradient. We find that weighted averaging and the modern analogue technique are more sensitive to this problem than maximum likelihood, and suggest a way to remove the bias via a segment-wise RMSEP procedure.
KeywordsTransfer function Root mean square error of prediction Uneven sampling Bias Weighted averaging Maximum likelihood Modern analogue technique Palaeoenvironmental reconstructions
Transfer functions for quantitative reconstructions of environmental variables based on the relationship between species and the environment in a modern training set have been immensely useful tools in the palaeo-sciences. Despite this utility, and the effort spent generating such training sets, there has been little work attempting to optimise the design of training sets. Here we consider the impact of uneven sampling along the environmental gradient.
ter Braak and Looman (1986) demonstrated that the efficiency of weighted averaging (WA) for estimating species’ optima and tolerances approaches that of Gaussian logit regression only when the environmental gradient is evenly sampled. Poorly estimated WA optima are unlikely to give the most reliable reconstructions, so we predict that training sets with evenly sampled gradients should perform better than those with unevenly sampled gradients, and that this difference should be larger with WA than maximum likelihood regression and calibration which uses Gaussian logit regression.
Ginn et al. (2007) investigated the effect of uneven sampling of the environmental gradient on transfer function performance by taking a large training set and dropping observations from the more densely sampled parts of the gradient until the remaining observations were approximately evenly distributed along the gradient. Surprisingly, they found that the cross-validation performance statistics from the full data set and the uniform data set were similar.
We adopt an alternative strategy, using simulated community data to develop training sets for unevenly sampled environmental gradients, and testing the performance of different transfer function procedures by both cross-validation and with an evenly sampled independent test set.
Minchin (1987) introduced a method for simulating realistic looking community patterns along environmental gradients using generalised beta distributions to represent species response curves. We implement his method in the statistical language R version 2.11.1 (R Development Core Team 2010) to generate species distributions and simulated assemblages along environmental gradients.
We generated species response curves on three orthogonal environmental gradients, which should approximate the dimensionality of many data sets. The gradient which we hope to reconstruct was 100 units long; two secondary nuisance variables were 60 units long. Species optima for thirty simulated species were uniformly distributed along the environmental gradients, with their maximum abundances drawn from a uniform distribution. Both shape parameters of the beta distributions were set to four, which produces symmetrical, near-Gaussian responses. The range or niche width of each species was set to 200 units. From these response curves, counts of 300 individuals were simulated and relative abundances calculated.
Six diatom-pH training sets, with between 96 and 241 observations, were chosen to represent different degrees of unevenness along the gradient of interest (Fig. 1). For each of these six diatom-pH training-sets, we generated two simulated training sets, one that matches the distribution of sites along the pH gradient, rescaled to fill the range 0–100, and a second that contains as many observations, evenly distributed over the range 0–100. We also generated an independent test-set with 100 evenly distributed observations. For all data sets, the secondary gradients were uniformly sampled.
The length of the first detrended correspondence axis of the simulated data is about 3 SD units of compositions turnover. This is comparable with many diatom training sets (Korsman and Birks 1996; ter Braak and Juggins 1993).
The unevenness of the sample was quantified as the standard deviation of the number of sites in each tenth of the gradient. When comparing training sets with a different number of sites, this value was divided by the total number of sites.
Transfer functions were generated for each training set using weighted averaging with inverse deshrinking (WA; Birks et al. 1990); maximum likelihood regression and calibration (ML; ter Braak and Looman 1986); and the modern analogue technique (MAT; Prell 1985) using squared chord distances. We calculated MAT with three analogues as in a trial run this gave the best performance with the independent test set. Transfer functions were run using the rioja library version 0.5-6 (Juggins 2009). The performance of the training sets was assessed by the root mean square error of prediction (RMSEP), the correlation between the predicted and “observed” environmental variables (r²), and the absolute value of the maximum bias. Maximum bias was calculated by dividing the environmental gradient into ten equally spaced segments, calculating the mean of the residual for each segment, and taking the largest of these ten values. Maximum bias quantifies the tendency for the model to over- or under-estimate somewhere along the gradient (ter Braak and Juggins 1993). Performance was measured for both bootstrap (with 500 bootstrap replicates) and leave-one-out cross-validation, and with the independent test set. The performance of the unevenly sampled simulated training-sets is expressed relative to the performance of the evenly sampled training-sets with the same number of observations. This standardises the results, so different training sets can be compared. The results presented are the mean of 50 trials with different simulated species configurations.
To investigate the relationship between bias in the transfer function performance statistics and the unevenness of the distribution of observations along the environmental gradient in more detail, we took an unevenly sampled gradient and redistributed observations from over-sampled parts of the gradient to under-sampled parts of the gradient. We did this with the NE USA pH distribution, dividing the gradient into ten equal segments and deleting (adding) observations in segments where there are excess (insufficient) observations. For each trial, we added/deleted between 0 and 100% of the excess/insufficiency in each segment in 5% increments. For each of the generated gradients, we simulated assemblage data and estimated the RMSEP both by cross-validation and for an independent test set. We report the ratio of these RMSEPs. Each trial was repeated ten times with different simulated species configurations.
Bootstrap cross-validation results are essentially identical to the LOO results and are not shown.
The effect of uneven sampling of the environmental gradient on species optima, predicted by ter Braak and Looman (1986), has been noted, for example, by Cameron et al. (1999) who ascribed differences between the WA optima for taxa in the SWAP and AL:PE training sets to differences in the distribution of sites in the training set. The impact of uneven sampling on performance statistics of transfer functions has not previously been fully explored. Ginn et al. (2007) found no benefit from even sampling on the cross-validation performance (for one transfer function method they found the r2 to be marginally higher with the evenly distributed training set, but the RMSEP was worse for all methods).
This result was contrary to what Ginn et al. (2007) expected, but is explicable following our results. The cross-validation RMSEP for their unevenly sampled training set is biased, being lower than would be expected for an evenly distributed independent test-set. The cross-validation RMSEP of their evenly sampled training set is unbiased, and therefore fails to outperform the unevenly sampled training set. Interpretation of the results of Ginn et al. (2007) is complicated by the different number of sites in the full and the evenly sampled training sets.
The leave-one-out cross-validation RMSEP is biased because the part of the gradient with most observations is the part where the species optima are most precisely known (or where most available potential analogues are) and hence the part of the gradient where estimates are most reliable. This compensates for the greater uncertainty in the few observations in the less densely sampled parts of the gradient, whereas the evenly sampled independent test set tests all parts of the gradient equally. This bias implies that RMSEP estimated for unequally sampled gradients may be over-optimistic.
The cross-validation r² is lower for the most unevenly sampled training sets than the evenly sampled training set because many of the observation are in a restricted part of the range so their variance is poorly explained by the predictions. The independent test set has a lower r² when predicted with the unevenly sampled training set because the species optima from the under-sampled part of the gradient are less well estimated.
ML is less affected by uneven sampling than WA. Following the finding of ter Braak and Looman (1986) that ML is more efficient at estimating optima along unevenly sampled gradients, this is not surprising. However, test set performance with ML is not consistently better than with WA. This may be because ML is sensitive to over-dispersion in the species data (Telford et al. unpublished data). MAT has the greatest problems with an unevenly sampled gradient. Observations in the poorly sampled parts of the gradient lack sufficient good analogues.
Performance statistics of six diatom-pH training sets for three different methods
The under-estimation of the RMSEP by cross-validation for unevenly sampled gradients will only be a problem when trying to reconstruct the under-sampled parts of the gradient. For reconstructions from the over-sampled parts of the gradient, RMSEP may even be pessimistic.
Cross-validation underestimates the uncertainty in unevenly sampled gradients, although some unevenness is possible before the bias in RMSEP becomes large. Maximum likelihood is the most robust method, the modern analogue technique it the least robust. Calculating the RMSEP by segments can correct for this bias.
This work was supported by Norwegian Research Council projects ARCTREC and PES. This is publication no. A319 from the Bjerknes Centre for Climate Research.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.