Dealing with collinearity in behavioural and ecological data: model averaging and the problems of measurement error
DOI: 10.1007/s00265-010-1045-6
- Cite this article as:
- Freckleton, R.P. Behav Ecol Sociobiol (2011) 65: 91. doi:10.1007/s00265-010-1045-6
Abstract
There has been a great deal of recent discussion of the practice of regression analysis (or more generally, linear modelling) in behaviour and ecology. In this paper, I wish to highlight two factors that have been under-considered, collinearity and measurement error in predictors, as well as to consider what happens when both exist at the same time. I examine what the consequences are for conventional regression analysis (ordinary least squares, OLS) as well as model averaging methods, typified by information theoretic approaches based around Akaike’s information criterion. Collinearity causes variance inflation of estimated slopes in OLS analysis, as is well known. In the presence of collinearity, model averaging reduces this variance for predictors with weak effects, but can also lead to parameter bias. When collinearity is strong or when all predictors have strong effects, model averaging relies heavily on the full model including all predictors and hence the results from this and OLS are essentially the same. I highlight that it is not safe to simply eliminate collinear variables without due consideration of their likely independent effects, as this can lead to biases. Measurement error is also considered, and I show that it can lead to extreme biases when predictors are collinear and have strong effects but differ in their degree of measurement error. I highlight techniques for dealing with and diagnosing these problems. These results reinforce that automated model selection techniques should not be relied on in the analysis of complex multivariable datasets.
Keywords
Regression, Model selection, Information theory
Introduction
In many respects, the ‘gold standard’ in hypothesis testing in behaviour, ecology and evolutionary biology is the randomised experiment, in which factors of interest are manipulated over a range of values. When examining the effects of different factors simultaneously, randomised experiments allow the effects of each of the variables examined to be isolated and measured individually through fully factorial designs (e.g. Grafen and Hails 2002; Ruxton and Colgrave 2002). The framework for statistical testing of data from designed experiments is extremely comprehensive and sophisticated (Sokal and Rohlf 1995).
In many situations, however, experimental approaches cannot be used and alternative methods are required. For instance, long-term monitoring (e.g. Leigh and Johnston 1994) and comparative analyses of data across groups of species (e.g. Harvey and Pagel 1991) are examples of commonly employed approaches to data gathering that do not usually use experimental methods. In general, observational approaches use data that are gathered passively without manipulation, and rely on natural variation in the variables of interest. If the natural variation in the system is large enough, then statistical analyses can be used to examine the effects of factors of interest. Statistical analyses (usually regression analysis or linear modelling) are performed as if this variation had been created through experimental manipulation with the aim of determining underlying causal relationships.
The downsides of observational approaches are twofold. First, confounding variables may be responsible for generating observed patterns, which may lead to incorrect conclusions. For example, spatial or temporal autocorrelation (Haining 1990; Chatfield 1996), or phylogenetic non-independence (Felsenstein 1988; Harvey and Pagel 1991) are well-known confounding factors in the analysis of behavioural and ecological data.
The second problem is that in complex datasets with a range of predictors, there is frequent correlation between the predictors. For instance, in climate data from temperate regions it is often found that hot summer weather is accompanied by dry conditions, and hence rainfall is low when temperatures are high. Consequently, it is difficult to disentangle the effects of temperature and rainfall using data that are gathered under normal conditions. In regression analysis, collinearity of this sort among predictors can generate problems of analysis and interpretation. Thus, a variable of interest may correlate strongly with several predictors; however, if these predictors are correlated, the independent effects of each may be hard to disentangle (e.g. see Freckleton et al. 1998 for an example in a regression context). This problem is difficult to address effectively, and it is the subject of this paper.
One of the most straightforward ways to deal with collinear variables is to use a data reduction method such as principal components or factor analysis (Draper and Smith 1998; Quinn and Keough 2002). For instance, in the hypothetical example above, summer temperature and rainfall are both the product of prevailing weather conditions, and thus a single summary variable (e.g. the first principal component) may accurately represent the data. However, often the correlated variables may be expected to have independent effects: as a hypothetical example, plant growth increases with temperature, but decreases with decreasing rainfall. If temperature and rainfall are negatively correlated as outlined above, a single axis would not allow the countervailing effects of these two variables to be disentangled. An alternative approach uses diagnostics and adjustments based on propensity scores (Rosenbaum and Rubin 1983) to account for imbalances in observational studies when the assignment of observations amongst groups is non-random, potentially yielding biases akin to those resulting from collinearity in regression. For fitting regression models, other approaches exist such as ridge regression and various shrinkage techniques (Draper and Smith 1998; see below for model averaging based on Akaike’s information criterion (AIC)-information theoretic (IT), which is a shrinkage method). These improve parameter estimates, or estimates of variance, to account for collinearity when fitting single models. However, it is in fact the case (although not always widely appreciated) that least squares estimates of statistical model parameters are robust to moderate and even high levels of collinearity (Draper and Smith 1998). Estimates of parameter variance may, however, be very sensitive, which affects hypothesis tests.
The problem of collinearity is particularly an issue in model selection (e.g. Grafen and Hails 2002). In model selection, the aim is to find a model with the best fit to data with not too many parameters. However, if predictors are correlated, models with different predictors may have similar fits to data. This is a problem that is particularly important when using automated techniques, causing them to identify suboptimal models as the ‘best’.
For problems in model analysis in behaviour and ecology, interest has focussed on model averaging (Burnham and Anderson 1998, 2002; Rushton et al. 2004; Link and Barker 2006; Johnson and Omland 2004). Model averaging recognises that there are two forms of uncertainty in modelling. The first is the uncertainty in parameter estimates, for example measured by standard errors and confidence intervals for parameters. The second source of uncertainty is in the model: usually the ‘true’ model is unknown, and there is a probability that each candidate model is the ‘true’ model. This probability can be measured and incorporated as a source of uncertainty. There are many ways to perform model averaging; a recent comprehensive review is that of Claeskens and Hjort (2008). In ecology, the methods of Burnham and Anderson (1998, 2002) based on AIC have become widely used. Model averaging uses information on the fit of all models to data, not just the best-fitting model. The contribution of each model to the final analysis depends on its relative fit, with better fitting models playing a greater role than poor-fitting ones. However, the consequences of collinearity for model averaging methods are not clear.
Least squares methods should yield unbiased parameter estimates even in the presence of moderate amounts of collinearity. This is because least squares estimates can be shown to be the best linear unbiased estimators (BLUE) for a given model. It is possible to show with simulations (e.g. Freckleton 2002, below) that this means that ordinary multiple regression is relatively insensitive to collinearity, probably more than most researchers imagine. However, this result is dependent on the predictors being measured without error. If measurement error in predictors exists, biases result in parameter estimates with the bias increasing with the amount of measurement error (Carroll et al. 2006). If different levels of measurement error exist in collinear variables, this will affect the outcome. One might expect differences in measurement errors to be common: for instance, temperature will be proportionately much more accurately measured using a thermometer than rainfall is using a rain gauge.
In this paper, I wish to highlight two issues. First, ordinary least squares (OLS) and model-averaging methods can perform well in the presence of even quite high levels of collinearity, with model-averaging methods out-performing OLS approaches under certain conditions. Second, high or differing levels of collinearity between predictors, or differing measurement errors among them, alter this conclusion and yield problems for both methods: this is the real problem of collinearity in ecological data analysis. If measurement errors are appreciable, then these should be quantified and the possible effects can be modelled.
Methods
Linear model
The response variable y was modelled as a linear function of two predictors, x_{1} and x_{2} (Eq. 1):
\( y = a + {b_1}{x_1} + {b_2}{x_2} + {e_y} \)
The effects of the two predictors are assumed to be linear and are modelled by the parameters b_{1} and b_{2}. Without losing generality I set a = 0, but included this as an estimated parameter in the statistical models below (because it would be unknown in reality). e_{y} is an error term, where each observation has an associated error e_{i}. This is assumed to be normally distributed with zero mean and variance \( {\left( {1 - {b_1}} \right)^2} + {\left( {1 - {b_2}} \right)^2} \). This is a standard linear model, and the mathematical theory for such models is well developed. In the simulations described below, I define β = b_{2}/b_{1}, i.e. β is the ratio of the effect of x_{2} relative to that of x_{1}.
Collinearity
The asterisk denotes that the value of this predictor is unstandardised. x_{2} was standardised before entering into Eq. 1. r is the correlation between the two predictors, and e_{x} is normally distributed with zero mean and variance (1–r)^{2}.
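A minimal sketch of how a pair of predictors with a target correlation r could be generated. The generating equation is not reproduced in this text, so the form x2* = r·x1 + e_x used below is an assumption, as is the noise variance 1 − r² (chosen because it makes the population correlation equal to r; it differs from the variance expression as printed above). The function name `simulate_predictors` is hypothetical.

```python
import numpy as np

def simulate_predictors(n, r, rng):
    """Draw x1 ~ N(0, 1) and a second predictor correlated with it at level r."""
    x1 = rng.standard_normal(n)
    # Assumed generating equation: x2* = r * x1 + e_x.
    # Noise variance 1 - r**2 is an assumption that gives corr(x1, x2*) = r.
    e_x = np.sqrt(1.0 - r**2) * rng.standard_normal(n)
    x2_raw = r * x1 + e_x
    # Standardise x2 (zero mean, unit variance) before it enters the linear model.
    x2 = (x2_raw - x2_raw.mean()) / x2_raw.std(ddof=1)
    return x1, x2

rng = np.random.default_rng(1)
x1, x2 = simulate_predictors(n=100_000, r=0.8, rng=rng)
realised_r = np.corrcoef(x1, x2)[0, 1]
print(round(realised_r, 2))
```

A large n is used here only so that the realised correlation sits close to the target; with n = 100, as in the simulations, the sample correlation will scatter around r.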
Observation model
Model fitting
Two approaches to fitting models were contrasted. First a linear model including both predictors was fitted using OLS and estimated parameters were recorded from each simulation.
Second, an information–theoretic approach based on Akaike’s information criterion was used (Burnham and Anderson 1998, 2002; Burnham et al. 2010). I used the methods described by Burnham and Anderson (2002). The approach compares the fits of a suite of candidate models using AIC. The absolute size of the AIC is unimportant; instead, the difference in AIC values between models indicates the relative support for the different models. In order to compare models, I calculated an “Akaike weight”, w_{i} for each model. For a set of R models, the w_{i} sum to 1 and have a probabilistic interpretation: of these models, w_{i} is the probability that model i would be selected as the best fitting model if the data were collected again under identical circumstances.
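For reference, the Akaike weights described above are conventionally computed from AIC differences as follows (the standard form in Burnham and Anderson 2002; the symbol Δ_{i} is introduced here for illustration and is not taken from the text above):

\( {\Delta_i} = {\rm{AI}}{{\rm{C}}_i} - \mathop {\min }\limits_r {\rm{AI}}{{\rm{C}}_r} \)

\( {w_i} = \frac{{\exp \left( { - {\Delta_i}/2} \right)}}{{\sum\nolimits_{r = 1}^R {\exp \left( { - {\Delta_r}/2} \right)} }} \)

Because only the differences Δ_{i} enter, adding a constant to every AIC value leaves the weights unchanged, which is why the absolute size of the AIC is unimportant.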
Because the w_{i} are probabilities, it is possible to sum these for models containing given variables (Burnham and Anderson 2002). For instance, if one considers some variable k, one can calculate the sum of the Akaike weights of all the models including k, and this is the probability that of the variables considered, variable k would be in the best approximating model were the data collected again under identical circumstances.
- Model 1:
y = a
- Model 2:
\( y = a + {b_1}{x_1} \)
- Model 3:
\( y = a + {b_2}{x_2} \)
- Model 4:
\( y = a + {b_1}{x_1} + {b_2}{x_2} \)
In this example, the set of models was minimised; however in practice, the set may be expanded to consider interactions between variables, or more complex models containing additional predictors. A specific problem with doing this is that the problems of collinearity will be magnified by adding models including interactions between collinear variables, as the interaction term would necessarily also be highly collinear.
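The four-model comparison can be sketched in a few lines. This is an illustration under stated assumptions rather than the code used for the paper: AIC is computed in its least-squares form n·ln(RSS/n) + 2K, and a predictor's slope is counted as zero in models that omit it when averaging (an assumption about the averaging scheme). The helper names `fit_ols`, `aic_gaussian` and `model_average`, and the example data, are hypothetical.

```python
import numpy as np

def fit_ols(y, columns):
    """Least-squares fit of y on an intercept plus the given predictor columns."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ coef) ** 2))
    return coef, rss

def aic_gaussian(rss, n, n_coef):
    # Least-squares form of AIC; K counts the coefficients plus the error variance.
    return n * np.log(rss / n) + 2 * (n_coef + 1)

def model_average(y, x1, x2):
    data = {"x1": x1, "x2": x2}
    # Models 1-4: intercept only, x1 only, x2 only, both predictors.
    candidates = {1: (), 2: ("x1",), 3: ("x2",), 4: ("x1", "x2")}
    n = len(y)
    aic, slopes = {}, {}
    for m, terms in candidates.items():
        coef, rss = fit_ols(y, [data[t] for t in terms])
        aic[m] = aic_gaussian(rss, n, len(coef))
        slopes[m] = dict(zip(terms, coef[1:]))
    best = min(aic.values())
    raw = {m: np.exp(-0.5 * (a - best)) for m, a in aic.items()}
    total = sum(raw.values())
    weights = {m: v / total for m, v in raw.items()}
    # Assumed averaging rule: a slope is zero in any model that omits the predictor.
    averaged = {t: sum(weights[m] * slopes[m].get(t, 0.0) for m in candidates)
                for t in data}
    return weights, averaged

# Illustrative data only: x1 has an effect, x2 has none, predictors correlated at 0.5.
rng = np.random.default_rng(42)
n, b1, b2, r = 100, 0.5, 0.0, 0.5
x1 = rng.standard_normal(n)
x2 = r * x1 + np.sqrt(1 - r**2) * rng.standard_normal(n)
y = b1 * x1 + b2 * x2 + rng.standard_normal(n)  # unit error variance, for illustration
weights, averaged = model_average(y, x1, x2)
print(weights)
print(averaged)
```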
Simulations
I conducted a series of simulations to demonstrate the effects of collinearity and measurement error in the predictors on parameter estimates and their sampling variances. To examine the effect of collinearity, I set b_{1} at a value of 0.5, a moderate effect. I set n = 100 as this is a value typical of that used in many comparative analyses. The value of β was set at zero (no effect of predictor 2) or 1.5 (a stronger effect of predictor 2 than predictor 1). The correlation between the predictors, r, was then varied between 0 and 1. At each parameter combination, I conducted 10,000 simulations and in each case recorded the fitted parameter estimates, generated using the two methods described above.
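The OLS arm of this design can be sketched as below. It is a reduced illustration, not the original simulation code: it uses 2,000 replicates rather than 10,000 to keep the run short, and the generator for the correlated predictor (x2 = r·x1 plus noise with variance 1 − r²) is an assumption. The error variance for y follows the expression given in the linear model section. Function names are hypothetical.

```python
import numpy as np

def simulate_once(n, b1, b2, r, rng):
    x1 = rng.standard_normal(n)
    x2 = r * x1 + np.sqrt(1 - r**2) * rng.standard_normal(n)  # assumed generator
    x2 = (x2 - x2.mean()) / x2.std(ddof=1)                    # standardise x2
    sd_y = np.sqrt((1 - b1) ** 2 + (1 - b2) ** 2)             # error variance as stated in the text
    y = b1 * x1 + b2 * x2 + sd_y * rng.standard_normal(n)
    X = np.column_stack([np.ones(n), x1, x2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1], coef[2]

def sampling_summary(r, n_sims=2000, n=100, b1=0.5, beta=1.5, seed=0):
    """Mean and standard deviation of the OLS slope estimates across simulations."""
    rng = np.random.default_rng(seed)
    est = np.array([simulate_once(n, b1, beta * b1, r, rng) for _ in range(n_sims)])
    return est.mean(axis=0), est.std(axis=0, ddof=1)

results = {r: sampling_summary(r) for r in (0.0, 0.5, 0.9)}
for r, (mean, sd) in results.items():
    print(r, mean.round(2), sd.round(3))
```

Under these assumptions the mean estimates stay close to b_{1} = 0.5 and b_{2} = 0.75 at every r, while their spread grows as r increases.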
To illustrate the impacts of measurement error, I repeated the above simulations, but with measurement error added to predictor 1. I simulated error in this way as the aim was to demonstrate how the effect of unquantified error in one predictor could lead to mistaken inferences about the effects of other predictors. The standard deviation of the measurement error e_{1} was set at 0.5.
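A hedged sketch of this second set of simulations follows. The observation model is not reproduced in this text, so classical additive error (observed value = true value + normal noise) is assumed, as is the generator for the correlated predictor. The function name and the `noisy` argument (the index of the predictor measured with error, 0 for predictor 1) are hypothetical. x2 is left unstandardised for brevity, since its population variance is already 1.

```python
import numpy as np

def mean_slopes_with_error(r, noisy=0, sd_error=0.5, n=100,
                           b=(0.5, 0.75), n_sims=2000, seed=0):
    """Mean OLS slopes when the predictor at index `noisy` is observed with error."""
    rng = np.random.default_rng(seed)
    b = np.asarray(b)
    sd_y = np.sqrt(np.sum((1 - b) ** 2))  # error variance as stated in the text
    est = np.empty((n_sims, 2))
    for s in range(n_sims):
        x1 = rng.standard_normal(n)
        x2 = r * x1 + np.sqrt(1 - r**2) * rng.standard_normal(n)  # assumed generator
        x = np.column_stack([x1, x2])
        y = x @ b + sd_y * rng.standard_normal(n)  # y depends on the true values
        observed = x.copy()
        observed[:, noisy] += sd_error * rng.standard_normal(n)  # assumed additive error
        X = np.column_stack([np.ones(n), observed])
        est[s] = np.linalg.lstsq(X, y, rcond=None)[0][1:]
    return est.mean(axis=0)

slopes_r0 = mean_slopes_with_error(r=0.0)
slopes_r9 = mean_slopes_with_error(r=0.9)
print(slopes_r0.round(2), slopes_r9.round(2))
```

In this sketch the slope of the error-prone predictor is attenuated at every r, and as r increases part of its effect is picked up by the accurately measured predictor.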
Example data
In order to illustrate how model averaging and OLS methods might perform in a dataset containing collinearity, I performed an analysis of the “foxes” dataset from Grafen and Hails (2002). This is apparently a dataset on factors that influence overwinter survival in foxes, which is a function of average individual weight. Thirty groups were studied and data recorded on the size of the group, the availability of food, the area of each territory, as well as the average weight of foxes in each group. Two of the potential predictors, group size and food availability, are strongly correlated with each other (r = 0.88, P < 0.0001), so that collinearity is an issue in this dataset. Models were fitted and model-averaged parameter estimates generated as described above. Model-averaged estimates of parameter variances were generated using the formula in Burnham and Anderson (2002) and Anderson (2008).
Results
Collinearity and model averaging versus OLS
Figure 1d–f shows what happens to the parameter estimates using the AIC-IT approach in this example. For zero and moderate levels of correlation, the parameter estimates are unbiased and the sampling distributions can be narrower than for those obtained using the OLS approach (Fig. 1d, e). However, when the correlation between the predictors is strong, the parameter estimates become biased, with the effect of the weak predictor being over-estimated and that of the strong predictor under-estimated (Fig. 1f).
The difference between the performance of the methods is relatively straightforward to understand. When the effect of x_{1} is strong and that of x_{2} is weak, model averaging tends to give higher weight to model 2 and low weights to models 1, 3 and 4. The estimates of b_{2} in models 3 and 4 are downweighted (using the weighting scheme in Eq. 4) and consequently the estimates for this parameter are “shrunk” (see below for a discussion of shrinkage estimators) towards zero. The sampling variance for b_{2} is thus lowered relative to the other models, particularly relative to model 4, which is the model fitted by OLS. As the correlation between the predictors increases, a problem arises because it becomes increasingly difficult to distinguish between models 2 and 3. This results in bias in the two parameter estimates (Figs 1f and 2b) as weight is given to the incorrect model (model 3).
When the effects of both predictors are strong, a high weight is given to model 4, and the others given low weights. This means that the model used for estimation by the AIC-IT method is basically the same model that is fitted by OLS.
A final small, but potentially important, point to emerge from Fig. 2 concerns the apparently aberrant points in Fig. 2c, d. The point in question is the estimate of the effect of x_{1} when the two predictors are perfectly correlated (r = 1). At this point, because the two predictors are indistinguishable, only one parameter can be estimated. The estimate from OLS is ∼1.25 (the sum of b_{1} and b_{2}); for the AIC-IT method it is ∼0.8. In the case of OLS, it is easy to see what is happening: because x_{1} and x_{2} are the same, the effect of x_{1} is estimated to be the sum of the effects of the two variables. In general, it is easy to show that the slope estimated for a single predictor x_{1}, fitted on its own rather than together with a second correlated predictor x_{2}, is b_{1} + rb_{2}. The adverse consequences of this for practical regression analysis are discussed below.
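The result b_{1} + rb_{2} can be seen in one line, assuming both predictors have unit variance and the error term is uncorrelated with x_{1} (a sketch of the standard argument, not a derivation taken from the paper):

\( \frac{{{\rm{Cov}}\left( {{x_1},y} \right)}}{{{\rm{Var}}\left( {{x_1}} \right)}} = \frac{{{b_1}{\rm{Var}}\left( {{x_1}} \right) + {b_2}{\rm{Cov}}\left( {{x_1},{x_2}} \right)}}{{{\rm{Var}}\left( {{x_1}} \right)}} = {b_1} + r{b_2} \)

With b_{1} = 0.5 and b_{2} = 0.75 (β = 1.5), this gives 0.5 + 0.75 = 1.25 at r = 1, matching the OLS estimate quoted above, and 0.5 + 0.5 × 0.75 = 0.875 at r = 0.5.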
Example data: model-averaging versus OLS
Table 1 Analysis of a dataset containing multicollinearity using AIC-IT methods
Model | Intercept | se | Area | se | Food | se | Group Size | se | cAIC | w_{i} |
---|---|---|---|---|---|---|---|---|---|---|
0 | 4.59 | 0.12 | – | – | – | – | – | – | 64.86 | 0.00 |
1 | 4.41 | 0.49 | – | – | 0.25 | 0.68 | – | – | 67.19 | 0.00 |
2 | 5.05 | 0.37 | – | – | – | – | –0.12 | – | 65.46 | 0.00 |
3 | 4.31 | 0.42 | 0.10 | 0.14 | – | – | – | – | 66.82 | 0.00 |
4 | 3.98 | 0.39 | – | – | 4.51 | 1.11 | –0.66 | 0.15 | 53.77 | 0.46 |
5 | 4.44 | 0.49 | 0.22 | 0.29 | –0.70 | 1.42 | – | – | 69.22 | <0.001 |
6 | 4.49 | 0.35 | 0.66 | 0.19 | – | – | –0.47 | 0.13 | 56.78 | 0.10 |
7 | 4.00 | 0.38 | 0.35 | 0.22 | 3.20 | 1.35 | –0.69 | 0.15 | 53.86 | 0.44 |
Parameter estimate | 4.04 | 0.41 | 0.22 | 0.23 | 3.46 | 1.38 | –0.65 | 0.16 | | |
Probability | >0.99 | | 0.54 | | 0.90 | | 1.00 | | | |
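As a check on how the last two rows relate to the rest of the table, the sketch below recomputes Akaike weights from the tabulated cAIC values, sums them per predictor, and averages the slopes with absent predictors counted as zero. The zero-filling rule is an assumption; it is adopted here because it appears consistent with the tabulated averages. All numbers are transcribed from the table above and inherit its rounding.

```python
import numpy as np

# Values transcribed from the table above (model: cAIC and fitted slopes).
caic = {0: 64.86, 1: 67.19, 2: 65.46, 3: 66.82,
        4: 53.77, 5: 69.22, 6: 56.78, 7: 53.86}
slopes = {
    0: {},
    1: {"food": 0.25},
    2: {"group": -0.12},
    3: {"area": 0.10},
    4: {"food": 4.51, "group": -0.66},
    5: {"area": 0.22, "food": -0.70},
    6: {"area": 0.66, "group": -0.47},
    7: {"area": 0.35, "food": 3.20, "group": -0.69},
}

# Akaike weights from differences relative to the best (lowest) cAIC.
best = min(caic.values())
raw = {m: np.exp(-0.5 * (c - best)) for m, c in caic.items()}
total = sum(raw.values())
weights = {m: v / total for m, v in raw.items()}

variables = ("area", "food", "group")
# Summed weight of every model that contains each predictor.
importance = {v: sum(w for m, w in weights.items() if v in slopes[m])
              for v in variables}
# Assumed averaging rule: slope counted as zero where the predictor is absent.
averaged = {v: sum(weights[m] * slopes[m].get(v, 0.0) for m in slopes)
            for v in variables}

print({m: round(float(w), 2) for m, w in weights.items()})
print({v: round(float(p), 2) for v, p in importance.items()})
print({v: round(float(b), 2) for v, b in averaged.items()})
```

Models 4 and 7 together carry about 90% of the weight, which is why the averaged slopes sit so close to those of the full model (model 7).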
In this example, the OLS and model-averaged parameters are overall very similar indeed, despite the high degree of collinearity. In large part, this results from the overall strong effects of the predictors on the response variable. Given that we would expect the model-averaged parameters to behave in a more stable manner under such a high level of collinearity, the similarity of the results should lead us to conclude (1) that the results of the OLS model are not hugely biased by the underlying collinearity in the data; and (2) that model selection should be a useful way to proceed in this dataset, as the analysis indicates that the two best models contain the two collinear variables, but that the model containing the third (area) barely improves the relative fit to the data.
Measurement error in the face of collinearity
The effects of this bias become extremely important when collinearity between the variables exists. As shown in Fig. 4b, c, collinearity results in the effect of x_{2} being progressively underestimated, and the effect of x_{1} over-estimated. What is happening is that the measurement error in x_{2} results in under-estimation in the effect of this variable and, as the collinearity between the predictor increases, the effect of x_{2} is mis-attributed to x_{1}.
How does this relate to the example data and analysis summarised in Table 1? Unfortunately, measurement error is not estimated for these data. However, intuitively one would expect that in a detailed behavioural study, group size would be more accurately measured than food availability as the former is based on direct counts of animals over a long period. Thus, without further information, one hypothesis is that the difference in magnitude of the effects of the predictors (group size is estimated to have a stronger effect than food availability) could be a consequence of different levels of measurement errors in the two predictors.
Discussion
Linear modelling with multiple predictors will always be an important and commonly used technique in behavioural and ecological research. There has been a great deal of debate about how such analyses are best conducted (e.g. Burnham and Anderson 1998, 2002; Garcia-Berthou 2001; Freckleton 2002; Link and Barker 2006) with major shifts in what is considered best practice (Rushton et al. 2004). The main assumptions of such analyses are well known; however, the consequences of collinearity and measurement error for different techniques are only rarely examined. The key points I wish to make in this paper are that: (1) when using methods such as model averaging, there is a possibility that variance reduction by parameter shrinkage may yield incorrect results; (2) when measurement error and collinearity occur simultaneously, there are potentially severe problems for both estimation and hypothesis testing.
An important point to realise is that when all predictors have strong effects OLS and AIC-IT will be largely dependent on the same model. The methods differ most when predictors with weak effects are included: AIC-IT is the more efficient method for reducing the parameter estimates for predictors with weak or no effects, and this conclusion holds even in the face of weak to moderate collinearity. However, as noted in the previous paragraph, AIC-IT methods will yield biased parameter estimates when predictors are highly correlated, particularly when their effects are rather different.
Bias and variance
It is well known that there is a statistical trade-off between bias and variance in parameter estimates. This is exemplified in Fig. 1, in which the OLS parameter estimates are unbiased regardless of levels of collinearity; those from the model averaging method have lower variance, but begin to exhibit bias under high levels of collinearity. Conventional diagnostics such as variance inflation factors allow the extent of this effect to be estimated. The AIC-IT parameter estimates have a lower sampling variance for weak predictors because estimation is more heavily weighted to models that do not incorporate the weak predictors. The model-averaged parameter estimates (Eq. 4) are known as shrinkage estimators, because these have sampling distributions ‘shrunk’ back towards zero. This property of model-averaged parameters is discussed in detail by Burnham and Anderson (2002). The important point here is that under low to moderate levels of collinearity the AIC-IT estimates have lower variance than OLS and thus may be preferable.
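As a worked illustration of the variance inflation factor mentioned above, consider the two-predictor case, where the factor depends only on the correlation r between the predictors (a standard result, stated here as a sketch rather than taken from the paper):

\( {\rm{VIF}} = \frac{1}{{1 - {r^2}}} \)

At r = 0.9 this gives 1/(1 − 0.81) ≈ 5.3, so the sampling variance of each OLS slope is about five times what it would be with uncorrelated predictors, and the standard error is larger by a factor of about \( \sqrt {5.3} \approx 2.3 \). At r = 0.5 the factor is only 1/(1 − 0.25) ≈ 1.3. With more than two predictors, r^{2} is replaced by the R^{2} from regressing the predictor in question on all the others.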
What to do with collinear variables?
The usual advice with a pair of collinear variables is to combine them in some way or to eliminate one or the other (see introduction). The results above reveal that this may not always be the best course of action. In the analysis, it was noted that when x_{2} is absent from the model, the slope for the effect of x_{1} would be b_{1} + rb_{2}, where r is the correlation between x_{2} and x_{1}. Thus, the effect of x_{1} would be systematically over- or under-estimated, depending on the sign of the correlation between the predictors and the signs of the slopes. The consequence of this is that the way to deal with collinear variables will depend on their nature and interpretation. In the hypothetical example in the introduction, rainfall and temperature would be expected to be negatively correlated with each other because hot conditions are usually associated with low rainfall. Low rainfall depresses growth of plants, whereas high temperatures promote growth. In this instance, it would not make sense to combine the two variables, or to omit one of them, even if they are highly correlated. Doing so would risk under-estimating the effects of the included variable and mis-modelling the underlying determinants of growth. On the other hand, if the dataset contained both rainfall and soil moisture as predictors, these two variables are simply different ways of measuring the same quantity, i.e. the amount of water available. As a consequence, it would seem sensible to combine these, or just include one or the other.
If collinearity exists in a dataset, then there is no justification at all for using automated regression selection, such as stepwise regression. These techniques are widely criticised in the statistical literature and beyond because they result in biased parameter estimates with degenerate sampling distributions and a high probability of Type I errors (Chatfield 1996; Burnham and Anderson 1998, 2002; Whittingham et al. 2006). Collinearity reflects important structure in the data, which needs to be understood and dealt with explicitly. Although some authors have suggested ways in which stepwise methods may be amended to deal with such issues (e.g. Hegyi and Garamszegi 2010), such suggestions are computational kludges and ignore the wider problems (e.g. see Burnham et al. 2010). In most cases in which researchers believe they should be conducting selection (e.g. using stepwise methods as envisaged by Hegyi and Garamszegi 2010), they would probably be better advised to use a full model (Whittingham et al. 2006; Forstmeier and Schielzeth 2010), and the results here largely endorse that conclusion.
AIC-IT methods are generally robust to collinearity. However, some problems can arise when predictors are highly correlated, particularly when their effects are rather different. The most extreme problem arises when one predictor has a weak effect, but is strongly correlated with another which has a strong effect. This situation can easily arise in real data. In a behavioural ecological context, for example, Székely et al. (2004) found a strong correlation between sexual dimorphism in body size and dimorphism in bill size in shorebirds. However, when compared with other variables in the dataset, the underlying correlates of these two measures of size were rather different. If not recognised, this type of variation can lead to substantial bias in both parameters (e.g. see Fig. 1f). These methods should be used with caution in such cases. In the example discussed above, one pragmatic fix would have been to omit models 2 and 3 from consideration when x_{1} and x_{2} are highly correlated. The justification for doing this would be that the models are essentially indistinguishable. The relative fits of model 1 and model 4 would allow the effects of the two predictors to be measured together and contrasted with a model including neither and should yield unbiased parameter estimates. In this example, this would essentially be the same as conducting an OLS analysis.
Other techniques exist for analysing data which contain collinearity. Ridge regression is one such approach (e.g. Draper and Smith 1998). This method was developed to allow parameter estimation in cases where the collinearity between predictors is so extreme that the normal equations used to solve OLS problems contain a singular cross-product of the predictors. The method works by adding a parameter that modifies the normal equations to reduce the inflation of variance that results from collinearity. The resultant parameter estimates will be biased; however, this bias is traded off against reduced variance in parameter estimates. The downside of the method is that the choice of parameter is arbitrary (although there are diagnostics that can be employed). The technique is most useful in generating predictions from a given model, and is less useful for comparing a suite of models.
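A minimal sketch of the ridge idea, assuming centred predictors and response so that no intercept is needed. The ridge parameter (here `lam`) is added to the diagonal of the cross-product matrix, which makes the normal equations solvable even when two predictors are identical. The function name, the value lam = 1.0 and the example data are illustrative assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimate: solve (X'X + lam * I) b = X'y for centred X and y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(3)
n = 1000
x1 = rng.standard_normal(n)
X = np.column_stack([x1, x1.copy()])  # r = 1: two identical predictors
y = 0.5 * X[:, 0] + 0.75 * X[:, 1] + np.sqrt(0.3125) * rng.standard_normal(n)
X = X - X.mean(axis=0)
y = y - y.mean()

# The cross-product matrix has rank 1, so the ordinary normal equations are singular.
rank = np.linalg.matrix_rank(X.T @ X)
b_ridge = ridge_fit(X, y, lam=1.0)
print(rank, b_ridge.round(3))
```

With identical predictors the ridge solution splits the combined effect equally between them: the sum of the two slopes is close to b_{1} + b_{2}, but the individual values say nothing about the separate effects, which illustrates why the method suits prediction better than inference about individual predictors.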
Measurement error in predictors
Measurement error in predictors is almost never quantified or dealt with. This is despite the known issues with measurement error in regression models (Carroll et al. 2006) and the consequences for inference and estimation in practical analyses. In one of the few analyses to have addressed this, Linden and Knape (2009) showed that measurement error in predictors can result in the underestimation of environmental impacts on population dynamics, so the consequences for not estimating this error are demonstrably important.
In principle, if the measurement error in predictors has been quantified this can be dealt with. Several techniques exist for doing this, including simulation extrapolation (SIMEX; Cook and Stefanski 1994; Stefanski and Cook 1995), Bayesian methods (Fox and Glas 2003), multi-level models (Goldstein 1995), expectation maximization (Schafer 1987) or likelihood methods (Carroll et al. 1984). The technology exists with which to deal with measurement error; the limitation is that measurement errors in data are rarely quantified. The need to quantify measurement error has been emphasised in the ecological literature in the context of population modelling (Shenk et al. 1998; Ellner et al. 2002; Dennis et al. 2006; Freckleton et al. 2006; Linden and Knape 2009) and it is increasingly appreciated that this is an important component of variability in data that needs to be accounted for.
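To make one of these techniques concrete, the sketch below illustrates the SIMEX idea for a single predictor with a known measurement error standard deviation. Extra noise is added at increasing multiples of the error variance, the naive slope is re-estimated each time, and the trend is extrapolated back to the no-error case. The quadratic extrapolant, the grid of multipliers, the number of replicates and all names are assumptions made for illustration; this is not a full implementation of the published method.

```python
import numpy as np

def slope(x, y):
    """OLS slope of y on a single predictor x (with intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

def simex_slope(w, y, sd_error, lambdas=(0.0, 0.5, 1.0, 1.5, 2.0),
                n_rep=50, seed=0):
    rng = np.random.default_rng(seed)
    means = []
    for lam in lambdas:
        # Simulation step: total error variance becomes (1 + lam) * sd_error**2.
        reps = [slope(w + np.sqrt(lam) * sd_error * rng.standard_normal(len(w)), y)
                for _ in range(n_rep)]
        means.append(np.mean(reps))
    # Extrapolation step: quadratic in lam, evaluated at lam = -1 (no error).
    coeffs = np.polyfit(lambdas, means, deg=2)
    return float(np.polyval(coeffs, -1.0))

# Illustrative data: true slope 1.0, predictor observed with error SD 0.5.
rng = np.random.default_rng(7)
n = 5000
x = rng.standard_normal(n)
w = x + 0.5 * rng.standard_normal(n)   # observed, error-prone predictor
y = 1.0 * x + 0.5 * rng.standard_normal(n)

naive = slope(w, y)
corrected = simex_slope(w, y, sd_error=0.5)
print(round(float(naive), 2), round(corrected, 2))
```

The quadratic extrapolation removes most, but not all, of the attenuation in this setting, and the correction is only as good as the assumed error standard deviation, which is the quantity that is rarely available in practice.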
The simulations reported in Figs 4 and 5 were designed to illustrate that if measurement error differs between collinear predictors, then the consequences for estimation and inference can be particularly severe. If predictors are correlated with each other, but have different effects on the response variable, then differences in the degree of measurement error can lead to extremely biased results. This is by no means a contrived situation: for instance, the example of plant growth given above is one in which two strongly negatively correlated drivers (temperature and rainfall) can have contrasting effects for the same underlying process (hot weather leads to a positive effect of temperature and a negative effect of rainfall). The simulation results indicate that if this is the case, then the resultant models can yield incorrect parameter estimates, the inference based on those parameters is wrong, and parameters have low sampling variance. Because of measurement error, the underlying correlation between the predictors would not be identified and the likely problem not diagnosed.
Dealing with this issue is difficult, and obviously impossible if measurement error has not been quantified. The main recommendation is that, assuming error has been estimated, one should proceed with caution if the relative level of error differs greatly between predictors, even if the level of collinearity is low. This is because measurement error can mask correlations between the variables and because differences in the level of error will be manifest in the relative values of slopes for the predictors.
Conclusions
Linear models and multiple regressions are extremely powerful tools, especially when combined with large datasets. The downside is that frequently data are generated non-experimentally and we are reliant on natural variation in observational data. The price to pay is that frequently we do not understand the structure of the data, and that correlations between variables and their error structure can complicate analyses. The statistical tools exist to deal with such complexity. However, we need to be aware of the possible pitfalls and of how they can be diagnosed.
Acknowledgements
The author is funded by a Royal Society University Research Fellowship. Thanks to Tom Webb, the referees and László Garamszegi for comments on an earlier version of the MS.