Variance, Analysis Of
Abstract
Analysis of variance (ANOVA) is a statistical procedure for summarizing a classical linear model – a decomposition of the sum of squares into a component for each source of variation in the model – along with an associated test (the F-test) of the hypothesis that any given source of variation in the model is zero. More generally, the variance decomposition in ANOVA can be extended to obtain inference for the variances of batches of parameters (sources of variation) in multilevel regressions. ANOVA is a useful addition to regression in that it structures inferences about batches of parameters.
Keywords
Analysis of variance (ANOVA); Balanced and unbalanced data; Bayesian inference; Classical linear models; Classical method of moments; Contrast analysis; Experimental economics; Finite-population standard deviation; Fixed effects and random effects; Generalized linear models; Linear models; Linear regression; Multilevel models; Nonexchangeable models; Probability; Super-population standard deviation; Variance decomposition
Introduction
Analysis of variance (ANOVA) represents a set of models that can be fit to data, and also a set of methods for summarizing an existing fitted model. We first consider ANOVA as it applies to classical linear models (the context for which it was originally devised; Fisher 1925) and then discuss how ANOVA has been extended to generalized linear models and multilevel models. Analysis of variance is particularly effective for analysing highly structured experimental data (in agriculture, multiple treatments applied to different batches of animals or crops; in psychology, multi-factorial experiments manipulating several independent experimental conditions and applied to groups of people; industrial experiments in which multiple factors can be altered at different times and in different locations).
At the end of this article, we compare ANOVA with simple linear regression.
Analysis of Variance for Classical Linear Models
ANOVA as a Family of Statistical Methods
When formulated as a statistical model, analysis of variance refers to an additive decomposition of data into a grand mean, main effects, possible interactions and an error term. For example, Gawron et al. (2003) describe a flight-simulator experiment that we summarize as a 5 × 8 array of measurements under five treatment conditions and eight different airports. The corresponding two-way ANOVA model is y_{ij} = μ + α_{i} + β_{j} + ε_{ij}. The data as described here have no replication, and so the two-way interaction becomes part of the error term. (If, for example, each treatment × airport condition were replicated three times, then the 120 data points could be modelled as y_{ijk} = μ + α_{i} + β_{j} + γ_{ij} + ε_{ijk}, with two sets of main effects, a two-way interaction, and an error term.)
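As a minimal sketch of this additive decomposition, the following Python code splits a complete two-way table with one observation per cell into a grand mean, row effects, column effects and residuals. The function name `two_way_decompose` and the 5 × 8 table `y` are hypothetical: the numbers are generated from a fixed rule purely for illustration and are not the flight-simulator data.

```python
def two_way_decompose(y):
    """Split a complete I x J table into mu, row effects, column effects, residuals."""
    n_rows, n_cols = len(y), len(y[0])
    mu = sum(sum(row) for row in y) / (n_rows * n_cols)
    # Main effects are deviations of row and column means from the grand mean.
    alpha = [sum(row) / n_cols - mu for row in y]
    beta = [sum(y[i][j] for i in range(n_rows)) / n_rows - mu
            for j in range(n_cols)]
    # With one observation per cell, any interaction is absorbed into the residual.
    resid = [[y[i][j] - mu - alpha[i] - beta[j] for j in range(n_cols)]
             for i in range(n_rows)]
    return mu, alpha, beta, resid


# Hypothetical 5 x 8 table (made-up values, not the data of Gawron et al.).
y = [[0.1 * i + 0.3 * j + 0.05 * ((3 * i + 5 * j) % 7) for j in range(8)]
     for i in range(5)]
mu, alpha, beta, resid = two_way_decompose(y)
```

By construction the estimated row effects sum to zero, the column effects sum to zero, and every row and column of the residual array sums to zero.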
The classical ANOVA table for these data (Table 1) has the following elements:
1. For each source of variation, the degrees of freedom represent the number of effects at that level, minus the number of constraints (the five treatment effects sum to zero, the eight airport effects sum to zero, and each row and column of the 40 residuals sums to zero).
2. The total sum of squares, \( \sum_{i=1}^{5}\sum_{j=1}^{8}{\left(y_{ij}-\bar{y}_{\cdot\cdot}\right)}^2 \), is 0.078 + 3.944 + 1.417 = 5.439. It decomposes into these three terms, corresponding to the variance described by treatment, the variance described by airport, and the residuals.
3. The mean square for each row is the sum of squares divided by the degrees of freedom. Under the null hypothesis of zero row and column effects, their mean squares would, in expectation, simply equal the mean square of the residuals.
4. The F-ratio for each row (except for the last) is the mean square, divided by the residual mean square. This ratio should be approximately 1 (in expectation) if the corresponding effects are zero; otherwise we would generally expect the F-ratio to exceed 1. We would expect the F-ratio to be less than 1 only in unusual models with negative within-group correlations (for example, if the data y have been renormalized in some way, and this had not been accounted for in the data analysis).
5. The p-value gives the statistical significance of the F-ratio with reference to the \( F_{\nu_1,\nu_2} \) distribution, where ν_{1} and ν_{2} are the numerator and denominator degrees of freedom, respectively. (Thus, the two F-ratios in Table 1 are being compared to F_{4,28} and F_{7,28} distributions, respectively.) In this example, the treatment mean square is lower than expected (an F-ratio of less than 1), but the difference from 1 is not statistically significant (a p-value of 82%), hence it is reasonable to judge this difference as explainable by chance, and consistent with zero treatment effects. The airport mean square is much higher than would be expected by chance, with an F-ratio that is highly statistically significantly larger than 1; hence we can confidently reject the hypothesis of zero airport effects.
Table 1 Classical two-way analysis of variance for data on five treatments and eight airports with no replication
Source | Degrees of freedom | Sum of squares | Mean square | F-ratio | p-value |
---|---|---|---|---|---|
Treatment | 4 | 0.078 | 0.020 | 0.39 | 0.816 |
Airport | 7 | 3.944 | 0.563 | 11.13 | <0.001 |
Residual | 28 | 1.417 | 0.051 | | |
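The arithmetic behind the mean-square and F-ratio columns can be checked with a short sketch. The degrees of freedom and sums of squares are copied from Table 1; the function name `complete_anova_table` is hypothetical. The p-values are not recomputed here, since they require the cumulative distribution function of the F distribution.

```python
# Degrees of freedom and sums of squares copied from Table 1.
table = {
    "Treatment": {"df": 4, "ss": 0.078},
    "Airport": {"df": 7, "ss": 3.944},
    "Residual": {"df": 28, "ss": 1.417},
}


def complete_anova_table(table, error="Residual"):
    """Add a mean square to every row and an F-ratio to every non-error row."""
    out = {}
    for source, row in table.items():
        # Mean square = sum of squares / degrees of freedom.
        out[source] = dict(row, ms=row["ss"] / row["df"])
    for source, row in out.items():
        if source != error:
            # F-ratio = mean square / residual mean square.
            row["f"] = row["ms"] / out[error]["ms"]
    return out


anova = complete_anova_table(table)
total_ss = sum(row["ss"] for row in table.values())
total_df = sum(row["df"] for row in table.values())  # 40 observations minus 1
```

Up to rounding of the published sums of squares, the resulting F-ratios agree with the 0.39 and 11.13 reported in the table.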
More complicated designs have correspondingly complicated ANOVA models, and complexities arise with multiple error terms. We do not intend to explain such hierarchical designs and analyses here, but we wish to alert the reader to such complications. Textbooks such as Snedecor and Cochran (1989) and Kirk (1995) provide examples of analysis of variance for a wide range of designs.
ANOVA to Summarize a Model That Has Already Been Fitted
We have just demonstrated ANOVA as a method of analysing highly structured data by decomposing variance into different sources, and comparing the explained variance at each level with what would be expected by chance alone. Any classical analysis of variance corresponds to a linear model (that is, a regression model, possibly with multiple error terms); conversely, ANOVA tools can be used to summarize an existing linear model.
The key is the idea of ‘sources of variation’, each of which corresponds to a batch of coefficients in a regression. Thus, with the model y = Xβ + ε, the columns of X can often be batched in a reasonable way (for example, in Table 1, a constant term, four treatment indicators, and seven airport indicators) and the mean squares and F-tests then provide information about the amount of variance explained by each batch.
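To make the batching concrete, here is a sketch that builds an indicator design matrix for the 5 × 8 layout and records which columns belong to which source of variation. The function name `indicator_design` and the choice to drop the first level of each factor are assumptions made for illustration; other codings (for example, sum-to-zero contrasts) give the same batches.

```python
def indicator_design(n_treat=5, n_airport=8):
    """Build X for a complete two-way layout, dropping one level per factor."""
    rows = []
    for i in range(n_treat):
        for j in range(n_airport):
            treat = [1.0 if i == k else 0.0 for k in range(1, n_treat)]
            airport = [1.0 if j == k else 0.0 for k in range(1, n_airport)]
            rows.append([1.0] + treat + airport)
    # Map each source of variation to the column indices of its coefficients.
    batches = {
        "constant": [0],
        "treatment": list(range(1, n_treat)),
        "airport": list(range(n_treat, n_treat + n_airport - 1)),
    }
    return rows, batches


X, batches = indicator_design()
# Residual degrees of freedom: observations minus fitted coefficients.
residual_df = len(X) - len(X[0])
```

The batch sizes (1, 4 and 7 columns) and the 28 residual degrees of freedom match the degrees-of-freedom column of Table 1.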
Such models could be fitted without any reference to ANOVA, but ANOVA tools could then be used to make some sense of the fitted models, and to test hypotheses about batches of coefficients.
Balanced and Unbalanced Data
In general, the amount of variance explained by a batch of predictors in a regression depends on which other variables have already been included in the model. With balanced data, however, in which all groups have the same number of observations (for example, each treatment applied exactly eight times, and each airport used for exactly five observations), the variance decomposition does not depend on the order in which the variables are entered. ANOVA is thus particularly easy to interpret with balanced data. The analysis of variance can also be applied to unbalanced data, but then the sums of squares, mean squares, and F-ratios will depend on the order in which the sources of variation are considered.
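The order dependence can be seen in a small numerical sketch. The code below fits least-squares models with two binary factors and compares the sum of squares attributed to factor A when it is entered before and after factor B. Everything here is hypothetical: the data are made up (effects of 1 and 2 plus small fixed perturbations), and `rss`, `make_data` and `ss_for_a` are illustrative helper names.

```python
def rss(columns, y):
    """Residual sum of squares from a least-squares fit of y on the columns."""
    p, n = len(columns), len(y)
    # Augmented normal equations [X'X | X'y].
    m = [[sum(columns[i][k] * columns[j][k] for k in range(n)) for j in range(p)]
         + [sum(columns[i][k] * y[k] for k in range(n))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(p):
            if r != c:
                factor = m[r][c] / m[c][c]
                m[r] = [x - factor * z for x, z in zip(m[r], m[c])]
    coef = [m[i][p] / m[i][i] for i in range(p)]
    fitted = [sum(b * col[k] for b, col in zip(coef, columns)) for k in range(n)]
    return sum((yk - fk) ** 2 for yk, fk in zip(y, fitted))


def make_data(cell_counts):
    """Two binary factors with effects 1 and 2 plus small fixed perturbations."""
    a, b, y = [], [], []
    for (ai, bi), count in cell_counts.items():
        for _ in range(count):
            wiggle = 0.1 * ((7 * len(y)) % 5 - 2)
            a.append(float(ai))
            b.append(float(bi))
            y.append(1.0 * ai + 2.0 * bi + wiggle)
    return a, b, y


def ss_for_a(a, b, y):
    """Sum of squares for factor A when entered before B and after B."""
    one = [1.0] * len(y)
    entered_first = rss([one], y) - rss([one, a], y)
    entered_last = rss([one, b], y) - rss([one, a, b], y)
    return entered_first, entered_last


balanced = ss_for_a(*make_data({(0, 0): 2, (0, 1): 2, (1, 0): 2, (1, 1): 2}))
unbalanced = ss_for_a(*make_data({(0, 0): 3, (0, 1): 1, (1, 0): 1, (1, 1): 3}))
```

With equal cell counts the two entries of `balanced` agree up to floating-point error; with unequal cell counts the two entries of `unbalanced` differ, because factor A entered first also absorbs part of the variation due to factor B.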
ANOVA for More General Models
Analysis of variance represents a way of summarizing regressions with large numbers of predictors that can be arranged in batches, and a way of testing hypotheses about batches of coefficients. Both these ideas can be applied in settings more general than linear models with balanced data.
F-tests
In a classical balanced design (as in the example in Table 1), each F-ratio compares a particular batch of effects to zero, testing the hypothesis that this particular source of variation is not necessary to fit the data.
More generally, the F-test can compare two nested models, testing the hypothesis that the smaller model fits the data adequately (so that the larger model is unnecessary). In a linear model, the F-ratio is \( \frac{\left({\mathrm{SS}}_2-{\mathrm{SS}}_1\right)/\left({\mathrm{df}}_2-{\mathrm{df}}_1\right)}{{\mathrm{SS}}_1/{\mathrm{df}}_1} \), where SS_{1}, df_{1} and SS_{2}, df_{2} are the residual sums of squares and degrees of freedom from fitting the larger and smaller models, respectively.
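The formula translates directly into code. In this sketch the function name `nested_f_ratio` is hypothetical, and the example reuses Table 1: because the data are balanced, dropping the airport batch from the model moves its sum of squares and degrees of freedom into the residual, so the residual quantities for the smaller (treatment-only) model can be read off the table.

```python
def nested_f_ratio(ss_large, df_large, ss_small, df_small):
    """F-ratio for a smaller model nested in a larger one.

    Arguments are residual sums of squares and residual degrees of freedom,
    so ss_small >= ss_large and df_small > df_large.
    """
    if df_small <= df_large:
        raise ValueError("the smaller model must have more residual df")
    numerator = (ss_small - ss_large) / (df_small - df_large)
    return numerator / (ss_large / df_large)


# Larger model: treatment + airport (residual row of Table 1).
ss1, df1 = 1.417, 28
# Smaller model: treatment only (airport sum of squares joins the residual).
ss2, df2 = 1.417 + 3.944, 28 + 7
f_airport = nested_f_ratio(ss1, df1, ss2, df2)
```

Here `f_airport` reproduces, up to rounding, the airport F-ratio in Table 1, which illustrates that the classical balanced F-test is a special case of the nested-model comparison.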
For generalized linear models, formulas exist using the deviance (the log-likelihood multiplied by −2) that are asymptotically equivalent to F-ratios. In general, such models are not balanced, and the test for including another batch of coefficients depends on which other sources of variation have already been included in the model.
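One common form of such a test compares the drop in deviance between nested models with a chi-squared distribution. The sketch below does this for the simplest possible case, binary outcomes in two groups, where the smaller model has a common probability and the larger model has one probability per group, so the batch being tested has a single degree of freedom. The counts are hypothetical, the function names are illustrative, and the code assumes every group has at least one success and one failure.

```python
import math


def bernoulli_deviance(successes, trials, prob):
    """Deviance (-2 x log-likelihood) of binary data under one probability."""
    loglik = successes * math.log(prob) + (trials - successes) * math.log(1 - prob)
    return -2.0 * loglik


def deviance_test_two_groups(s1, n1, s2, n2):
    """Compare a common-probability model with a group-specific one."""
    pooled = (s1 + s2) / (n1 + n2)
    dev_small = (bernoulli_deviance(s1, n1, pooled)
                 + bernoulli_deviance(s2, n2, pooled))
    dev_large = (bernoulli_deviance(s1, n1, s1 / n1)
                 + bernoulli_deviance(s2, n2, s2 / n2))
    drop = dev_small - dev_large
    # Chi-squared upper tail with 1 degree of freedom: P(Z^2 > drop).
    p_value = math.erfc(math.sqrt(max(drop, 0.0) / 2.0))
    return drop, p_value


# Hypothetical counts: 30 of 50 successes in one group, 18 of 50 in the other.
drop, p_value = deviance_test_two_groups(30, 50, 18, 50)
```

The chi-squared reference distribution is an asymptotic approximation, so with small samples the resulting p-value should be read as approximate.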
Inference for Variance Parameters
A different sort of generalization interprets the ANOVA display as inference about the variance of each batch of coefficients, which we can think of as the relative importance of each source of variation in predicting the data. Even in a classical balanced ANOVA, the sums of squares and mean squares do not exactly do this, but the information contained therein can be used to estimate the variance components (Cornfield and Tukey 1956; Searle et al. 1992). Bayesian simulation can then be used to obtain confidence intervals for the variance parameters. As illustrated in this article, we display inferences for standard deviations (rather than variances) because these are more directly interpretable. Compared with the classical ANOVA display, our plots emphasize the estimated variance parameters rather than testing the hypothesis that they are zero.
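As a rough illustration of how mean squares carry information about variance components, the sketch below applies a standard method-of-moments estimator to the entries of Table 1. It assumes a balanced two-way random-effects model, in which the expected mean square for a batch equals the error variance plus the number of observations per level times the variance of that batch; the function name `moment_sd` and the truncation of negative estimates at zero are choices made for this example. It does not reproduce the Bayesian inferences described above.

```python
import math


def moment_sd(ms_effect, ms_error, n_per_level):
    """Method-of-moments standard deviation for one batch of effects.

    Uses E[MS_effect] = sigma_error^2 + n_per_level * sigma_effect^2 and
    truncates a negative variance estimate at zero.
    """
    variance = (ms_effect - ms_error) / n_per_level
    return math.sqrt(max(variance, 0.0))


# Mean squares recomputed from the sums of squares and df in Table 1.
ms_resid = 1.417 / 28
sd_treatment = moment_sd(0.078 / 4, ms_resid, n_per_level=8)  # 8 airports each
sd_airport = moment_sd(3.944 / 7, ms_resid, n_per_level=5)    # 5 treatments each
sd_residual = math.sqrt(ms_resid)
```

Because the treatment mean square is below the residual mean square, the moment estimate of the treatment standard deviation is truncated to zero, which is one reason a point estimate alone is a weak summary and an interval for each standard deviation is more informative.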
Generalized Linear Models
The idea of estimating variance parameters applies directly to generalized linear models as well as unbalanced data sets. All that is needed is that the parameters of a regression model are batched into ‘sources of variation’. Figure 1 illustrates with a multilevel logistic regression model, predicting vote preference given a set of demographic and geographic variables.
Multilevel Models and Bayesian Inference
Analysis of variance is closely tied to multilevel (hierarchical) modelling, with each source of variation in the ANOVA table corresponding to a variance component in a multilevel model (see Gelman 2005). In practice, this can mean that we perform ANOVA by fitting a multilevel model, or that we use ANOVA ideas to summarize multilevel inferences. Multilevel modelling is inherently Bayesian in that it involves a potentially large number of parameters that are modelled with probability distributions (see, for example, Goldstein 1995; Kreft and De Leeuw 1998; Snijders and Bosker 1999). The differences between Bayesian and non-Bayesian multilevel models are typically minor except in settings with many sources of variation and little information on each, in which case some benefit can be gained from a fully Bayesian approach which models the variance parameters.
Related Topics
Finite Population and Super-Population Variances
So far in this article we have considered, at each level (that is, each source of variation) of a model, the standard deviation of the corresponding set of coefficients. We call this the finite-population standard deviation. Another quantity of potential interest is the standard deviation of the hypothetical super-population from which these particular coefficients were drawn. The point estimates of these two variance parameters are similar – with the classical method of moments, the estimates are identical, because the super-population variance is the expected value of the finite-population variance – but they will have different uncertainties. The inferences for the finite-population standard deviations are more precise, as they correspond to effects for which we actually have data.
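The distinction can be illustrated by simulation. In the sketch below, batches of effects are drawn from a normal super-population with known standard deviation, and the finite-population variance of each realized batch is computed. The divisor J − 1, the normal super-population, the batch size of 8 and the function names are all assumptions made for this example.

```python
import math
import random


def finite_population_sd(effects):
    """Standard deviation of the realized effects in one batch (divisor J - 1)."""
    j = len(effects)
    mean = sum(effects) / j
    return math.sqrt(sum((e - mean) ** 2 for e in effects) / (j - 1))


def simulate_finite_variances(sigma, n_effects, n_sims, seed=0):
    """Draw batches of effects from N(0, sigma^2); return their finite variances."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_sims):
        effects = [rng.gauss(0.0, sigma) for _ in range(n_effects)]
        out.append(finite_population_sd(effects) ** 2)
    return out


# Super-population sd of 1, with 8 effects per batch.
variances = simulate_finite_variances(sigma=1.0, n_effects=8, n_sims=4000)
mean_variance = sum(variances) / len(variances)
spread = max(variances) - min(variances)
```

Across simulations the finite-population variances average close to the super-population variance, while any single batch can differ from it noticeably. The two estimands are therefore distinct quantities even though their point estimates coincide under the method of moments.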
There has been much discussion about fixed and random effects in the statistical literature (see Eisenhart 1947; Green and Tukey 1960; Plackett 1960; Yates 1967; LaMotte 1983; and Nelder 1977, 1994, for a range of viewpoints), and unfortunately the terminology used in these discussions is incoherent (see Gelman 2005, sec. 6). Our resolution to some of these difficulties is to always fit a multilevel model but to summarize it with the appropriate class of estimand – super-population or finite population – depending on the context of the problem. Sometimes we are interested in the particular groups at hand; at other times they are a sample from a larger population of interest. A change of focus should not require a change in the model, only a change in the inferential summaries.
Contrast Analysis
Non-exchangeable Models
In all the ANOVA models we have discussed so far, the effects within any batch (source of variation) are modelled exchangeably, as a set of coefficients with mean 0 and some variance. An important direction of generalization is to non-exchangeable models, such as in time series, spatial structures (Besag and Higdon 1999), correlations that arise in particular application areas such as genetics (McCullagh 2005), and dependence in multi-way structures (Aldous 1981; Hodges et al. 2005). In these settings, both the hypothesis-testing and variance-estimating extensions of ANOVA become more elaborate. The central idea of clustering effects into batches remains, however. In this sense, ‘analysis of variance’ represents all efforts to summarize the relative importance of different components of a complex model.
ANOVA Compared with Linear Regression
The analysis of variance is often understood by economists in relation to linear regression (for example, Goldberger 1964). From the perspective of linear (or generalized linear) models, we identify ANOVA with the structuring of coefficients into batches, with each batch corresponding to a ‘source of variation’ (in ANOVA terminology).
As discussed by Gelman (2005), the relevant inferences from ANOVA can be reproduced by using regression – but not always least-squares regression. Multilevel models are needed for analysing hierarchical data structures such as ‘split-plot designs’, where between-group effects are compared with group-level errors, and within-group effects are compared with data-level errors.
Given that we can already fit regression models, what do we gain by thinking about ANOVA? To start with, the display of the importance of different sources of variation is a helpful exploratory summary. For example, the two plots in Fig. 1 allow us to quickly understand and compare two multilevel logistic regressions, without getting overwhelmed with dozens of coefficient estimates.
More generally, we think of the analysis of variance as a way of understanding and structuring multilevel models – not as an alternative to regression but as a tool for summarizing complex high-dimensional inferences, as can be seen, for example, in Fig. 2 (finite-population and super-population standard deviations) and Figs. 3 and 4 (group-level coefficients and trends).
Acknowledgements
We thank Jack Needleman, Matthew Rafferty, David Pattison, Marc Shivers, Gregor Gorjanc, and several anonymous commenters for helpful suggestions and the National Science Foundation for financial support.
Bibliography
- Aldous, D. 1981. Representations for partially exchangeable arrays of random variables. Journal of Multivariate Analysis: 581–598.
- Besag, J., and D. Higdon. 1999. Bayesian analysis of agricultural field experiments (with discussion). Journal of the Royal Statistical Society B: 691–746.
- Cochran, W., and G. Cox. 1957. Experimental designs. 2nd ed. New York: Wiley.
- Cornfield, J., and J. Tukey. 1956. Average values of mean squares in factorials. Annals of Mathematical Statistics: 907–949.
- Eisenhart, C. 1947. The assumptions underlying the analysis of variance. Biometrics 3: 1–21.
- Fisher, R.A. 1925. Statistical methods for research workers. Edinburgh: Oliver and Boyd.
- Gawron, V., B. Berman, R. Dismukes, and J. Peer. 2003. New airline pilots may not receive sufficient training to cope with airplane upsets. Flight Safety Digest (July–August): 19–32.
- Gelman, A. 2005. Analysis of variance: Why it is more important than ever (with discussion). Annals of Statistics 33: 1–53.
- Gelman, A., and J. Hill. 2006. Data analysis using regression and multilevel/hierarchical models. New York: Cambridge University Press.
- Gelman, A., C. Pasarica, and R. Dodhia. 2002. Let’s practice what we preach: Using graphs instead of tables. American Statistician 56: 121–130.
- Goldberger, A. 1964. Econometric theory. New York: Wiley.
- Goldstein, H. 1995. Multilevel statistical models. 2nd ed. London: Edward Arnold.
- Green, B., and J. Tukey. 1960. Complex analyses of variance: General problems. Psychometrika 25: 127–152.
- Hodges, J., Y. Cui, D. Sargent, and B. Carlin. 2005. Smoothed ANOVA. Technical report: Department of Biostatistics, University of Minnesota.
- Kirk, R. 1995. Experimental design: Procedures for the behavioral sciences. 3rd ed. Pacific Grove: Brooks/Cole.
- Kreft, I., and J. De Leeuw. 1998. Introducing multilevel modeling. London: Sage.
- LaMotte, L. 1983. Fixed-, random-, and mixed-effects models. In Encyclopedia of statistical sciences, ed. S. Kotz, N. Johnson, and C. Read. New York: Wiley.
- McCullagh, P. 2005. Discussion of Gelman (2005). Annals of Statistics 33: 33–38.
- Nelder, J. 1977. A reformulation of linear models (with discussion). Journal of the Royal Statistical Society A 140: 48–76.
- Nelder, J. 1994. The statistics of linear models: Back to basics. Statistics and Computing 4: 221–234.
- Plackett, R. 1960. Models in the analysis of variance (with discussion). Journal of the Royal Statistical Society B 22: 195–217.
- Searle, S., G. Casella, and C. McCulloch. 1992. Variance components. New York: Wiley.
- Snedecor, G., and W. Cochran. 1989. Statistical methods. 8th ed. Ames: Iowa State University Press.
- Snijders, T., and R. Bosker. 1999. Multilevel analysis. London: Sage.
- Yates, F. 1967. A fresh look at the basic principles of the design and analysis of experiments. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability 4: 777–790.