ANOVA
Abstract
In Chap. 3, we examined how to compare the means of two groups. In this chapter, we will examine how to compare the means of more than two groups.
6.1 One-Way Independent Measures ANOVA
In principle, we could compute three t-tests to compare all possible pairs of means (equator vs 49, equator vs 60, and 49 vs 60). However, in this case, as shown in Chap. 5, we would face the multiple testing problem, with the unpleasant side effect that our Type I error rate increases with the number of comparisons. Situations like this are a case for an analysis of variance (ANOVA), which uses a clever trick to avoid the multiple testing problem.
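The inflation of the Type I error rate is easy to demonstrate by simulation. The sketch below (with invented population parameters and sample sizes) draws three groups from the *same* population, so the null hypothesis is true, and counts how often at least one of the three pairwise t-tests is significant at α = 0.05:

```python
# Illustration of the multiple testing problem: three pairwise t-tests on
# three groups drawn from the SAME population. The familywise Type I error
# rate ends up well above the per-test alpha of 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_sim, n_per_group = 5000, 10

false_alarms = 0
for _ in range(n_sim):
    # All three "regions" share one population: the null hypothesis is true.
    a, b, c = (rng.normal(20.0, 4.0, n_per_group) for _ in range(3))
    pvals = [stats.ttest_ind(x, y).pvalue for x, y in ((a, b), (a, c), (b, c))]
    false_alarms += min(pvals) < alpha  # any single rejection is a Type I error

print(f"Familywise Type I error rate: {false_alarms / n_sim:.3f}")
```

The simulated rate comes out clearly above the nominal 5%, which is exactly the problem the ANOVA is designed to avoid.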
6.2 Logic of the ANOVA
Terms

way = factor

group = treatment = level
The logic of the ANOVA is simple. We simplify our alternative hypothesis by asking whether or not at least one of the three tree populations differs from the others. Hence, we are stating one hypothesis instead of three by lumping all alternative hypotheses together:

\(H_{0}: \mu_{1} = \mu_{2} = \mu_{3}\)

\(H_{1}: \mu_{j} \neq \mu_{l}\) for at least one pair of groups \(j, l\)
The ANOVA assumes, similarly to the t-test, that all groups have the same population variance σ². If the null hypothesis is true, then the population means are equal for all three groups of trees. Any observed differences in the sample means then come from the variance σ² alone, which is due to random differences in tree heights (noise), not to systematic differences in tree heights with geographic region (Fig. 6.1). It turns out that when the null hypothesis is true, the variability between the group means can be used to estimate σ² (by multiplying by the sample sizes). An ANOVA compares this between-means estimate to a direct estimate computed within each group.
\[ F = \frac{MS_{between}}{MS_{within}} = \frac{\sum_{j=1}^{k} n_{j}\,(M_{j} - M_{G})^{2}\,/\,(k-1)}{\sum_{j=1}^{k}\sum_{i=1}^{n_{j}} (x_{ij} - M_{j})^{2}\,/\,(n-k)} \]

where k is the number of groups (three tree populations), n_{j} is the number of scores in group j (the number of trees within each sampled geographic region), n is the total number of scores pooled over all groups, M_{j} is the mean for group j (mean of geographic region sample j), M_{G} is the grand mean of all scores pooled together, and x_{ij} is the ith score for group j (the height of a single tree). To make it easier to distinguish the means from individual scores, we use the symbols M_{j} and M_{G} rather than the traditional symbol for a sample mean \(\bar {x}\). The multiplication by n_{j} in the numerator weights the deviations of the group means around the grand mean by the number of trees in each group, so that the numbers of scores contributing to the variance estimates are equated between the numerator and denominator.
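The F-ratio can be checked numerically against a library routine. The sketch below uses made-up tree heights for the three regions (the numbers are illustrative, not from the text) and compares the hand computation with `scipy.stats.f_oneway`:

```python
# Hand computation of the one-way ANOVA F-ratio, checked against scipy.
import numpy as np
from scipy import stats

groups = [np.array([21.0, 23.5, 19.8, 24.1, 22.0]),   # "equator" sample (invented)
          np.array([18.2, 17.5, 19.9, 16.8, 18.4]),   # "49" sample (invented)
          np.array([15.1, 16.3, 14.2, 15.8, 16.9])]   # "60" sample (invented)

k = len(groups)                          # number of groups
n = sum(len(g) for g in groups)          # total number of scores
M_G = np.concatenate(groups).mean()      # grand mean

# Between-groups estimate: deviations of group means, weighted by n_j.
ss_between = sum(len(g) * (g.mean() - M_G) ** 2 for g in groups)
# Within-groups estimate: deviations of scores around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

F = (ss_between / (k - 1)) / (ss_within / (n - k))
F_scipy, p_scipy = stats.f_oneway(*groups)
print(F, F_scipy)  # the hand computation matches scipy.stats.f_oneway
```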
Just as in the t-test, a criterion is chosen for statistical significance to set the Type I error rate to a desired rate (e.g., α = 0.05). When F exceeds the criterion, we conclude that there is a significant difference (i.e., we reject the null hypothesis of equality between the group means).
The tree example is a one-way ANOVA, where there is one factor (tree location) with three groups (regions) within the factor. The groups are also called levels and the factors are also called ways. There can be as many levels as you wish within a factor, e.g. many more regions from which to sample trees. A special case is a one-way independent measures ANOVA with two levels, which compares two means as does the t-test. In fact, there is a close relationship between the two tests and in this case it holds that: F = t^{2}. The p-value here will be the same for the ANOVA and the two-tailed t-test. Hence, the ANOVA is a generalization of the t-test.
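The relationship F = t^{2} is easy to verify numerically. The sketch below draws two made-up samples and runs both tests on them:

```python
# Checking F = t^2: a one-way ANOVA with two levels and a two-tailed
# independent t-test agree exactly (data invented for illustration).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
g1 = rng.normal(10.0, 2.0, 8)
g2 = rng.normal(12.0, 2.0, 8)

t, p_t = stats.ttest_ind(g1, g2)
F, p_F = stats.f_oneway(g1, g2)

print(F, t ** 2)   # F equals t squared
print(p_F, p_t)    # identical p-values
```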
As with the t-test, the degrees of freedom play an important role in computing the p-value. For a one-way independent measures ANOVA with k levels, there are two types of degrees of freedom, df_{1} and df_{2}. In general, df_{1} = k − 1 and df_{2} = n − k, where n is the total number of sampled scores pooled over all groups, e.g., all trees in the three groups. The total of the degrees of freedom is df_{1} + df_{2} = n − 1.
6.3 What the ANOVA Does and Does Not Tell You: Post-Hoc Tests
Here, the ANOVA offers a second trick. If we rejected the null hypothesis, it is appropriate to compare pairs of means with what are called "post-hoc tests," which, roughly speaking, amount to computing pairwise comparisons. Contrary to the multiple testing situations discussed in Chap. 5, these multiple comparisons do not inflate the Type I error rate because they are only conducted if the ANOVA finds a main effect.
There are many post-hoc tests in the statistical literature. Commonly used post-hoc tests include: Scheffé, Tukey, and REGWQ. The process is best described with an example, which is provided at the end of this chapter.
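As a concrete sketch, one common textbook formulation of the Scheffé test evaluates each pairwise contrast against (k − 1) · MS_{within} with the ANOVA's own degrees of freedom (k − 1, n − k). The data below are invented, not the example from this chapter:

```python
# A sketch of the Scheffé post-hoc test for pairwise comparisons, using a
# common textbook formulation (assumed here, not quoted from the text).
import numpy as np
from scipy import stats

def scheffe_pairwise(groups, i, j):
    """Scheffé F and p-value for comparing groups i and j."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    # Pooled within-groups variance, the same denominator as in the ANOVA.
    ms_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - k)
    gi, gj = groups[i], groups[j]
    F = (gi.mean() - gj.mean()) ** 2 / (
        (k - 1) * ms_within * (1 / len(gi) + 1 / len(gj)))
    p = stats.f.sf(F, k - 1, n - k)  # survival function: P(F' >= F)
    return F, p

groups = [np.array([12.0, 14.0, 11.0, 13.0, 12.5]),   # invented samples
          np.array([12.5, 13.0, 12.0, 14.5, 13.5]),
          np.array([17.0, 18.5, 16.0, 19.0, 17.5])]
for i, j in [(0, 1), (0, 2), (1, 2)]:
    F, p = scheffe_pairwise(groups, i, j)
    print(f"group {i + 1} vs. {j + 1}: F = {F:.2f}, p = {p:.4f}")
```

With these numbers, only the comparisons involving the third group come out significant, mirroring the kind of pattern a post-hoc table reports.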
6.4 Assumptions
1. Independent samples.
2. Gaussian distributed populations.
3. The independent variable is discrete, while the dependent variable is continuous.
4. Homogeneity of variance: all groups have the same variance.
5. The sample size needs to be determined before the experiment.
6.5 Example Calculations for a One-Way Independent Measures ANOVA
6.5.1 Computation of the ANOVA
Our final F-value is 9.14. This means that the variability of the group means around the grand mean is 9.14 times the variability of the data points around their individual group means. Hence, much of the variability comes from differences between the means, and much less from variability within each population. An F-value of 9.14 leads to a p-value of 0.0039 < 0.05 and we conclude that our results are significant, i.e., we reject the null hypothesis that all three sword types yield equal mean numbers of wins (F(2, 12) = 9.14, p = 0.0039). Furthermore, we can conclude that at least one sword type yields a different number of wins than the other sword types. We can now use one of the various post-hoc tests to find out which sword(s) is/are superior.
6.5.2 Post-Hoc Tests
Various procedures exist for performing post-hoc tests, but we will focus here on the Scheffé test in order to illustrate some general principles.
Post-hoc Scheffé test results for our three comparisons

Comparison   Result
1 vs. 2      F(2, 12) = 0.33, p = 0.728
1 vs. 3      F(2, 12) = 5.22, p = 0.023
2 vs. 3      F(2, 12) = 8.16, p = 0.006
6.6 Effect Size
Effect size guidelines according to Cohen

              Small   Medium   Large
Effect size   0.01    0.09     0.25
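A common effect size for the ANOVA is η², the proportion of the total variability explained by group membership. The sketch below computes it for made-up data (the groups and scores are invented for illustration):

```python
# Eta squared as an ANOVA effect size: SS_between / SS_total, i.e. the
# proportion of total variability explained by group membership.
import numpy as np

groups = [np.array([4.0, 5.0, 6.0, 5.0]),     # invented samples
          np.array([7.0, 8.0, 6.0, 7.0]),
          np.array([9.0, 10.0, 11.0, 10.0])]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((all_scores - grand_mean) ** 2).sum()

eta_squared = ss_between / ss_total  # 0 = no effect, 1 = all variance explained
print(f"eta^2 = {eta_squared:.3f}")  # large by the guidelines above (> 0.25)
```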
6.7 Two-Way Independent Measures ANOVA
The one-way independent measures ANOVA generalizes nicely to cases with more than one factor. Here, we will discuss the simplest of such cases, the two-factor design.
1. H_{0}: There is no effect of time of day on the number of villains caught.
   H_{1}: The number of villains caught during the day is different from the number of villains caught at night.
2. H_{0}: There is no effect of costume material on the number of villains caught.
   H_{1}: At least one costume material yields a different number of villains caught than the other costume materials.
3. H_{0}: The effect of time of day on the number of villains caught does not depend on costume material.
   H_{1}: The effect of time of day on the number of villains caught does depend on costume material.
The first two null hypotheses relate to what are called main effects. Testing these two main hypotheses is exactly the same as computing two one-way ANOVAs. The third hypothesis is a new type of hypothesis and pertains to the interaction between the two factors, costume material and time of day. To measure the main effect of costume material, we take the average number of villains caught in the spandex group, averaging over both day and night conditions, and compare this with the same averages for the cotton and leather costume conditions. To measure the main effect of time of day, we look at the average number of villains caught for the day condition, averaging over the spandex, cotton, and leather costume material conditions, and compare this with the same average for the night condition.
For the interaction, we consider all groups separately, looking at the number of villains caught for the spandex, cotton, and leather costume groups separately as a function of day and nighttime crime-fighting conditions. If there is a significant interaction, then the effect of time of day on the number of villains caught will depend on which costume material we are looking at. Conversely, the effect of costume material on the number of villains caught will depend on the time of day at which our friends are fighting crime.
Testing these three null hypotheses requires three separate F-statistics. Each F-statistic will use the same denominator as in the one-way ANOVA (i.e., the pooled variance of the data about the treatment means, or MS_{within} as shown in Fig. 6.3), but the numerators (MS_{between}) will be specific for the particular hypotheses tested.
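For a balanced design, the three F-ratios can be sketched directly from sums of squares. The array below is invented (axis 0 playing the role of costume material, axis 1 the time of day); the point is only to show the shared denominator and the three separate numerators:

```python
# Three F-ratios of a balanced two-way independent measures ANOVA,
# computed from sums of squares on an invented data array.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a_levels, b_levels, n_cell = 3, 2, 5
# Invented data with a built-in main effect of factor B plus noise.
data = rng.normal(10.0, 2.0, (a_levels, b_levels, n_cell))
data[:, 1, :] += 5.0  # factor B shifts scores at its second level

grand = data.mean()
cell_means = data.mean(axis=2)        # mean of each of the 6 cells
a_means = data.mean(axis=(1, 2))      # factor A (row) means
b_means = data.mean(axis=(0, 2))      # factor B (column) means

n = data.size
ss_a = b_levels * n_cell * ((a_means - grand) ** 2).sum()
ss_b = a_levels * n_cell * ((b_means - grand) ** 2).sum()
ss_cells = n_cell * ((cell_means - grand) ** 2).sum()
ss_ab = ss_cells - ss_a - ss_b        # interaction: what cells add beyond main effects
ss_within = ((data - cell_means[:, :, None]) ** 2).sum()

df_a, df_b = a_levels - 1, b_levels - 1
df_ab = df_a * df_b
df_within = n - a_levels * b_levels

ms_within = ss_within / df_within     # shared denominator of all three F-ratios
for name, ss, df in [("A (material)", ss_a, df_a),
                     ("B (time of day)", ss_b, df_b),
                     ("A x B interaction", ss_ab, df_ab)]:
    F = (ss / df) / ms_within
    p = stats.f.sf(F, df, df_within)
    print(f"{name}: F({df},{df_within}) = {F:.2f}, p = {p:.4f}")
```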
Another virtue of a twofactor design relative to a onefactor design is that variability that would otherwise be included in the error term (i.e., MS_{within}) is now partly explained by variability due to another factor, thereby reducing MS_{within} and increasing the power for detecting effects when present.
Thus, it may seem that the more factors we add, the better we will understand the data and the more significant results we will obtain. However, this is not true: we lose power with each factor we add, because fewer scores contribute to each mean. Typically, larger samples are needed as the number of factors increases.
Importantly, if we find a significant interaction, the effect of one factor varies depending on the level of the other factor. Thus, we should usually refrain from drawing conclusions about the main effects if there is an interaction.
The one-way ANOVA avoids the multiple testing problem. However, a multi-way ANOVA reintroduces a kind of multiple testing problem. For example, consider a 2 × 2 ANOVA with a significance criterion of 0.05. A truly null data set (where all four population means are equal to each other) has a 14% chance of producing at least one p < 0.05 among the two main effects and the interaction. If you use ANOVA to explore your data set by identifying significant results, you should understand that such an approach has a higher Type I error rate than you might have intended.
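The 14% figure follows from treating the three tests (two main effects plus the interaction) as approximately independent, each at α = 0.05:

```python
# Familywise error rate for the three tests of a 2 x 2 ANOVA, assuming
# approximately independent tests at alpha = 0.05.
alpha = 0.05
n_tests = 3  # two main effects + one interaction
familywise = 1 - (1 - alpha) ** n_tests
print(f"{familywise:.3f}")  # 0.143, i.e. about 14%
```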
Typical statistical software outputs for a two-way ANOVA

Source            SS      df   MS      F      p         η²
Costume material  1.67    2    0.83    0.083  0.920     0.0069
Time of day       0.83    1    0.83    0.083  0.775     0.0035
Costume × time    451.67  2    225.83  22.58  0.000003  0.6530
Error             240.00  24   10.00
6.8 Repeated Measures ANOVA
Typical statistical software outputs for a repeated measures ANOVA

Source              SS    df   MS    F    p           η²
Between times       70    2    35    70   0.00000009  0.94
Within times        40    12
  Between subjects  36    4
  Error             4     8    0.5
Total               110   14
Take Home Messages
1. With an ANOVA you can avoid the multiple testing problem, to some extent.
2. More factors may improve or reduce power.
Copyright information
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.