FormalPara What You Will Learn in This Chapter

In Part III of this book, we will show that combining data from multiple experiments can provide completely new insights. For example, whereas the statistical output of each experiment on its own might make perfect sense, the combination of data across experiments can sometimes reveal problems. How likely is it that four experiments with a small effect and a small sample size all lead to significant results? We will show it is often very unlikely. As a simple consequence, if experiments always produce significant results, the data seem too good to be true. We will show how common, but misguided, scientific practice leads to too-good-to-be-true data, how this practice inflates the Type I error rate, and how it has led to a serious science crisis affecting most fields where statistics plays a key role. In this respect, Part III generalizes the Implications from Chap. 3. At the end, we will discuss potential solutions.

In this chapter, we extend the standardized effect size from Chap. 2 and show how to combine data across experiments to compute meta-statistics.

1 Standardized Effect Sizes

As we noted in Part I of this book, much of statistics involves discriminating signal and noise from noise alone. For a standard two-sample t-test, the signal-to-noise ratio is called Cohen’s d, which is estimated from data as (see Chap. 3):

$$\displaystyle \begin{aligned} d= \frac{{\overline x}_1 - {\overline x}_2}{s}. \end{aligned}$$

Cohen’s d tells you how easily you can discriminate different means. The mean difference is in the numerator. A bigger difference is easier to detect than a smaller one, but we also need to take the noise into account. A bigger standard deviation makes it more difficult to detect a difference of means (see Chap. 2). When $n_1 = n_2 = n$, the t-value for a two-sample t-test is just:

$$\displaystyle \begin{aligned} t= \frac{{\overline x}_1 - {\overline x}_2}{s_{{\overline x}_1 - {\overline x}_2}}=\frac{{\overline x}_1 - {\overline x}_2}{s \sqrt{\frac{2}{n}}}= \frac{d}{ \sqrt{\frac{2}{n}}} = d \sqrt{\frac{n}{2}} \end{aligned}$$

So, a t-value simply weights Cohen’s d by (a function of) sample size(s). As mentioned in Chap. 3, it is always good to check out the effect size. Unfortunately, many studies report just the p-value, which confuses effect size and sample size. Based on the above equation, we can compute Cohen’s d from the reported t-value and the sample sizes:

$$\displaystyle \begin{aligned} d = t \sqrt{\frac{2}{n}} \end{aligned}$$
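The two conversions above are straightforward to implement. The following sketch (function names are our own, chosen for illustration) assumes equal group sizes, as in the equations:

```python
import math

def cohens_d_from_t(t, n):
    """Recover Cohen's d from a reported two-sample t-value,
    assuming equal group sizes n1 = n2 = n: d = t * sqrt(2/n)."""
    return t * math.sqrt(2.0 / n)

def t_from_cohens_d(d, n):
    """The inverse relation: t = d * sqrt(n/2)."""
    return d * math.sqrt(n / 2.0)

# Example: a study reports t = 2.5 with n = 50 subjects per group
d = cohens_d_from_t(2.5, 50)  # 2.5 * sqrt(2/50) = 0.5
```

Note how the same d = 0.5 would yield a much smaller t-value with n = 10 per group, which is exactly why a p-value alone confounds effect size with sample size.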

An important property of Cohen’s d is that its magnitude is independent of the sample size, which is evident from d being an estimate of a fixed (unknown) population value.

In Chap. 3, we showed that δ can be estimated by d. However, d is a good estimator only when the sample size is large. For small samples, d tends to systematically overestimate the population effect size δ. This overestimation can be corrected by using Hedges’ g instead of d:

$$\displaystyle \begin{aligned} g = \left(1- \frac{3}{4\left(2n -2\right)-1}\right)d \end{aligned}$$

For nearly all practical purposes Hedges’ g can be considered to be the same as Cohen’s d. We introduced it here because we will use Hedges’ g to compute meta-analyses. The Appendix to this chapter includes formulas for when $n_1 \neq n_2$ and for other types of experimental designs.
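The correction factor from the equation above can be computed directly. A minimal sketch, again assuming equal group sizes (the function name is ours):

```python
def hedges_g(d, n):
    """Small-sample correction of Cohen's d for a two-sample design
    with equal group sizes n; the correction shrinks d slightly
    toward zero, using df = 2n - 2 degrees of freedom."""
    df = 2 * n - 2
    return (1.0 - 3.0 / (4.0 * df - 1.0)) * d

# With n = 10 per group the correction is noticeable...
g_small = hedges_g(0.5, 10)   # about 0.479
# ...while with n = 100 per group it is negligible
g_large = hedges_g(0.5, 100)  # about 0.498
```

This illustrates the claim in the text: for all but quite small samples, g and d are practically identical.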

2 Meta-analysis

Suppose we run the same (or very similar) experiments multiple times. It seems that we should be able to pool the data across experiments to draw even stronger conclusions and achieve higher power. Indeed, such pooling is known as meta-analysis. It turns out that standardized effect sizes are quite useful for such meta-analyses.

Table 9.1 summarizes statistical values of five studies that concluded that handling money reduces distress over social exclusion. Each study used a two-sample t-test, and the column labeled g gives Hedges’ g, i.e., the estimated effect size for each study.

Table 9.1 Data from five experiments used for a meta-analysis

To pool the effect sizes across studies, it is necessary to take the sample sizes into account: an experiment with 46 subjects in each group should receive somewhat more weight than an experiment with 36 subjects in each group. The final column in Table 9.1 shows the weighted effect size, w × g, for each experiment (see the Appendix for the calculation of w). The pooled effect size is computed by summing the weighted effect sizes and dividing by the sum of the weights:

$$\displaystyle \begin{aligned} g^* = \frac{\sum_{i=1}^5 w_ig_i}{\sum_{i=1}^5 w_i} = 0.632.\end{aligned} $$

This meta-analytic effect size is the best estimate of the effect size based on these five experiments. Whether it is appropriate to pool standardized effect sizes in this way largely depends on theoretical interpretations of the effects. If your theoretical perspective suggests that these experiments all measure essentially the same effect, then this kind of pooling is appropriate, and you get a better estimated effect size by doing such pooling. On the other hand, it would not make much sense to pool together radically different experiments that measured different effects.
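The pooling computation can be sketched as follows. The effect sizes and sample sizes below are hypothetical illustrations, not the values from Table 9.1, and the weight formula is a common inverse-variance choice for Hedges’ g with equal group sizes; the book’s Appendix gives the exact weights used for the table:

```python
# Hypothetical per-study effect sizes (Hedges' g) and per-group sample sizes
gs = [0.70, 0.55, 0.60, 0.68, 0.62]
ns = [36, 46, 40, 38, 44]

def weight(g, n):
    """Inverse-variance weight for Hedges' g, two-sample design with
    equal group sizes n. Uses the standard large-sample variance
    approximation var(g) = 2/n + g^2/(4n); see the Appendix for the
    exact formula. Larger studies get larger weights."""
    var = 2.0 / n + g * g / (4.0 * n)
    return 1.0 / var

# Pooled effect size: weighted sum divided by sum of weights
ws = [weight(g, n) for g, n in zip(gs, ns)]
g_star = sum(w * g for w, g in zip(ws, gs)) / sum(ws)
```

Because g* is a weighted average, it necessarily lies between the smallest and largest individual effect sizes, with the larger studies pulling it more strongly toward their estimates.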

Meta-analyses can become quite complicated when experiments vary in structure (e.g., published analyses may involve t-tests, ANOVAs, or correlations). Despite these difficulties, meta-analysis can be a convenient way to combine data across experiments and thereby get better estimates of effects.

Take Home Messages

  1. Pooling effect sizes across experiments produces better estimates.

  2. Combining data across experiments increases power.