The Core Concept of Statistics
Abstract
In Chaps. 1 and 2, we showed that proper conclusions need full information and that many popular measures, such as the Odds Ratio or the Percentage of Correct Responses, provide only partial information. In this chapter, we use the framework of SDT to understand statistical inference, including the role of the p-value, a dominant term in statistics. As we show, the p-value confounds effect size and sample size and, hence, also provides only partial information.
This chapter is about the essentials of statistics. We explain these essentials with the example of the t-test, which is the most popular and basic statistical test. This is the only chapter of this book where we go into details because, we think, the details help a great deal in understanding the fundamental aspects of statistics. Still, only basic math knowledge is required. The hasty or math-phobic reader can go directly to Sect. 3.3 (Summary), where we summarize the main findings and the key steps. Understanding at least this Summary is necessary to understand the rest of the book.
3.1 Another Way to Estimate the Signal-to-Noise Ratio
Terms

Hit Rate = Power

False Positive Rate = False Alarm = Type I error

Miss Rate = Type II error

d′ = Cohen’s δ = effect size = standardized effect size

Gaussian distribution = Normal distribution = Bell curve

Sample values, such as tree height, are also called Scores
Some definitions

Sample mean: \(\overline{x} = \frac{1}{n} \sum_{i=1}^n x_i\), where the symbol \(\sum\) means “add up all following terms”

Sample variance: \(s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \overline{x})^2\)

Sample standard deviation: \(s = \sqrt{s^2}\)

Standard error: \(s_{\overline x} = s/\sqrt{n}\)
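These definitions translate directly into code. The following sketch (in Python, with made-up tree heights; the numbers are purely illustrative) computes each quantity from a sample:

```python
import math

def sample_stats(xs):
    """Return sample mean, variance (n - 1 denominator), standard deviation,
    and standard error, following the definitions above."""
    n = len(xs)
    mean = sum(xs) / n                                # x-bar
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)  # s^2
    sd = math.sqrt(var)                               # s
    se = sd / math.sqrt(n)                            # s_x-bar = s / sqrt(n)
    return mean, var, sd, se

heights = [18.1, 20.3, 19.7, 21.0, 20.4]  # hypothetical tree heights in metres
mean, var, sd, se = sample_stats(heights)
print(mean, round(var, 3), round(sd, 3), round(se, 3))
```

Note the n − 1 in the variance: dividing by n would systematically underestimate the population variance for small samples.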
Facts about sample means
 1.
The distribution of sample means \({\overline x}\) is Gaussian (Central Limit Theorem; CLT)
 2.
As the sample size n increases, \(\overline{x} \rightarrow \mu \)
 3.
As the sample size n increases, \(s_{\overline x} \rightarrow 0\)
3.2 Undersampling
Let us start with an example. We are interested in the hypothesis that the mean height of Alpine oak trees at the Northern rim is different from the mean height of oaks from the Southern rim. The straightforward way to address this hypothesis is to measure the height of all trees in the North and South rims, compute the means, and compare them. If the means are different, they are different. If they are the same, they are the same. It is that easy.
Let us now collect a sample from the North trees and a sample from the South trees. If we find a difference of sample means, we cannot know whether it was caused by a true difference of the tree population means or whether the population mean heights were the same and the difference was caused by undersampling. Hence, undersampling may lead to wrong conclusions. For example, even though there is no difference between the means for the North and South tree populations, we may decide that there is a difference because there was a difference in the sample means. In this case, we are making a False Alarm, also called a Type I error.
To understand how undersampling influences decisions, we will first study how likely it is that a sample mean deviates from the true mean by a certain amount. As we will see, the sampling error is determined by the standard deviation of the population, σ, and the sample size, n. Second, we will study how undersampling affects how well we can discriminate whether or not there is a difference in the mean height of the two tree populations. A simple equation gives the answer. The equation is nothing else but a d′ for mean values. Hence, we are in an SDT situation. Third, we want to control the Type I error rate. We will see that the famous p-value just sets a criterion for the Type I error.
3.2.1 Sampling Distribution of a Mean
The equation shows again why with larger sample sizes the sampling error becomes smaller: as \(\sqrt{n}\) grows larger, \(\sigma_{\overline x}=\sigma/\sqrt{n}\) goes to zero. \(\sigma_{\overline x}\) depends on both n and σ. Suppose σ is zero; then all trees have the same height, namely the mean height μ_{North}, and hence we need to measure the height of only one tree. On the other hand, if σ is large, we need to sample many trees to obtain a good estimate of the population mean.
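This shrinkage of the sampling error can be checked by simulation. The sketch below (population parameters invented for illustration) draws many samples of size n and compares the empirical spread of the sample means with σ/√n:

```python
import random
import statistics

random.seed(1)
mu, sigma = 20.0, 2.0  # hypothetical population mean and standard deviation

for n in (4, 16, 64):
    # draw 20,000 samples of size n and record each sample mean
    means = [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
             for _ in range(20_000)]
    empirical = statistics.stdev(means)  # spread of the sampling distribution
    theory = sigma / n ** 0.5            # sigma / sqrt(n)
    print(n, round(empirical, 3), round(theory, 3))
```

Quadrupling n halves the sampling error, as the σ/√n formula predicts.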
Summary
Because of undersampling, sample means likely differ from the true mean. The standard error, \(s_{\overline x}\), is a measure of the expected sampling error.
3.2.2 Comparing Means
Let us next see how undersampling affects a comparison of the means of the North and South trees. Obviously, there is a family of sampling distributions for the South trees too. In the following, we assume that the sample sizes and the population variances are the same for both tree populations. As mentioned, if our two samples contain all trees from both populations, we can simply compare the means and note any difference. For smaller samples, both sample means may strongly differ from the true means. First, we subtract the two sample means: \(\overline{x}_{North} - \overline{x}_{South}\). For each pair of samples of North and South trees, we can compare the difference of sample means with the difference of the true means μ_{North} − μ_{South}. Hence, we have only one sampling distribution and the same situation as in the last subsection.
Let us recall our main question. We have collected one sample from the North trees and one sample from the South trees, each with a sample size of n. Most likely there is a difference between the two sample means, i.e., \(\overline{x}_{North} - \overline{x}_{South} \neq 0\). Does this difference come from undersampling even though there is no difference between the population means, or does this difference reflect a true difference in mean height? This is a classic SDT situation, just with means instead of single measurements. How well can we discriminate between the two alternatives? We can answer the question by computing the d′ or Cohen’s δ between the two alternatives. For the first alternative, μ_{North} − μ_{South} = 0, meaning that there is no difference between the mean heights of the North and South trees, i.e., the noise-alone distribution. The corresponding sampling distribution is centered around 0 since there is no difference. For the second alternative, there is a real difference and the sampling distribution is centered at μ_{North} − μ_{South}. Because we do not know the true values, we use estimates.^{2}
The t-value is nothing else than a d′ applied to sampling distributions. Just as for all SDT situations, if t is large, it is fairly easy to distinguish whether a difference of means comes from the signal-with-noise distribution or from the noise-alone distribution. If t is small, then it will be difficult to determine whether there is a true difference.^{3}
By splitting up the standard error, which is the measure for the sampling error, into s and n, we see that the t-value is d (the estimated δ of the population distributions) multiplied by the square root of half the sample size.
Hence, the standard error of the sampling distribution of the means combines both sources of uncertainty, namely the population standard deviation and the uncertainty from undersampling. Second, we see that the t-value is the product of the estimated d of the population distribution and a function of the sample size n. The t-value combines effect and sample size.
Summary
We wanted to know whether or not two means are identical but have difficulty deciding because we have only inaccurate estimates caused by undersampling. This is a classic discrimination task, just with means instead of single measurements. The t-value, which is easy to calculate from the samples we collected, is nothing else than the estimated d′ for this situation. Most importantly, the t-value is a function of the estimated effect size d and the sample size n, namely, the estimated effect size d multiplied by the square root of half the sample size, \(\sqrt{n/2}\).
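The claim that t is just the estimated effect size scaled by √(n/2) can be verified numerically. In this sketch (tree heights invented for illustration), the pooled two-sample t-statistic for equal group sizes is computed both directly and as d · √(n/2):

```python
import math
import statistics

north = [21.4, 19.8, 22.1, 20.6, 21.9, 20.2]  # hypothetical heights in metres
south = [19.1, 18.7, 20.3, 19.9, 18.2, 19.5]
n = len(north)  # equal sample sizes in both groups assumed

diff = statistics.fmean(north) - statistics.fmean(south)
s_pooled = math.sqrt((statistics.variance(north) + statistics.variance(south)) / 2)

d = diff / s_pooled                              # estimated effect size
t_direct = diff / (s_pooled * math.sqrt(2 / n))  # standard two-sample t
t_from_d = d * math.sqrt(n / 2)                  # same value: t confounds d and n

print(round(t_direct, 4), round(t_from_d, 4))
```

The two computations agree to floating-point precision, which makes the confound explicit: doubling d or quadrupling n changes t in exactly the same way.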
3.2.3 The Type I and II Error
As in the yellow submarine example, a large t tells us that the discrimination between population means should be easy, while a small t-value indicates that discrimination should be hard and we may easily arrive at wrong conclusions. It is now straightforward to decide about the null hypothesis. We compute t and then apply a criterion. If the computed t-value is greater than the criterion, we take that as evidence that the estimated difference of means did not come from the noise-alone distribution: there is a difference between the two means. If the computed t-value is smaller than the criterion, then we do not have confidence that there is a difference between the two means. Maybe there is a difference, maybe not. We do not draw any conclusions.
In practice, different fields use different criteria, which reflects their relative comfort levels with making Hits or False Alarms. For example, physics often follows the “5σ rule,” which requires t > 5 to claim that an experiment has found sufficient evidence that there is a difference between mean values. Compared to other fields, this is a very high criterion; and it partly reflects the fact that physicists often have the ability (and resources) to greatly reduce σ and s by improving their measurement techniques. In particle physics, the Large Hadron Collider produces trillions of samples. Fields such as medicine, psychology, neuroscience, and biology generally use a criterion that (to a first approximation) follows a “2σ rule.” This less stringent criterion partly reflects the circumstances of scientific investigations in these fields. Some topics of interest are inherently noisy, and the population differences are small. Simultaneously, the per-unit cost for medical or biological samples is often much higher than for many situations in physics; and in some situations (e.g., investigations of people with rare diseases) a large sample size is simply impossible to acquire.
SDT also tells us that any chosen criterion trades off Hits and False Alarms, and the 5σ and 2σ rules are no exception. Everything else equal, the 5σ rule will have fewer Hits than the 2σ rule. Likewise, everything else equal, the 5σ rule will have fewer False Alarms than the 2σ rule.
Rather than setting a criterion in terms of standard deviation σ, in many fields (including medicine, psychology, neuroscience, and biology), scientists want to keep the Type I error smaller than a certain value, e.g., 0.05. It should be clear why one wants to limit this kind of error: it would cause people to believe there is an effect when there really is not. For example, one might conclude that a treatment helps patients with a disease, but the treatment is actually ineffective, and thus an alternative drug is not used. Such errors can lead to deaths. From a philosophical perspective, scientists are skeptical and their default position is that there is no difference: a treatment does not work, an intervention does not improve education, or men and women have similar attributes. Scientists will deviate from this default skepticism only if there is sufficient evidence that the default position is wrong.
Summary
A Type I error occurs when there is no difference in the means, i.e., the null hypothesis is true, but we decide there is one. The Type I error is a False Alarm in terms of SDT. To decide about the null hypothesis, we compute t, which reflects the discriminability in terms of SDT, and then apply a criterion.
3.2.4 Type I Error: The p-Value is Related to a Criterion
There is great flexibility in this approach. For example, you might suspect that Northern and Southern trees have different heights, but have no guess as to which would be larger. In this scenario, you might use a criterion t_{cv} = ±2.0, where a t-value more extreme (further from zero) than 2.0 would be taken as evidence that there is a difference in population means. With this approach, the Type I error rate would be 0.0456, which is twice as large as for the one-tailed test (see Fig. 3.5b). This test is called a “two-tailed t-test”.
As mentioned, within medicine, psychology, neuroscience, and biology, a common desired rate is 0.05. For large sample sizes and for situations where one considers both positive and negative t-values (two-tailed t-test), p = 0.05 corresponds to t = ±1.96. Thus, setting the Type I error rate to 0.05 corresponds to setting a criterion of t_{cv} = ±1.96. This relationship is why these fields follow an (approximate) 2σ rule. Whereas the t-value can be computed by hand, we need a statistics program to compute the p-value.
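For large samples, where the sampling distribution is close to Gaussian, the correspondence between a criterion and a Type I error rate can be computed from the normal CDF. A stdlib-only sketch:

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the standard Gaussian."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_tailed_alpha(t_cv):
    """Type I error rate when rejecting for |t| > t_cv (large-sample approximation)."""
    return 2.0 * (1.0 - normal_cdf(t_cv))

print(round(two_tailed_alpha(1.96), 3))  # 0.05: the common criterion
print(round(two_tailed_alpha(2.0), 4))   # about 0.0455 under the Gaussian approximation
print(round(two_tailed_alpha(5.0), 9))   # the physicists' 5-sigma rule: below one in a million
```

For small samples, a t distribution with the appropriate degrees of freedom would replace the Gaussian, which is why statistics programs are used in practice.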
Summary
If the t-value is larger than a certain value (which depends on the sample size n), we conclude that there is a significant effect.
3.2.5 Type II Error: Hits, Misses
Hence, it seems that it would likewise be easy to compute the Type II error rate. However, this is not the case. When computing the Type I error, we know that the sampling distribution corresponding to the Null hypothesis is centered at one value, namely, 0. Hence, there is only one Null hypothesis. However, there are infinitely many alternative hypotheses (see Implication 2e). But perhaps we are only interested in substantial differences between the means of the North and South trees, say, when the North trees are at least 1.2 m taller than the South trees. In this case, we know the minimal separation between the population distributions and can ask how large the sample size n must be to reach a significant result at least 80% of the time.
The Type II error plays an important role in terms of power. Power is the probability of obtaining a significant result when there indeed is an effect, i.e., when the Alternative hypothesis is true. Power is just another term for the Hit rate. The Hit rate is 1 − the Type II error rate. Power will be crucial in Part III and is further explained in Chap. 7.
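Under the same large-sample approximation, the Hit rate (power) for a given true effect size δ and sample size n can be sketched, and one can search for the smallest n that reaches 80% power; the δ = 0.5 below is an invented example value:

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power(delta, n, t_crit=1.96):
    """Approximate Hit rate: P(t > t_crit) when the true effect size is delta.

    Large-sample Gaussian approximation of the sampling distribution;
    an illustration, not an exact power computation.
    """
    return 1.0 - normal_cdf(t_crit - delta * math.sqrt(n / 2))

n = 1
while power(0.5, n) < 0.8:  # smallest per-group n with at least 80% power
    n += 1
print(n)  # 63 under this approximation
```

An exact computation would use the noncentral t distribution, but the approximation is good enough to see how power grows with δ√(n/2).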
3.3 Summary
The above considerations are fundamental for understanding statistics. For this reason, we spell out the main steps once more and highlight the most important points. Even if you did not go through the above subsections, the main ideas should be understandable here.
We were interested in whether the mean height of oak trees on the North rim of the Alps is the same as on the South rim. The question is easy to answer. We just measure all trees, compute the two means, and know whether or not there is a difference. If we miss a few trees, we obtain estimates, which are likely not too different from the true means. The fewer trees we measure, the larger the sampling error, i.e., the more likely it is that our two sample means differ substantially from the two true means. As we have shown, this sampling error can be quantified by the standard error \(s_{\overline x}\), which depends on the population standard deviation σ and the sample size n. If σ is small, we need to sample only a few trees to get a good estimate of the mean. For example, if σ = 0 we need to sample only one tree from each population because all trees in each population have the same height. If σ is large, we need to sample many trees to get a good estimate of the mean.
We compute the sample means \(\overline{x}_{North}\) and \(\overline{x}_{South}\) from the n trees x_{i} we collected, we estimate the standard deviation s of the trees (see grey box), and multiply by \(\sqrt{n/2}\), a function of the sample size. The right hand side shows that the t-value is nothing else than an estimate of the effect size d multiplied by a function of the sample size. The t-value tells us how easily we can discriminate whether or not a difference in the sample means comes from a real difference of the population means. The situation is exactly as in Chap. 2. The t-value is nothing else than a d′ where, instead of dividing by the standard deviation, we divide by the standard error, which is a measure of the sampling error, taking both sources of noise, population variance and undersampling, into account. A large t-value means we can easily discriminate between means, and a small t-value suggests the decision is hard. Note that a large t-value can occur because the estimated effect size d is large, n is large, or both are large.
Assume that there is no effect, i.e., the mean height of the North and South trees is identical (δ = 0); then the p-value tells us how likely it is that a random sample would produce a t-value at least as big as the t-value we just computed. Thus, if we are happy with a 5% Type I error rate and the p-value is smaller than 0.05, we call our mean difference “significant”.
The p-value is fully determined by the t-value and is computed by statistics programs. Most importantly, the t-value combines an estimate of the effect size d with the sample size (\(\sqrt{\frac{n}{2}}\)), which is why the t-value, and hence the p-value, confounds effect and sample size and, therefore, represents only partial information! This insight will be important for understanding several implications, which we present after the following example.
3.4 An Example
Computing the p-value is simple, as the following short example will show. Understanding the implications of the t-test is more complicated.
Results from tests like this are often summarized in a table as presented in Table 3.1. The p-value is in the column marked “Sig. (2-tailed).” In the table, degrees of freedom (df) are mentioned. The degrees of freedom are important for the computation of the p-value because the shape of the sampling distribution is slightly different from a Gaussian for small sample sizes. In addition, one can compute the df from the sample size and vice versa. In the case of the t-test, df = n_{1} + n_{2} − 2 = 5 + 5 − 2 = 8.
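The p-value in the table can also be understood, and roughly reproduced, by simulation: generate many pairs of samples from one and the same population, so that the Null hypothesis is true, and count how often |t| is at least as extreme as the observed 2.373. A stdlib-only sketch:

```python
import random
import statistics

random.seed(0)
n, t_obs = 5, 2.373  # per-group sample size and t-value from Table 3.1
reps = 40_000
extreme = 0

for _ in range(reps):
    # both samples come from the SAME population: the Null hypothesis is true
    a = [random.gauss(0.0, 1.0) for _ in range(n)]
    b = [random.gauss(0.0, 1.0) for _ in range(n)]
    sp = ((statistics.variance(a) + statistics.variance(b)) / 2) ** 0.5
    t = (statistics.fmean(a) - statistics.fmean(b)) / (sp * (2 / n) ** 0.5)
    if abs(t) >= t_obs:
        extreme += 1

p_sim = extreme / reps
print(round(p_sim, 3))  # close to the 0.045 in the "Sig. (2-tailed)" column
```

The simulated fraction is exactly what the p-value means: the Type I error probability of a result at least this extreme when there is no true difference.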
As mentioned, significance does not tell you too much about your results. It is important to look at and report the effect size. Cohen proposed guidelines for effect sizes for a t-test, which are shown in Table 3.2.
Take Home Messages

Since the p-value is determined by the t-value, it confounds effect size (d) and sample size (n). The original idea behind the t-test was to provide tools to understand to what extent a significant result is a matter of random sampling, given a certain effect size d. Nowadays, the p-value is often mistaken for a measure of effect size, which was never intended and is simply wrong!

Partial information: proper conclusions can only be based on both the estimated population effect size, d, and the sample size, n. Hence, it is important to report both values, to take both values into account when drawing conclusions, and to understand whether a significant result is driven by the estimated effect size d, by the sample size, or by both.
Output from a typical statistical software package
             t      df  Sig. (2-tailed)  Cohen's d
Tree height  2.373  8   0.045            1.5 (large effect)
Effect size guidelines for d according to Cohen
             Small  Medium  Large
Effect Size  0.2    0.5     0.8
3.5 Implications, Comments and Paradoxes
For the following implications, Eq. 3.12 is crucial because it tells us that the t-value and, thus, the p-value are determined by the estimated d and the sample size n.
Implications 1
Sample Size
Implication 1a
According to Eq. 3.12, if the estimated d ≠ 0, then there is always an n for which the t-test is significant. Hence, even very small effect sizes can produce a significant result when the sample size is sufficiently large. Not only large effect sizes lead to significant results, as one might expect; any nonzero effect size leads to significant results when n is large enough.^{6}
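This implication follows directly from t = d · √(n/2): solving d · √(n/2) > t_crit for n gives the sample size at which any nonzero d becomes significant. A sketch using the large-sample two-tailed criterion 1.96:

```python
import math

def n_for_significance(d, t_crit=1.96):
    """Smallest per-group sample size n with d * sqrt(n / 2) > t_crit."""
    return math.floor(2 * (t_crit / d) ** 2) + 1

for d in (0.8, 0.5, 0.2, 0.05, 0.01):
    print(d, n_for_significance(d))
# a large effect (d = 0.8) needs only 13 trees per group;
# a tiny effect (d = 0.01) needs 76,833, but it gets there eventually
```

The required n grows with 1/d², yet it always exists, which is exactly why significance alone says nothing about the size of an effect.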
Implication 1b
If the estimated d ≠ 0 (and d < 4.31), then there are sample sizes n < m for which the t-test is not significant for n but is significant for m.^{7} This pattern may seem paradoxical if you read it as: there is no effect for n but there is an effect for m. However, this is not the correct reading. We can only conclude that for m we have sufficient evidence for a significant result but insufficient evidence for n. From a null result (when we do not reject the null hypothesis) we cannot draw any conclusions (see Implication 3). We will see in Part III that this seeming paradox points to a core problem of hypothesis testing.
Implication 1c. Provocative Question
Isn’t there always a difference between two conditions, even if it is just very tiny? It seems that, except for a few cases, the difference between population means μ_{1} − μ_{2} is never really zero. How likely is it that the North and South tree means are both exactly 20.2567891119 m? Hence, we can always find a sample size n such that the experiment is significant. Why then do experiments at all?
Implications 2
Effect Size
Implication 2a
As mentioned, the p-value is not a measure of the population effect size δ and, for each d ≠ 0, there is an n for which there is a significant outcome. Thus, small effects can be significant. According to a study, consuming fish oil daily may significantly prolong your life. However, it may prolong your life by only 2 min. Do you bother?
Implication 2b
By itself, the p-value does not tell us about the effect size. For example, when the sample size increases (everything else equal), the p-value decreases because the variance of the sampling distribution becomes smaller (see Fig. 3.3). Thus, even if the effect size d is exactly the same, the p-value changes with sample size.
Implication 2c
The p-values of two experiments A and B may be exactly the same. However, one cannot draw conclusions from this fact. For example, it may be that experiment A has a large effect size d and a small sample size and vice versa for experiment B. Hence, one can never compare the p-values of two experiments when the sample size is not the same. Likewise, a lower p-value in experiment A than in B does not imply a larger effect size. The sample size might just be larger.
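The confound can be made concrete with t = d · √(n/2); the two hypothetical experiments below produce identical t-values, and hence identical p-values, from very different effect sizes:

```python
import math

def t_value(d, n):
    """t as a function of estimated effect size and per-group sample size."""
    return d * math.sqrt(n / 2)

t_a = t_value(1.0, 10)    # experiment A: large effect, small sample
t_b = t_value(0.1, 1000)  # experiment B: tiny effect, huge sample
print(round(t_a, 4), round(t_b, 4))  # both 2.2361: indistinguishable by p alone
```

Without knowing d and n separately, a reader of the two p-values could not tell a strong drug tested on 10 patients from a negligible one tested on 1000.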
Implication 2d
A study with a small sample size leading to a small pvalue indicates a higher estimated effect size than a study with a larger sample size and the same pvalue.
Implication 2e
Implications 3
Null results
Implication 3a
Absence of proof is not proof of absence: one can never conclude that there is no effect in an experiment (d = 0) when there was no significant result. A non-significant p-value indicates either that there is no difference or that there is a real difference that is too small to reach significance for the given sample size n.
Implication 3b
One might argue that with a larger sample size, the difference between the two groups might become significant. This may indeed be true. However, the effect in the placebo group may also become significant. What should we conclude? Contrary to intuition, this situation does not constitute a problem because we can ask whether or not there is a stronger effect in the Aloe Vera condition than in the placebo condition (thus discounting the self-healing).
Importantly, the example shows that it often makes very little sense to compare statements such as “there was an effect in condition A but not in condition B”. Such conclusions are ubiquitous in science and should be treated with great care (see also Chap. 7). The classic example, as in the Aloe Vera case, is to compare an intervention condition, where a significant result is sought, with a control condition, where a Null result is sought.
Implications 4
Truth, Noise and Variability
Implication 4a
This kind of model is appropriate in many fields, such as physics. However, in biology, medicine and many other fields, the situation is often quite different. For example, we might determine the strength of headaches before and after a pain killer was consumed. We find that the drug decreases headache strength on average. However, as with most drugs, there may be people who do not benefit from the drug at all. In addition, some people receive more benefit than others, i.e., for some people headaches may almost disappear while for others there is only a small (or even opposite) effect. These personspecific effects might hold for all days and situations.
In many experiments, one cannot easily disentangle v_{i} and 𝜖_{ij}. Both terms contribute to the estimated standard deviation of the population distribution, s. From a mathematical point of view, it does not matter whether there is strong inter-participant variability or strong measurement noise. However, for interpreting the statistical analysis, the distinction is crucial. Assume there is a strong beneficial effect of the pain killer for half of the population, whereas there is a smaller detrimental effect for the other half of the population. On average, the drug has a positive effect, and this effect may turn out to be significant. Importantly, whereas the drug is beneficial on average, this is not true individually. For half of the population, the effect is detrimental and it is not a good idea to use the pain killer. Hence, when v_{i} is not zero, significant results do not allow conclusions on the individual level.

A study may show that carrots are good for eyesight on average. Whether this is true for you is unclear. Carrots may actually deteriorate your vision, even though they help the vision of other people. These considerations do not imply that such studies are wrong; they just show the limitations of studies where v_{i} ≠ 0 for some i. For an international comparison of blood pressure values, average values are good. However, it is usually not a good idea to compare yourself to such a large group, whatever is being measured. Such a sample is not only heterogeneous across the regions but also contains people of different ages. A body mass index of 27 may be an issue for children below 5 years but not necessarily for people older than 70 years. Hence, it depends very much on the research question to what extent a mean comparison makes sense. It is a matter of interpreting statistics, not of computing statistics.
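The pain-killer scenario can be simulated. In the sketch below (all numbers invented), half the population has a strong benefit (v_i = −3 on some headache scale) and half a mild detriment (v_i = +1), plus measurement noise 𝜖_ij; the average change is clearly beneficial even though many individuals get worse:

```python
import random
import statistics

random.seed(2)
n = 200
# person-specific effects v_i: strong benefit for half, mild harm for the rest
v = [-3.0] * (n // 2) + [1.0] * (n // 2)
eps = [random.gauss(0.0, 1.0) for _ in range(n)]  # measurement noise eps_ij
observed = [vi + ei for vi, ei in zip(v, eps)]    # observed change per person

mean_change = statistics.fmean(observed)
worse = sum(x > 0 for x in observed)  # headaches got stronger for these people
print(round(mean_change, 2), worse)   # mean is negative, yet many are worse off
```

A t-test on these data would happily report a significant average benefit, which is exactly the group-level conclusion that cannot be transferred to individuals when v_i ≠ 0.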
Implication 4b
The above considerations have philosophical implications. Usually, we assume that something is either the case or it is not the case. Either gravity acts on all matter in the entire universe, or it does not. Either oxygen is necessary for humans, or it is not. All of these facts hold true for each individual, i.e., for each element in the universe, for all humans, etc. If a fact has been proven by methods involving statistics, this conclusion is not necessarily justified when v_{i} is different from 0 because the results hold true only on average, not necessarily for all individuals.
Implication 4c
The variability vs. noise problem becomes even more serious when the study contains a non-homogeneous sample differing systematically in a feature that is not explicitly considered. For example, based on how often they go to the doctor, it seems that shorter students are ill more often than taller students. However, this fact has nothing to do with body size. It is simply the case that female students are shorter than male students on average and see the gynecologist much more often than male students see the urologist. However, females see the gynecologist mainly for preventive medical checkups and are by no means more often ill than male students. Since students generally see doctors very infrequently, visits to the gynecologist weigh strongly in the statistics. It is obvious how misinterpretations can occur even in such simple examples. In more complex situations, such misinterpretations are less easy to spot. By the way, one should question whether it is a good idea to draw conclusions about illness frequency based on doctor visits.
Implication 4d
One can also consider the variability vs. noise problem the other way around. When you are planning an experiment, you need to specify whom to include. To be representative, it is good to sample from the entire population, e.g., from all people in a country or even worldwide. However, with this procedure, you may include a heterogeneous population, which makes conclusions difficult. Should you include astronauts or coma patients? What about ill people? The large portion of people with too high blood pressure? The more subpopulations you exclude, the less representative your sample becomes. Eventually, your sample may include only you.
Implication 4e
As a last point: effects often depend on dosage, i.e., different people may respond differently to different dosages. A pain killer may have beneficial effects for some people at a low dosage but be detrimental at a higher dosage. Hence, there is not only systematic inter-person variability but also systematic intra-person variability, in addition to the unsystematic noise 𝜖_{ij}. In many experiments, many such sources are involved, i.e., effects depend on dosage, inter-individual differences, and noise, limiting conclusions strongly. As we will see, dosage-dependent effects are best described by correlations (Chap. 8) rather than by t-tests.
Implications 5a
The Statistics Paradox and the Dangers of Cohort Studies
For large effect sizes, as they occur for example in physics, we often do not need to compute statistics. Likewise, the hypothesis that elephants are on average larger than ants does not need statistics: any living elephant is larger than any ant, so δ is extremely large. The original idea of statistics was to determine whether a “bit noisy” effect really exists and to determine the sample size n needed to show that the effect is indeed real. We may say that statistics was developed for medium effect sizes and medium sample sizes. In the past, it was usually impossible to obtain significant results with small effect sizes because data were scarce and handling large sample sizes was cumbersome. Hence, n was usually small and only experiments with large effect sizes produced significant results. This has changed, largely because data collection has become cheap and it is possible to combine and handle millions of samples, as, for example, in genetics. For this reason, statistics is nowadays widely used not only for medium effects but also for very small effect sizes. However, this development is not free of danger. First of all, large sample sizes should not be confused with large effect sizes (Implication 2a). Second, conclusions are often very difficult to draw, particularly in so-called cohort studies. In cohort studies, for example, patients are compared with controls, or vegetarians are compared with meat eaters. The two groups are defined by a given label.
The difference in blood pressure between the different education groups is only about 2 mm Hg. To put this effect size into perspective, measure your blood pressure and repeat 5 min later. You will see that 2 mm Hg is very small compared to your intra-person variance (𝜖_{ij}) and very low compared to the large range of inter-person variability (v_{i}). In addition, blood pressure changes strongly during activity. Maybe there is only a difference when the blood pressure is measured during rest. Maybe, maybe not. The main problem with these so-called cohort studies is that there are too many factors that are causally relevant but cannot be controlled for. To control for all these effects and their combinations, sample sizes may need to be larger than the number of people on the planet. In addition, is it really worth investigating 2 mm Hg? If you want to lower your blood pressure, a little bit of sport might do a good job and is much cheaper than paying thousands of dollars for education.
Implications 5b. Small Effect Sizes
As shown, studies with small effect sizes require extra care. However, small effect sizes are not always problematic. First, it is good to reduce the side effects of a drug consumed by millions of people, even if only by 1%. Second, many important discoveries started off with small effects, but subsequent investigations refined the methods and produced bigger effects.
Implications 5c. Conclusions
Importantly, both small and large sample sizes can be problematic. It is well known that small sample sizes are a problem because of undersampling. It is less well understood that large sample sizes may be just as problematic when effect sizes are small, because even tiny differences may become significant. In particular, cohort studies with small effect sizes and large sample sizes are often useless because small correlations between the investigated factor and unrelated factors can create significant results. For this reason, it is important to look at both the effect size and the sample size. Whereas the sample size n is usually mentioned, this is not always true for effect sizes. For the t-test, the effect size is often expressed as Cohen's d (see also Chap. 4). In the following chapters, we will introduce effect sizes for other tests.
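Cohen's d can be computed directly from the two samples as the mean difference divided by the pooled standard deviation. The following minimal sketch uses made-up data; the function name and the two samples are our own illustrations, not from the text.

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) +
                  (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

# Two small illustrative samples (e.g., leaf sizes from two trees).
north = np.array([10.2, 11.1, 9.8, 10.6, 10.9])
south = np.array([ 9.5, 10.0, 9.1,  9.8,  9.6])
print(f"Cohen's d = {cohens_d(north, south):.2f}")  # prints "Cohen's d = 2.08"
```

Reporting d alongside n and the p-value gives the full picture: the p-value tells you whether an effect is detectable with this sample, while d tells you how large it is.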
How to Read Statistics?
For different samples, the estimate of the effect d′ may vary strongly. The larger the sample size n, the less variance there is and the better the estimate. Hence, first check whether n is sufficiently large. If so, decide whether the effect size is appropriate for your research question. Tiny effect sizes are important only in some cases and may come from confounding, unidentifiable factors. In Part III, we will see that the combination of sample size and effect size can give interesting insights into the “believability” of a study. For example, we will ask how likely it is that four experiments, each with a small sample size and effect size, all lead to significant results with p-values just below 0.05.
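The “four experiments all just significant” question can be explored with a quick Monte Carlo sketch. The effect size, sample size, and number of simulation runs below are illustrative assumptions, not values from any particular study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative assumptions: a small true effect (Cohen's d = 0.3)
# and a small sample (n = 20 per group).
d, n, runs = 0.3, 20, 10000

hits = 0
for _ in range(runs):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(d, 1.0, n)
    hits += stats.ttest_ind(a, b).pvalue < 0.05

power = hits / runs
print(f"estimated power ~ {power:.2f}")
print(f"P(4 of 4 experiments significant) ~ {power**4:.4f}")
```

With power around 0.15, the chance of obtaining four significant results in four attempts is well below 1%, which is exactly the kind of believability check developed in Part III.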
Take Home Messages
 1.
Even small effect sizes lead to significant results when the sample size is sufficiently large.
 2.
Do not compare the p-values of two experiments if n is not identical: a smaller p does not imply more significance.
 3.
Statistical significance is not practical significance.
 4.
Absence of proof is not proof of absence: avoid conclusions from a Null result.
 5.
Do not pit a significant experiment against a nonsignificant control experiment.
 6.
Cohort studies with small effects are usually useless.
 7.
A statement like “X is true” can only be true for sure if intersubject variability is zero.
Footnotes
 1.
If the samples from the North and South trees are of different sizes, then the formula is \(\sigma_{\overline{x}_{North} - \overline{x}_{South}} = \sigma \sqrt{\frac{1}{n_{North}} + \frac{1}{n_{South}}}\).
 2.
Typically, the value s is computed by pooling the variances from the two samples. We describe one way of doing this pooling in Sect. 3.4.
 3.
Following the convention of SDT, we will always interpret t as being a positive number, unless we specifically say otherwise. Should the computed value be negative, one can always just switch the means in the numerator.
 4.
For small sample sizes, the t_{cv} criterion is larger because the sampling distributions are not quite Gaussian-shaped. Statistical software that computes the p-value automatically adjusts for the deviation of the sampling distributions from a Gaussian shape.
 5.
Alternatively, one could identify a critical value criterion, t_{cv} = ±2.306 and note that t is farther from zero than this critical value.
 6.
We can also describe the situation as follows. If there is a real effect d ≠ 0, e.g., between the tree means, we can find a sample size n for which we obtain a significant result in almost all cases (or with a certain high probability).
 7.
If d > 4.31 you do not need to compute statistics because the difference is so large. In this case, even n = 2 leads to a significant result.
Reference
 1.Loucks EB, Abrahamowicz M, Xiao Y, Lynch JW. Associations of education with 30-year life course blood pressure trajectories: Framingham Offspring Study. BMC Public Health. 2011;11:139. https://doi.org/10.1186/1471-2458-11-139
Copyright information
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any non-commercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.