Introduction

In traditional hypothesis testing of one sample mean, a null hypothesis indicates the absence of an effect by specifying a particular null value for the population mean, μ0. With that null hypothesis and some information about variation from a sample of data, one can predict the sampling distribution of the sample mean if the null hypothesis is true. Then, one can identify the probability that a random sample would be selected that produces a mean value more extreme than the observed mean. Typically, this probability, known as a p-value, is compared to a criterion value, α, so that if p < α, one concludes that the observed sample mean must be among the “rare” outcomes if the null hypothesis is true. Rare outcomes are unusual (by definition), so scientists infer that the null hypothesis is false. A strength of this approach is that, in ideal situations, it controls the Type I error rate (the probability of rejecting a true null hypothesis) because rare sample means that correspond to p < α occur with a probability of α.

The last decade has highlighted misuses (and abuses) of hypothesis testing (e.g., Kruschke, 2010; Simmons et al., 2011; Nuijten et al., 2016; García-Pérez, 2017). It has become clear that many scientists (perhaps unknowingly) engage in multiple testing, improper sampling, and improper reporting of results; and that these behaviors undermine the Type I error control provided by the hypothesis testing process. Recognition of these problems has highlighted long-running concerns about null hypothesis significance testing (Branch, 2014; Craig & Abramson, 2018; Gelman, 2017; Szucs & Ioannidis, 2017). One common suggestion has been to use other statistics in place of the p-value, partly because the p-value is deemed “unreliable” or too “noisy” to be of much use (e.g., Cumming, 2014). The journal Basic and Applied Social Psychology banned null hypothesis testing procedures (Trafimow & Marks, 2015) and instead now encourages reporting of descriptive statistics and standardized effect sizes. Other scientists have suggested replacing traditional hypothesis testing with Information Criterion model comparison methods (e.g., Glover & Dixon, 2004) or with Bayesian methods (e.g., Nathoo & Masson, 2016; Ortega & Navarrete, 2017; Rouder et al., 2009; Kruschke, 2010).

We see legitimate advantages to these alternative statistical approaches, and we encourage readers to investigate them and to apply them if appropriate. Nevertheless, we feel it is important to recognize the close relationships between different statistics. Francis (2017) showed mathematical equivalence for many statistics in the situation corresponding to a two-sample t-test. These statistics include: values used for the traditional hypothesis test such as t- and p-values; standardized effect sizes, namely Cohen’s d and Hedges’ g; confidence intervals for standardized effect sizes; differences of Information Criterion calculations; and a commonly used Bayes factor. These statistics are mathematically equivalent representations of the data set’s information, and the analysis methods associated with each statistic differ only in how the data set’s information is interpreted. In this paper, we show that similar relationships hold for statistics associated with a one-sample t-test. Provided one knows the sample size, the fifteen statistics in Table 1 give the same information about the data set, and one can mathematically transform any one of the statistics into another statistic. An online app to do the conversions is provided at http://psych.purdue.edu/gfrancis/EquivalentStatistics/index_oneSample.html.

Table 1 For a known sample size of a one-sample t-test, each of these terms is a sufficient statistic for the standardized effect size of the population, δ

Equivalence of statistics

For a one-sample t-test, we have a null hypothesis, given as H0 : μ = μ0, where μ is a population mean and μ0 is a specific value, and an alternative hypothesis, denoted as HA : μ ≠ μ0. (One can also consider directional alternative hypotheses.) In traditional frequentist hypothesis testing, the test statistic is derived from a sample of data as:

$$ t = \frac{\overline{X}-\mu_{0}}{s/\sqrt{n}}, $$
(1)

where \(\overline {X}\) is the mean of the sample, s is the standard deviation of the sample, and n is the sample size. We will show in the following sections how each of the other terms in Table 1 can be computed from the t-value and the sample size.
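
As a concrete illustration (not part of the original analysis), a minimal Python sketch of Eq. 1 using SciPy follows; the sample values and the null value mu0 are hypothetical.

```python
# Minimal sketch: the t statistic of Eq. 1, computed from a hypothetical sample.
import numpy as np
from scipy import stats

x = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4])  # hypothetical data
mu0 = 4.0                                                # hypothetical null value

n = len(x)
t = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))      # Eq. 1

# SciPy's built-in one-sample t-test returns the same t-value (plus a p-value).
t_check, p = stats.ttest_1samp(x, popmean=mu0)
```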

Without loss of generality, we assume a positive t-value, which we refer to as an “unsigned” t-value. This assumption does not lose generality because whether \(\overline {X}\) is greater than or less than μ0 depends on the substantive meaning of the measurements (e.g., do larger values of \(\overline {X}\) correspond to “more” or “less” of something of interest?); and the meaning is not part of the data set. In particular, one could multiply each score in a population and in a sample by -1 and thereby change the sign of the t-value without changing the statistical inferences or interpretation of the data.

p-value

The p-value is the area under the null hypothesis sampling distribution more extreme than ± t. Thus, for a given sample size, the unsigned t statistic has a one-to-one relation with the p-value; and for a given p-value it is possible to compute the corresponding unsigned t-value. The same relation holds for p-values from one-tailed tests. To properly interpret a one-tailed hypothesis test, a scientist must know that the p-value is from a one-tailed test and the direction of the observed difference, so we assume this information is known.
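
For readers who want to check the conversion numerically, a short sketch (assuming SciPy; the values n = 27 and t = 2.18 are taken from the example discussed later) is:

```python
# Sketch: converting between an unsigned t-value and a two-tailed p-value.
from scipy import stats

n, t = 27, 2.18
df = n - 1

p = 2 * stats.t.sf(t, df)          # unsigned t -> two-tailed p (≈ 0.04)
t_back = stats.t.isf(p / 2, df)    # two-tailed p -> unsigned t, recovering 2.18
```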

Standardized effect sizes

For the case of a one-sample t-test, the population standardized effect size is:

$$ \delta = \frac{\mu - \mu_{0}}{\sigma} $$
(2)

where μ is the population mean, μ0 is the value in the null hypothesis, and σ is the population standard deviation. For a set of data, the estimated standardized effect size is Cohen’s d:

$$ d = \frac{\overline{X} - \mu_{0}}{s}, $$
(3)

which uses the estimated mean and standard deviation from the sample. The statistic d is what is called a “sufficient” statistic, which means it contains all the information about δ that is available in the data set; knowing the full data set would not provide any more information about δ.

It takes only a bit of algebra to see that Eqs. 1 and 3 are closely related so that:

$$ d = \frac{t}{\sqrt{n}} $$
(4)

and

$$ t = d\sqrt{n}. $$
(5)

Thus, for a known sample size, a given value of t is also a sufficient statistic for δ; and since a unique p-value corresponds to each unsigned t-value, the p-value is also a sufficient statistic for the unsigned δ.
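
A minimal sketch of Eqs. 4 and 5, using the same illustrative values as above:

```python
# Sketch: Eqs. 4 and 5 connecting Cohen's d and the t-value.
import math

n, t = 27, 2.18
d = t / math.sqrt(n)         # Eq. 4, ≈ 0.42 for these values
t_back = d * math.sqrt(n)    # Eq. 5, recovering the t-value
```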

For small sample sizes, d tends to overestimate the population standardized effect size (Hedges, 1981), so sometimes scientists prefer to report Hedges’ g:

$$ g = \left( 1 - \frac{3}{4(n-1) - 1}\right) d. $$
(6)

This equation is an excellent approximation of a more complicated formula involving gamma functions (Hedges, 1981), which itself depends only on the sample size. Goulet-Pelletier and Cousineau (2018) provide a nice review of standardized effect sizes and their estimates. Clearly, Hedges’ g is also uniquely determined by the value of t (and thus of p, up to the arbitrary sign), provided the sample size is known.
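
A sketch of Eq. 6 alongside our reconstruction of the exact gamma-function correction it approximates (Hedges, 1981); the values of n and d are illustrative:

```python
# Sketch: Hedges' g from Cohen's d via Eq. 6 and via the exact correction factor.
import math
from scipy.special import gammaln

n, d = 27, 0.42
df = n - 1

g_approx = (1 - 3 / (4 * df - 1)) * d                          # Eq. 6
# Exact small-sample correction factor, computed with log-gamma for stability.
J = math.exp(gammaln(df / 2) - gammaln((df - 1) / 2)) / math.sqrt(df / 2)
g_exact = J * d                                                # nearly identical to g_approx
```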

Confidence interval of a standardized effect size

Each endpoint of a confidence interval for a standardized effect size has a one-to-one relationship with the standardized effect size and, thus, with t- and p-values as well. This close relationship holds because the sampling distribution of a standardized effect size is a non-central t distribution that depends only on the corresponding t-value and the sample size. There is no simple formula to compute the confidence interval, but numerical techniques are easy to apply. The computation is reversible, so if the sample size is known, then one has as much knowledge about the data set by knowing just the upper limit of a confidence interval as by knowing the d, t-, or p-value of the sample. Likewise, knowing d and the sample size means that one already has sufficient information to compute the confidence interval of d.
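
One way to carry out this numerical inversion (a sketch of our own, not the authors’ procedure; it assumes SciPy and a two-sided 95% interval) is to solve for the noncentrality parameters that place the observed t at the appropriate percentiles of the non-central t distribution:

```python
# Sketch: 95% confidence interval for delta by inverting the noncentral t
# distribution, then converting the noncentrality limits to effect-size units.
import math
from scipy import stats, optimize

n, t_obs = 27, 2.18
df = n - 1

# Lower limit: noncentrality at which t_obs sits at the 97.5th percentile.
ncp_low = optimize.brentq(lambda nc: stats.nct.cdf(t_obs, df, nc) - 0.975, -10, 10)
# Upper limit: noncentrality at which t_obs sits at the 2.5th percentile.
ncp_high = optimize.brentq(lambda nc: stats.nct.cdf(t_obs, df, nc) - 0.025, -10, 10)

ci_d = (ncp_low / math.sqrt(n), ncp_high / math.sqrt(n))  # ≈ (0.02, 0.81) here
```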

Post hoc power

Power refers to the probability that a null hypothesis would be rejected given a specified value for the population (often given as a standardized effect size). Post hoc power supposes that the population value matches the estimate reported in an original experiment. The power value is then the probability that a replication experiment with the same sample size and design would reject the null hypothesis. Computing power requires knowing the sample size and the standardized effect size. We saw above that d (or g) can be computed from t or from p, so post hoc power can be directly computed from those terms (this relationship was previously noted by Hoenig and Heisey, 2001). One gets slightly different values when using d or g, but the calculation is invertible (up to the sign of t, d, and g); so knowing post hoc power for a data set allows computation of all the other statistics in Table 1.
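
A sketch of the post hoc power calculation for a two-tailed test, under the assumptions that α = .05 and that the population standardized effect size equals the observed d:

```python
# Sketch: post hoc power for a two-tailed one-sample t-test.
import math
from scipy import stats

n, d, alpha = 27, 0.42, 0.05
df = n - 1
ncp = d * math.sqrt(n)                      # noncentrality if the population delta = d
t_crit = stats.t.isf(alpha / 2, df)         # two-tailed critical t-value

power = stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)
```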

Log likelihood ratio

One approach to statistical inference is to compute the likelihood of observed data for different models. The preferred model is then the one with the higher likelihood for the data. In the case of log likelihood ratios, the goal is to determine which of two models best fits the sample data. When dealing with a one-sample t-test, it is possible to set up models similar to the null and alternative hypotheses. For the null model, one supposes that each subject will have an observed score, denoted as Xi. This score could come from a null (sometimes called a “reduced”) model:

$$ X_{i} = \mu_{0} + \epsilon_{i}, $$
(7)

where μ0 is the null hypothesized population mean and 𝜖i is random noise from a Gaussian distribution, \(N(0, {\sigma _{R}^{2}})\) with an unknown standard deviation, which is estimated from the data. The likelihood of observed data [X1,X2,...,Xn] for the reduced (null) model is computed as

$$ L_{R} = \prod\limits_{i=1}^{n} \frac{1}{\sqrt{2\pi\hat{\sigma}_{R}^{2}}} \exp\left( -\frac{(X_{i} - \mu_{0})^{2}}{2\hat{\sigma}_{R}^{2}}\right), $$
(8)

which is the product, across all data points, of the height of the Gaussian distribution with the corresponding mean and standard deviation of the reduced/null model. Here, the estimated standard deviation is the value that maximizes likelihood for the given mean, μ0, and the data points. This calculation is commonly known as the “population” formula for standard deviation, which is a biased estimator for the population standard deviation but generates the maximum likelihood values for the given data set. Even though the calculation of a t-value uses the unbiased estimate of standard deviation while maximum likelihood calculations use biased estimates of standard deviation, we show below that there is a simple formula connecting the statistics.

The alternative (full) model is more flexible in that it allows estimation of both the standard deviation and the mean. It assumes that each data point comes from a model:

$$ X_{i} = \mu + \epsilon_{i}, $$
(9)

where 𝜖i is random noise from a Gaussian distribution, \(N(0, {\sigma _{F}^{2}})\). The likelihood of data relative to this model is:

$$ L_{F} = \prod\limits_{i=1}^{n} \frac{1}{\sqrt{2\pi\hat{\sigma}_{F}^{2}}} \exp\left( - \frac{(X_{i} - \hat{\mu})^{2}}{2\hat{\sigma}_{F}^{2}} \right), $$
(10)

where \(\hat {\mu } = \overline {X}\), the typical mean of the sample (which maximizes likelihood), and \(\hat {\sigma }_{F}\) is the standard deviation of the sample relative to the sample mean (again using the “population” formula to maximize likelihood).

To test which model best predicts the observed data, the natural logarithm is taken of a ratio of the two likelihood functions:

$$ {\Lambda} = \ln\left( \frac{L_{F}}{L_{R}}\right) = \ln\left( L_{F}\right) - \ln\left( L_{R}\right). $$
(11)

Note that the null model is a special case of (is nested within) the full model. The full model has two free parameters (mean and standard deviation) compared to the reduced model having one free parameter (standard deviation). Because of the full model’s increased flexibility, we will always find that \(L_{F} \geq L_{R}\), and thus, Λ ≥ 0. To determine which model to support based on the value of Λ, it is common practice to pick a criterion threshold, much like one does with other statistics (e.g. p-values).
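
To make Eqs. 8–11 concrete, the following sketch computes ln(LR), ln(LF), and Λ directly from a hypothetical sample, using the maximum-likelihood (“population”) standard deviations described above:

```python
# Sketch: computing Lambda directly from a hypothetical sample via Eqs. 8-11.
import numpy as np
from scipy import stats

x = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4])  # hypothetical data
mu0 = 4.0
n = len(x)

sigma_R = np.sqrt(np.mean((x - mu0) ** 2))   # ML sigma under the reduced (null) model
sigma_F = x.std(ddof=0)                      # ML sigma under the full model

lnL_R = np.sum(stats.norm.logpdf(x, loc=mu0, scale=sigma_R))      # ln of Eq. 8
lnL_F = np.sum(stats.norm.logpdf(x, loc=x.mean(), scale=sigma_F)) # ln of Eq. 10
Lambda = lnL_F - lnL_R                                            # Eq. 11
```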

The log likelihood ratio has a one-to-one relationship with the unsigned t-value (Kendall & Stuart, 1961). The log likelihood ratio for the one-sample t-test case is:

$$ {\Lambda} = \frac{n}{2} \ln\left( 1 + \frac{t^{2}}{n-1} \right). $$
(12)

Thus, we can also compute the unsigned t statistic from Λ:

$$ t = \sqrt{(n-1)\left[ \exp\left( \frac{2}{n}{\Lambda} \right) - 1\right]}. $$
(13)

Thus, Λ provides exactly the same information about a data set as the other statistics in Table 1. The relationships in Eqs. (12) and (13) are based on maximum likelihood estimates, but sometimes scientists use other statistics for computing likelihoods. In at least one such case, there is, again, a direct relationship between t-values and Λ (Cousineau & Allan, 2015) that follows a different formula.
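
A minimal sketch of the conversions in Eqs. 12 and 13 (with illustrative values of n and t); applying Eq. 12 to the t-value computed from a sample reproduces the Λ obtained from the direct likelihood computation sketched above.

```python
# Sketch: Eqs. 12 and 13 converting between the unsigned t-value and Lambda.
import math

n, t = 27, 2.18
Lambda = (n / 2) * math.log(1 + t**2 / (n - 1))                # Eq. 12
t_back = math.sqrt((n - 1) * (math.exp(2 * Lambda / n) - 1))   # Eq. 13
```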

Model selection using an information criterion

When using the log likelihood ratio to choose between models, one runs the risk of over-fitting the data by selecting a model so complex that it ends up “predicting” data variation that is actually due to random noise. To quantify the risks associated with complex models, Akaike (1974) derived what is now called the Akaike Information Criterion (AIC), which takes into account the number of parameters in a model when judging which model fits the data set best. For a given model, the AIC value is:

$$ AIC = 2m - 2\ln(L), $$
(14)

where m is the number of parameters for the model and L is the likelihood of the observed data for that model. Smaller (more negative) values indicate better fit of the model to the data.

In order to determine which model fits the data better with this criterion, the AIC associated with the full (alternative) model is subtracted from the AIC associated with the reduced (null) model. For the situation corresponding to a one-sample t-test, the reduced (null) model has m = 1, while the full (alternative) model has m = 2. Thus, we compute:

$$ {\Delta} AIC = AIC(\text{reduced}) - AIC(\text{full}) = -2 - 2\ln\left( L_{R}\right) + 2\ln\left( L_{F}\right) = -2 + 2{\Lambda}. $$
(15)

If ΔAIC > 0, then the full model is preferred, and if ΔAIC < 0, then the reduced model is preferred. Sometimes researchers require a big enough difference (e.g., ΔAIC > 2 or ΔAIC < − 2) before claiming preference for one model.

The term on the far right of Eq. 15 indicates that ΔAIC for a one-sample t-test is easily computed from Λ. We already know that Λ can be computed from t and the sample size, so ΔAIC is based on precisely the same information as a t- or p-value.

For small samples, ΔAIC tends to favor complex models (i.e. those with more parameters). Hurvich and Tsai (1989) developed a formula that further penalizes complex models with small sample sizes and many parameters:

$$ AIC_{c} = AIC + \frac{2m(m+1)}{n - m - 1}. $$
(16)

Much like ΔAIC, ΔAICc is computed by taking the difference in AICc for the reduced and full models. For the one-sample t-test case, the formula is:

$$ {\Delta} AIC_{c} = {\Delta} AIC + \frac{4}{n-2} - \frac{12}{n-3}. $$
(17)

When ΔAICc < 0, then the data favors the reduced model, and when ΔAICc > 0, the data favors the full model. As the sample size n gets large, ΔAICc will converge to ΔAIC.

Schwarz (1978) developed another criterion for model selection called the Bayesian Information Criterion (BIC). For a model with m parameters, the BIC is:

$$ BIC = m\ln(n) - 2\ln(L). $$
(18)

Much like ΔAIC, we compute the difference in the BIC by subtracting the BIC for the full model from the BIC for the reduced model. Thus, for a one-sample t-test:

$$ {\Delta} BIC = -\ln(n) + 2{\Lambda}. $$
(19)

As before, if ΔBIC > 0, then the full model is preferred, and similarly, if ΔBIC < 0, then the reduced model is preferred.

As just shown, the AICc and BIC can be computed from Λ, and Λ can be computed from t. Thus, these two model selection statistics are derived from the same information given by the other statistics in Table 1.
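
The three differences are then one-line computations from Λ, as in the following sketch (the values of n and t are illustrative):

```python
# Sketch: Eqs. 15, 17, and 19, computing the information-criterion differences
# directly from the log likelihood ratio Lambda.
import math

n, t = 27, 2.18
Lambda = (n / 2) * math.log(1 + t**2 / (n - 1))   # Eq. 12

dAIC  = -2 + 2 * Lambda                           # Eq. 15
dAICc = dAIC + 4 / (n - 2) - 12 / (n - 3)         # Eq. 17
dBIC  = -math.log(n) + 2 * Lambda                 # Eq. 19
```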

JZS Bayes Factor

A Bayes factor computes likelihoods for null and alternative models, much like a likelihood ratio, but it computes mean likelihood across all model parameter values defined by a prior probability distribution. Rouder et al. (2009) argued that a convenient (both in terms of computability and in meeting the needs of practicing scientists) prior is based on a Cauchy distribution for a standardized effect size. This prior distribution is derived from priors placed on other parameters that were previously suggested by other authors, so it is called a Jeffreys, Zellner, and Siow (JZS) prior. For the case of a one-sample t-test, Rouder et al. showed that such a prior results in a relatively straightforward calculation of the JZS Bayes factor that depends only on the t-value and the sample size.

The calculation is invertible (up to an unknown sign term), and thus, there is a one-to-one relation between the JZS Bayes factor, the unsigned t-value, and the p-value. The breadth of the prior for the alternative hypothesis can be easily adjusted with a scale term, r. An often reasonable default value is \(r=\sqrt {2}/2\), but the relationship between an unsigned t-value and the corresponding JZS Bayes factor holds for every scale term. Interpreting a Bayes factor (BF) requires knowledge of the properties of the prior, so the scale value should be known for any given situation.
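
The exact integral is given by Rouder et al. (2009). The sketch below is our own illustration of one equivalent way to compute the same quantity, by averaging the noncentral-t likelihood of the observed t over a Cauchy(0, r) prior on δ; the function name jzs_bf10 is ours, and the result should be read only as a numerical sketch, not as the authors’ implementation.

```python
# Sketch: JZS Bayes factor (alternative over null) for a one-sample t-test,
# computed by marginalizing the noncentral-t likelihood over a Cauchy prior.
import math
from scipy import stats, integrate

def jzs_bf10(t, n, r=math.sqrt(2) / 2):
    df = n - 1
    # Marginal likelihood of t under the alternative: average over the Cauchy prior on delta.
    integrand = lambda delta: (stats.nct.pdf(t, df, delta * math.sqrt(n))
                               * stats.cauchy.pdf(delta, scale=r))
    m1, _ = integrate.quad(integrand, -math.inf, math.inf)
    m0 = stats.t.pdf(t, df)        # likelihood of t under the null (delta = 0)
    return m1 / m0

bf10 = jzs_bf10(2.18, 27)          # BF10 > 1 favors the alternative model
```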

Discussion

We have shown that, provided the sample size n is known for a one-sample two-sided t-test, then the unsigned t-value, p-value, unsigned Cohen’s d, unsigned Hedges’ g, the limits of a confidence interval of unsigned Cohen’s d or Hedges’ g, post hoc power derived from Cohen’s d or Hedges’ g, the log likelihood ratio, ΔAIC, ΔAICc, ΔBIC, and the JZS Bayes factor value give equivalent information about the data set. This equivalence means that debates over which statistic provides the “most” information about the data set should end, as they all provide the same mathematical information. Instead, progress should focus on how these statistics are interpreted. Since a p-value contains as much information about the data set as the statistics for other inferential frameworks, critiques of the p-value should either rest in the interpretation of the p-value or should apply to all of the statistics in Table 1.

Inference in different frameworks

Since all the statistics in Table 1 convey the same information about a data set, how should a scientist choose between them? One answer is simple: share the statistic that makes it easy for the reader to understand the information. For example, if an article describes a study with n = 27 and t = 2.18, it is redundant to also say that p = 0.04, d = 0.42, and the 95% confidence interval of d is (0.02, 0.81). Nevertheless, such redundancy can be helpful to readers by making explicit what is otherwise implicit information.

Still, one would hardly expect a paper to describe all fifteen statistics in Table 1, and such reporting would generally not be appropriate because the different statistics derive meaning only within their respective inferential frameworks. Indeed, a key implication of the ability to transform the statistics in Table 1 is that various inferential frameworks are based on fundamentally different interpretations of common information. Here, we briefly summarize those interpretations for different frameworks so that readers can determine which framework is most appropriate for their particular research.

Frequentist hypothesis testing

The t- and p-values are part of traditional hypothesis testing approaches. The fundamental inference is whether to reject the null hypothesis (p < α). A key component of this decision making process is that it, in ideal settings, controls the Type I error rate (rejecting a true null hypothesis).

Standardized effect size

The d and g estimates of the standardized effect size, and their confidence intervals, provide information about the distance (in units of standard deviation) between the population mean and the value in the null hypothesis. This signal-to-noise ratio value is useful because it gives an indication of how easy it is to generate a sample that distinguishes between the presence or absence of an effect.

Power

Power is most useful for planning an experiment. There is little point in running an experiment with low power. Post hoc power (whether based on d or g values) gives an estimate of the probability that a replication study with the same sample size will produce a similar decision outcome.

Model comparison methods

Rather than focus on controlling Type I error, model comparison methods try to identify which model best fits observed data. The most straightforward method is the (log) likelihood ratio, but it has the property that a more complicated model always fits a data set better than a simpler, nested model. This property is unappealing because the complicated model will “fit” noise as well as signal in the data set. Such overfitting will cause the more complicated model to poorly fit new data drawn from the same population because the new data set will have a different pattern of noise.

The ΔAIC and ΔAICc statistics include complexity penalties in such a way that (in ideal situations) the preferred model will better predict new replication data. Given the importance of replication in scientific investigations (Earp and Trafimow, 2015; Zwaan et al., 2018), these statistics seem like they would be of interest to many scientists. One advantage of this approach compared to frequentist hypothesis testing is that it can provide support for the null model, compared to the alternative model.

There are some disadvantages to this approach. One disadvantage is that it does not control the Type I error rate, which many scientists feel is important. A second disadvantage is that, even with large sample sizes, the ΔAIC statistic is not guaranteed to select the correct model. Indeed, these methods tend to have a bias toward selecting models that are more complicated than reality.

The ΔBIC statistic addresses some of these concerns. In particular, if the true model is one that is being considered, ΔBIC will select it, at least for large enough sample sizes. Of course, if the true model is not among the candidates being compared, no model selection method can identify it.

Whether one prefers ΔAIC or ΔBIC partly depends on what kinds of models a scientist believes are being compared. If the scientist wants to identify a model that best predicts future data, then ΔAIC is the better choice, even though the resulting model may not be the true model (or even a good model). If the scientist believes that the true model is under consideration, then ΔBIC is the better choice because it will (for a large enough sample) almost surely be identified as the best model (Burnham & Anderson, 2004).

Bayes factors

The Bayes factor is also a model comparison statistic, but it has a different perspective than the information criterion methods. Whereas the information criterion methods use the current data to predict future data, the Bayes factor directly compares competing models with the current data. It does this by including a prior distribution for the parameters that are part of each model. The prior represents uncertainty, which can be resolved by data. Similar to ΔBIC, a Bayes factor will (given a sufficiently large sample) identify the true model, if it is under consideration. In addition, being able to specify a prior allows the Bayes factor approach to compare complicated models in ways that are not possible with ΔBIC or ΔAIC.

Relationships between inferential frameworks

Even though they have radically different inferential interpretations, all of the statistics in Table 1 are based on the very same information in the data set. Thus, for scientists not familiar with the statistics, it might be fruitful to consider how they are related to each other. We do this by considering values of statistics that correspond to commonly used inferential thresholds.

Figure 1 plots p-values as a function of sample size (n ∈ {10, …, 400}). Each curve plots the p-values corresponding to a commonly used criterion for other inferential frameworks. Figure 1A plots the p-values corresponding to criterion JZS BF values. The value BF = 3 is often taken to provide the minimal amount of support for the alternative hypothesis, while \(BF = \frac {1}{3}\) is often taken to provide the minimal amount of support for the null hypothesis. BF = 1 indicates equal support for either hypothesis. The JZS BF is more conservative than the p < 0.05 criterion in the sense that the p-values associated with BF = 3 for n > 10 are all less than 0.025. As sample size increases, even smaller p-values are needed to find minimal support for the alternative hypothesis. On the other hand, for \(BF = \frac {1}{3}\), we find that the associated p-values for small sample sizes are quite large; however, the p-values do decrease as the sample size increases. For large samples, what might be considered as “nearly significant” p-values are treated by the Bayes factor as evidence for the null hypothesis.

Fig. 1 p-values that correspond to different criterion values for other inferential statistics. (A) p-values that correspond to JZS BF = 3, JZS BF = 1, and JZS \(BF = \frac {1}{3}\). (B) p-values that correspond to ΔAIC = 0 and ΔAICc = 0. (C) p-values that correspond to ΔBIC = 0, ΔBIC = 2, and ΔBIC = − 2. (D) p-values on log-log plots that correspond to the different criterion values included in (A), (B), and (C)

Figure 1B plots the p-values corresponding to ΔAIC = 0 and ΔAICc = 0, the criteria that separate support for the reduced and full models. In general, these information criteria are more lenient than using p < 0.05. Starting at n = 120, the p-value is bounded between 0.15 and 0.16, so if a data set produces p < 0.16, then one could interpret that as providing better support for the alternative than for the null.
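
This bound can be checked directly by combining Eq. 13 (with Λ = 1, where ΔAIC = 0) and the t-to-p conversion; a short sketch:

```python
# Sketch: p-values at which Delta-AIC = 0 (i.e., Lambda = 1), using Eq. 13.
import math
from scipy import stats

for n in (120, 400, 100000):
    t = math.sqrt((n - 1) * (math.exp(2 / n) - 1))   # t-value where Lambda = 1
    p = 2 * stats.t.sf(t, n - 1)
    print(n, round(p, 4))   # stays between 0.15 and 0.16, approaching ~0.157
```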

Figure 1C plots the p-values corresponding to ΔBIC = 0, ΔBIC = 2, and ΔBIC = − 2, which are common criteria indicating equal support for both models (ΔBIC = 0), minimal evidence for the full model (ΔBIC = 2), or minimal evidence for the reduced model (ΔBIC = − 2). ΔBIC is more lenient with small sample sizes and more strict with larger sample sizes than the typical p < 0.05 criterion.

In Fig. 1D, the p-values for the various inference criteria are plotted on a log-log graph. Much like what was found in Francis (2017) for the two-sample t-test, ΔAIC and ΔAICc are the most lenient of the different criteria for indicating support for the alternative hypothesis, and the JZS BF is the most strict.

The graphs provided in Fig. 1 were created with “minimal” criterion values in mind. Whether a method is conservative or lenient with regard to supporting the alternative hypothesis (or rejecting the null) depends on somewhat arbitrary criteria that are not an inherent part of the inferential framework. Still, we think it is interesting to notice how the criteria from one framework map onto a different framework. This mapping may help scientists better understand the relationships between the frameworks and the impact of the criteria they use.

Conclusions

Over the past several years, the field of psychology has sought ways to improve data analysis, the reporting of results, and the development of theories; and one common approach is to use new methods of statistical inference. However, as Table 1 indicates, many of the statistics commonly utilized are mathematically equivalent. This equivalence does not mean that the various analysis methods are exactly the same, and there are cases where the methods disagree with each other when applied to the same data. For example, a sample of n = 200 with estimated d = 0.15 corresponds to t = 2.12 and p = 0.035, which indicates that a scientist should reject the null hypothesis. Consistent with that interpretation, the AIC model comparison approach gives ΔAIC = 2.47 and ΔAICc = 2.43, which indicates that the alternative model will better predict future data than the null model. However, the JZS Bayes factor gives BF = 0.71 and the BIC comparison gives ΔBIC = − 0.83, both of which indicate modest support for the null hypothesis. Thus, it is definitely not the case that all the statistics give the same answer, even though they are mathematically related to each other. Such disagreement is appropriate because the different inferential frameworks ask (and answer) different questions. When deciding how to interpret their data, scientists should carefully think about the questions being addressed by their statistics.
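
For readers who wish to verify the numbers in this example, a short sketch chains the conversions derived above (the JZS value of 0.71 can be recovered with the jzs_bf10 sketch given earlier; values are rounded):

```python
# Sketch: reproducing the worked example for n = 200, d = 0.15.
import math
from scipy import stats

n, d = 200, 0.15
t = d * math.sqrt(n)                              # ≈ 2.12
p = 2 * stats.t.sf(t, n - 1)                      # ≈ 0.035
Lambda = (n / 2) * math.log(1 + t**2 / (n - 1))
dAIC = -2 + 2 * Lambda                            # ≈ 2.47
dAICc = dAIC + 4 / (n - 2) - 12 / (n - 3)         # ≈ 2.43
dBIC = -math.log(n) + 2 * Lambda                  # ≈ -0.83
# jzs_bf10(t, n) from the earlier sketch gives the JZS Bayes factor (≈ 0.71).
```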