1 Introduction

Consider independent samples of sizes \(n_{0}\) and \(n_{1}\) of a statistical variable Y in two groups such that Y follows a normal distribution with expectation \(\mu _{0}\) and variance \(\sigma ^2\) in group “0” and expectation \(\mu _{1}\) and the same variance \(\sigma ^2\) in group “1”. As an estimator for the size d of the group effect, Cohen (1988, p. 66ff) considers

$$\begin{aligned} {\widehat{d}} = \frac{{\overline{Y}}_{1}-{\overline{Y}}_{0}}{S}, \quad S =\sqrt{\frac{Q_{0} + Q_{1}}{n_{0} + n_{1} - 2}}\;, \end{aligned}$$
(1)

where \({\overline{Y}}_{0}\) and \({\overline{Y}}_{1}\) are the sample means and \(Q_{0}\) and \(Q_{1}\) are the sums of squared differences from the respective sample means in the two groups. The estimator \({\widehat{d}}\) is related to the statistic t of the usual two-sample t test for the null hypothesis \(H_{0}: \mu _{0} = \mu _{1}\) versus the alternative \(H_{1}: \mu _{0} \not = \mu _{1}\) by the formula

$$\begin{aligned} {\widehat{d}} = t \, \sqrt{\frac{2}{{\widetilde{n}}}}\;, \end{aligned}$$
(2)

see (2.5.3) in Cohen (1988). Here, \({\widetilde{n}}\) denotes the harmonic mean of the two numbers \(n_{0}\) and \(n_{1}\), so that \(2/{\widetilde{n}} = (n_{0}+n_{1})/(n_{0} \cdot n_{1})\). Cohen suggests the values \(|{\widehat{d}}| =0.2\), \(|{\widehat{d}}|=0.5\) and \(|{\widehat{d}}|=0.8\) as an indication of a small, medium and large effect, respectively.
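As a brief illustration, the following minimal R sketch (with simulated data, not taken from any study) computes \({\widehat{d}}\) once via (1) and once via the t statistic as in (2); both routes give the same value.

```r
set.seed(1)
y0 <- rnorm(30, mean = 0.0, sd = 1)  # group "0"
y1 <- rnorm(40, mean = 0.5, sd = 1)  # group "1"
n0 <- length(y0); n1 <- length(y1)

# Pooled standard deviation S and Cohen's d from (1)
S <- sqrt((sum((y0 - mean(y0))^2) + sum((y1 - mean(y1))^2)) / (n0 + n1 - 2))
d_hat <- (mean(y1) - mean(y0)) / S

# The same value via the two-sample t statistic, formula (2)
t_stat  <- t.test(y1, y0, var.equal = TRUE)$statistic
d_hat_t <- t_stat * sqrt((n0 + n1) / (n0 * n1))

c(d_hat, d_hat_t)  # identical up to floating point error
```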

When the variable Y of interest is considered to possibly depend on a number of explanatory variables, one may consider a linear regression model described by

$$\begin{aligned} Y = \beta _{0} + \beta _{1} X_{1} + \beta _{2} X_{2} + \cdots + \beta _{1+k} X_{1+k} + \varepsilon , \end{aligned}$$
(3)

where \(\varepsilon \) follows a normal distribution with expectation 0 and variance \(\sigma ^2 >0\). The grouping variable \(X_{1}\) takes the value 0 when a response observation falls into group “0” and 1 when an observation falls into group “1”. There are k further explanatory variables \(X_{2}, \ldots , X_{1+k}\), which are assumed to be absent in the simple case \(k=0\).

From Eq. (3) the expectation of Y conditional on the group is given as

$$\begin{aligned} \mu _{0} = \text {E}[Y|X_{1} = 0] = \beta _{0} + \beta _{2} X_{2} + \cdots + \beta _{1+k} X_{1 + k} \end{aligned}$$
(4)

in group “0” and \(\mu _{1} = \text {E}[Y|X_{1} = 1] = \text {E}[Y|X_{1} = 0] + \beta _{1}\) in group “1”. Hence,

$$\begin{aligned} d = \frac{\mu _{1} - \mu _{0}}{\sigma } = \frac{\beta _{1}}{\sigma } \end{aligned}$$
(5)

is the unknown population effect size, see Cohen (1988, (2.5.1)).

Recently, Groß and Möller (2023) considered a natural generalization of Cohen’s estimator from (2), based on the properties outlined above, to the regression setting with additional explanatory variables. This estimator is given by

$$\begin{aligned} {\widehat{d}} = \frac{{\widehat{\beta }}_{1}}{\sqrt{{\widehat{\sigma }}^2}}\;, \end{aligned}$$
(6)

where \({\widehat{\beta }}_{1}\) is the least squares estimator of \(\beta _{1}\) and \({\widehat{\sigma }}^2\) is the usual unbiased estimator for \(\sigma ^2\) under model (3). In matrix notation the model may also be written as

$$\begin{aligned} {\varvec{Y}} = {\varvec{X}} {\varvec{\beta }} + {\varvec{\varepsilon }} \end{aligned}$$
(7)

where \({\varvec{Y}}\) represents the \(n \times 1\) vector of observations of the explained variable Y, the \(n\times (2 +k)\) model matrix \({\varvec{X}}\) is assumed to be of full column rank, and \({\varvec{\varepsilon }}\) follows an n-variate normal distribution with expectation vector \({\varvec{0}}\) and variance-covariance matrix \(\sigma ^2 {\varvec{I}}_{n}\), with \({\varvec{I}}_{n}\) denoting the \(n\times n\) identity matrix. Generalizations of this setting are reviewed, e.g., by Groß (2004). Under model (7),

$$\begin{aligned} {\widehat{\beta }}_{1} = {\varvec{e}}^{\prime } \widehat{{\varvec{\beta }}}, \quad \widehat{{\varvec{\beta }}} = ({\varvec{X}}^{\prime } {\varvec{X}})^{-1} {\varvec{X}}^{\prime } {\varvec{Y}}\;, \end{aligned}$$
(8)

where \({\varvec{e}}\) is the \((2+k)\times 1\) vector of 0s except for a 1 at the position of \(\beta _{1}\) in the \((2+k)\times 1\) parameter vector \({\varvec{\beta }} =(\beta _{0}, \beta _{1}, \beta _{2}, \ldots , \beta _{1+k})^{\prime }\), with \(\beta _{2}\) up to \(\beta _{1+k}\) considered as absent in case \(k=0\). Moreover,

$$\begin{aligned} {\widehat{\sigma }}^{2} = m^{-1} ({\varvec{Y}} - {\varvec{X}} \widehat{{\varvec{\beta }}})^{\prime }({\varvec{Y}} - {\varvec{X}} \widehat{{\varvec{\beta }}}), \quad m = n - 2 - k\;. \end{aligned}$$

In this setting, as noted by Groß and Möller (2023), both formulas (6) and (2) yield identical estimates for the special case of \(k=0\).
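A minimal R sketch of this equivalence, continuing the simulated two-group data from the sketch above: the regression-based estimator (6) with \(k=0\) reproduces the classical value.

```r
y  <- c(y0, y1)
x1 <- rep(c(0, 1), times = c(n0, n1))  # grouping variable X1

fit <- lm(y ~ x1)
d_hat_reg <- coef(fit)["x1"] / sigma(fit)     # beta1_hat / sqrt(sigma2_hat)
all.equal(unname(d_hat_reg), unname(d_hat))   # TRUE
```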

In the following, Sect. 2 states some theoretical results which refer to, and generalize, known results in the literature on unbiased estimation of Cohen’s d in the presence of covariates. Section 3 then provides a concise workflow for applying the proposed method to a publicly available data set.

2 Statistical properties of Cohen’s d

Consider the two quantities

$$\begin{aligned} v_{1}^2 = \sigma ^{-2} \, \text {var}({\widehat{\beta }}_{1}) = {\varvec{e}}^{\prime } ({\varvec{X}}^{\prime } {\varvec{X}})^{-1} {\varvec{e}} \quad \text {and}\quad \tau = \frac{\beta _{1}}{\sqrt{\sigma ^2 v_{1}^{2}}} = \frac{d}{\sqrt{v_{1}^{2}}}\;. \end{aligned}$$
(9)

From Seber and Lee (2003, Theorem 3.5), for example, it is seen that the random variable \(X = {\widehat{\beta }}_{1}/\sqrt{\sigma ^2 v_{1}^{2}}\) follows a normal distribution with expectation \(\tau \) and variance 1, and that the random variable \(Y = m {\widehat{\sigma }}^{2}/\sigma ^2\) independently follows a (central) chi-squared distribution with m degrees of freedom. Then, the ratio \(X/\sqrt{Y/m} = {\widehat{d}}/\sqrt{v_{1}^{2}}\) follows a non-central t distribution, see e.g. Johnson and Welch (1940). Thus, one may state the following.

Proposition 1

Under the assumptions of model (7),

$$\begin{aligned} \frac{{\widehat{d}}}{\sqrt{v_{1}^{2}}} \sim t(m, \tau )\;, \end{aligned}$$
(10)

where \(t(m, \tau )\) denotes the non-central t distribution with m degrees of freedom and non-centrality parameter \(\tau \).

In case \(k=0\) the model matrix \({\varvec{X}}\) becomes

$$\begin{aligned} {\varvec{X}} = \begin{pmatrix} {\varvec{1}}_{n_{0}} & {\varvec{0}}\\ {\varvec{1}}_{n_{1}} & {\varvec{1}}_{n_{1}} \end{pmatrix}\;, \end{aligned}$$
(11)

where \({\varvec{1}}_{\nu }\) denotes the \(\nu \times 1\) vector of 1s, and a little matrix algebra, sketched below, reveals

$$\begin{aligned} v_{1}^{2} = \frac{n_{0} + n_{1}}{n_{0} n_{1}} = \frac{2}{{\widetilde{n}}}\;. \end{aligned}$$
(12)
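Specifically, one finds

$$\begin{aligned} {\varvec{X}}^{\prime } {\varvec{X}} = \begin{pmatrix} n_{0} + n_{1} & n_{1}\\ n_{1} & n_{1} \end{pmatrix}, \quad ({\varvec{X}}^{\prime } {\varvec{X}})^{-1} = \frac{1}{n_{0} n_{1}} \begin{pmatrix} n_{1} & -n_{1}\\ -n_{1} & n_{0} + n_{1} \end{pmatrix}\;, \end{aligned}$$

since \(\det ({\varvec{X}}^{\prime } {\varvec{X}}) = (n_{0}+n_{1})n_{1} - n_{1}^{2} = n_{0} n_{1}\), so that \(v_{1}^{2} = {\varvec{e}}^{\prime } ({\varvec{X}}^{\prime } {\varvec{X}})^{-1} {\varvec{e}}\) is the lower right diagonal element \((n_{0}+n_{1})/(n_{0} n_{1})\).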

Then Proposition 1 is easily seen to reduce to the result given by Hedges (1981, Sect. 3).

From Johnson and Welch (1940), the expectation and variance of the \(t(m, \tau )\) distribution are given as

$$\begin{aligned} \mu _{1} = c(m) \tau \quad \text {and}\quad \mu _{2} = \frac{m}{m-2} + \left( \frac{m}{m-2} - c(m)^2\right) \tau ^2\;, \end{aligned}$$
(13)

where

$$\begin{aligned} c(m) = \frac{\Gamma ((m-1)/2)\sqrt{m/2}}{\Gamma (m/2)}\;. \end{aligned}$$
(14)

Remark 1

From Tricomi and Erdélyi (1951) one may conclude

$$\begin{aligned} c(m) = 1 + \frac{3}{4 m} + O(m^{-2}) \quad \text {as }m \rightarrow \infty \;, \end{aligned}$$
(15)

implying that \(\lim _{m\rightarrow \infty } c(m) = 1\).

From Remark 1, the number \(1 + 3/(4 m)\) may serve as a rough approximation of c(m), which, however, is less precise than the proposal

$$\begin{aligned} c(m) \approx \left( 1- \frac{3}{4m-1}\right) ^{-1} \end{aligned}$$
(16)

from Hedges (1981), see also Table 2 in Goulet-Pelletier and Cousineau (2018) for a comparison of exact values with corresponding approximations. As another approximation not further investigated here one may consider \(c(m)\approx \sqrt{2 m/(2m-3)}\) for larger m, which may be concluded from Theorem 2.1 in Laforgia and Natalini (2012).
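The following R sketch compares the exact value of c(m) from (14), computed on the log scale to avoid overflow of the gamma function, with the approximations from Remark 1 and from (16).

```r
c_exact  <- function(m) exp(lgamma((m - 1) / 2) - lgamma(m / 2)) * sqrt(m / 2)
c_series <- function(m) 1 + 3 / (4 * m)            # Remark 1
c_hedges <- function(m) 1 / (1 - 3 / (4 * m - 1))  # approximation (16)

m <- c(5, 10, 50, 391)
round(cbind(m, exact = c_exact(m), series = c_series(m),
            hedges = c_hedges(m)), 6)
```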

Now, combining Proposition 1 with (13) and noting \(d = \tau \sqrt{v_{1}^{2}}\) gives the following.

Proposition 2

Under the assumptions of model (7) with \(m = n - 2 - k > 2\),

$$\begin{aligned} \text {E}({\widehat{d}}) = c(m) d \quad \text {and} \quad \text {Var}({\widehat{d}}) = \frac{m }{m-2} v_{1}^{2} + \left( \frac{m}{m-2}- c(m)^2\right) d^2 \; \end{aligned}$$
(17)

for \({\widehat{d}}\) from (6).
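As a quick numerical check of Proposition 2, one may simulate from model (7); the design below, with one covariate (\(k=1\)) and 20000 replications, is purely illustrative.

```r
set.seed(42)
n <- 60; k <- 1; m <- n - 2 - k
x1 <- rep(c(0, 1), each = n / 2)        # grouping variable
x2 <- rnorm(n)                          # one additional covariate
X  <- cbind(1, x1, x2)                  # n x (2 + k) model matrix
beta <- c(1, 0.5, 0.3); sigma <- 2
v1sq <- solve(crossprod(X))[2, 2]       # e'(X'X)^{-1}e from (9)
d    <- beta[2] / sigma                 # population effect size (5)

d_hat <- replicate(20000, {
  y   <- drop(X %*% beta) + rnorm(n, sd = sigma)
  fit <- lm.fit(X, y)
  fit$coefficients[2] / sqrt(sum(fit$residuals^2) / m)
})

cm <- exp(lgamma((m - 1) / 2) - lgamma(m / 2)) * sqrt(m / 2)
c(mean(d_hat), cm * d)  # simulated vs. theoretical expectation, (17)
c(var(d_hat), m / (m - 2) * v1sq + (m / (m - 2) - cm^2) * d^2)  # variance
```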

When considering the asymptotic behaviour of \({\widehat{d}}\) for an increasing number of observations, it is reasonable to assume that the group size proportion remains constant, i.e.

$$\begin{aligned} n_{0} = \gamma n\quad \text {and}\quad n_{1} = (1-\gamma ) n \end{aligned}$$
(18)

for some \(0< \gamma <1\) and any positive integer n. Then, letting n approach \(\infty \) is equivalent to letting m approach \(\infty \), provided the number k of additional independent variables does not depend on the number of observations.

Remark 2

From the above Proposition 2 and Remark 1, it follows that \({\widehat{d}}\) is asymptotically unbiased for d, i.e. \(\lim _{m\rightarrow \infty } \text {E}({\widehat{d}}) = d\). If \(\lim _{m\rightarrow \infty } \text {var}({\widehat{\beta }}_{1}) = 0\), then \(\lim _{m\rightarrow \infty } \text {var}({\widehat{d}}) = 0\), in which case \({\widehat{d}}\) is consistent in quadratic mean for d under model (7).

From (12) it is easily seen that the assumption \(\lim _{m\rightarrow \infty } \text {var}({\widehat{\beta }}_{1}) = 0\) is satisfied when \(k=0\). The consistency of \({\widehat{d}}\) in this case has already been noted by Hedges (1981, p. 112).

From Proposition 2 an unbiased estimator for d is provided by \({\widehat{d}}_{u} = c(m)^{-1} {\widehat{d}}\), which has also been called Hedges’ g when \(k=0\), cf. Hedges (1981). A corresponding standard error may be derived from Proposition 2 by considering the square root of \(\text {var}({\widehat{d}}_{u})\) with d replaced by \({\widehat{d}}_{u}\).

Remark 3

An unbiased estimator for the parameter d is given by \({\widehat{d}}_{u} = c(m)^{-1} {\widehat{d}}\) with corresponding standard error

$$\begin{aligned} \text {se}({\widehat{d}}_{u}) = \sqrt{\frac{m\, c(m)^{-2}}{m-2} v_{1}^{2} + \left( \frac{m\, c(m)^{-2}}{m-2}- 1\right) {{\widehat{d}}_{u}}^2}\;. \end{aligned}$$
(19)

As also noted in Goulet-Pelletier and Cousineau (2018), unbiased estimation is to be preferred over biased estimation, but for large m the difference between \({\widehat{d}}\) and \({\widehat{d}}_{u}\) is quite small.
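Remark 3 translates directly into a few lines of R; the helper name d_unbiased() is an illustrative choice, not taken from any package.

```r
d_unbiased <- function(d_hat, v1sq, m) {
  cm  <- exp(lgamma((m - 1) / 2) - lgamma(m / 2)) * sqrt(m / 2)  # c(m), (14)
  d_u <- d_hat / cm                        # unbiased estimate
  a   <- m / ((m - 2) * cm^2)              # m c(m)^{-2} / (m - 2)
  se  <- sqrt(a * v1sq + (a - 1) * d_u^2)  # standard error (19)
  c(d_u = d_u, se = se)
}
```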

From Proposition 1 it is possible to construct a confidence interval for the parameter \(\tau \) defined in (9) by applying the inversion confidence interval principle from Proposition 2 in Steiger and Fouladi (1997). For this, let \(F(\tau )\) denote the cumulative distribution function of the \(t(m,\tau )\) distribution with \(m=n-2-k\) degrees of freedom, evaluated at \({\widehat{d}}/\sqrt{v_{1}^{2}}\) and considered as a function of the non-centrality parameter \(\tau \). For a specified \(\alpha \) with \(0< \alpha < 1\), let \(\tau _{1}\) satisfy \(F(\tau _{1}) = 1-\alpha /2\) and let \(\tau _{2}\) satisfy \(F(\tau _{2}) = \alpha /2\); since \(F(\tau )\) is strictly decreasing in \(\tau \), both equations have unique solutions with \(\tau _{1} < \tau _{2}\). Then the interval \([\tau _{1}, \tau _{2}]\) specifies a \((1-\alpha )\) confidence interval for \(\tau \). Hence we may state the following.

Remark 4

Let \(\tau _{1}\) and \(\tau _{2}\) be obtained as described above. Then the interval

$$\begin{aligned} \left[ \tau _{1}\sqrt{v_{1}^{2}},\, \tau _{2} \sqrt{v_{1}^{2}}\right] \end{aligned}$$

specifies a \((1-\alpha )\) confidence interval for the parameter d.

Note that this approach has also been illustrated in Example 3 by Steiger and Fouladi (1997) for the special case \(k=0\).
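In R, the inversion interval may be computed numerically with uniroot() applied to the non-central t distribution function pt(); the helper name d_confint() and the width of the search interval below are illustrative choices, a minimal sketch rather than a definitive implementation.

```r
d_confint <- function(d_hat, v1sq, m, alpha = 0.05) {
  q <- d_hat / sqrt(v1sq)  # observed value of the t(m, tau) variable, (10)
  f <- function(tau, p) pt(q, df = m, ncp = tau) - p
  tau1 <- uniroot(f, interval = q + c(-10, 10), p = 1 - alpha / 2)$root
  tau2 <- uniroot(f, interval = q + c(-10, 10), p = alpha / 2)$root
  c(lower = tau1, upper = tau2) * sqrt(v1sq)  # interval for d, Remark 4
}
```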

3 Data example

To illustrate and apply the properties listed above, a data set available from the UCI machine learning repository is employed, see Dua and Graff (2017). It contains measurements of student achievement in secondary education at two Portuguese schools, see Cortez and Silva (2008). The following computations are carried out with the statistical software R (R Core Team 2023).

Fig. 1 Frequency distribution of final Math results (variable G3) from \(n=395\) students

We consider the final Mathematics grade with integer values between 0 and 20 as the dependent variable Y from \(n=395\) students in two groups. Group 0 is defined by home address indicated as ‘rural’ with \(n_{0} = 88\) observations, while group 1 is defined by home address indicated as ‘urban’ with \(n_{1} = 307\) observations, see Fig. 1. On average, students from group 1 perform better (\({\overline{Y}}_{1} = 10.674267\)) than students from group 0 (\({\overline{Y}}_{0} = 9.511364\)). The corresponding two-sided two-sample t test statistic with equal variances reads \(|t| = 2.1084\), yielding a p-value of 0.03563. Hence, one may conclude a significant difference at significance level 0.05.

Cohen’s d estimator from (2) reads

$$\begin{aligned} {\widehat{d}} = 2.1084 \sqrt{\frac{307 + 88}{307 \cdot 88}} = 0.2549\;, \end{aligned}$$
(20)

indicating a rather small effect. The very same value may also be obtained from the R package effectsize, see Ben-Shachar et al. (2020), except for an opposite sign. This presumably originates from using the difference \({\overline{Y}}_{0} - {\overline{Y}}_{1}\) instead of \({\overline{Y}}_{1} - {\overline{Y}}_{0}\) in the involved formulas. Indeed, the t test statistic in R is computed from the first difference, while the regression coefficient \({\widehat{\beta }}_{1}\) is identical to the second difference in the special case \(k=0\). In the general regression context with \(k>0\), the coefficient \({\widehat{\beta }}_{1}\) is the estimated positive or negative increment of the intercept in group 1 compared to group 0 conditional on the independent variables, and may even have a sign opposite to that of the unconditional mean difference \({\overline{Y}}_{1} - {\overline{Y}}_{0}\).

From fitting the model \(Y= \beta _{0} + \beta _{1} X_{1} + \varepsilon \) one gets \({\widehat{\beta }}_{1} = 1.162903\) and \(\sqrt{{\widehat{\sigma }}^{2}} = 4.561543\), yielding the same estimate \({\widehat{d}}\) by (6) as before. By using the approximation \(c(n-2) \approx 1.001913\) from (16) one gets \({\widehat{d}}_{u} = c(n-2)^{-1} {\widehat{d}} = 0.2544\). Except for the sign, this is also exactly the value of Hedges’ g computed from the package effectsize.
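For reference, the computations so far may be reproduced in R along the following lines; the file name student-mat.csv and the column names address and G3 are those of the UCI data set (which is ‘;’-separated), assumed to reside in the working directory.

```r
mat <- read.csv("student-mat.csv", sep = ";")
mat$x1 <- as.numeric(mat$address == "U")  # 1 = urban (group 1), 0 = rural

# R reports t for the "group 0 minus group 1" difference, hence the sign
t.test(G3 ~ x1, data = mat, var.equal = TRUE)

fit0 <- lm(G3 ~ x1, data = mat)
coef(fit0)["x1"] / sigma(fit0)  # d_hat = 0.2549, cf. (20)
```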

Now two additional independent variables are considered: the home-to-school travel time (incorporated as a discrete variable \(X_{2}\) with values 1 to 4 corresponding to travel times of less than 15 min, 15 to 30 min, 30 to 60 min and more than 60 min) and the number of past class failures (incorporated as a discrete variable \(X_{3}\) with values from 0 to 4, where 4 is noted for more than 3 failures). Then from fitting a model

$$\begin{aligned} Y = \beta _{0} + \beta _{1} X_{1} + \beta _{2} X_{2} + \beta _{3} X_{3} + \varepsilon \end{aligned}$$
(21)

one gets

$$\begin{aligned} {\widehat{\beta }}_{1} = 0.6212941,\quad v_{1}^{2} = 0.01642807,\quad \sqrt{{\widehat{\sigma }}^2} = 4.265321\;, \end{aligned}$$
(22)

where \(v_{1}^2\) is computed as before from \(({\varvec{X}}^{\prime } {\varvec{X}})^{-1}\), with \({\varvec{X}}\) now being the \(395\times 4\) model matrix. As noted by Groß and Möller (2023), the estimated regression coefficient \({\widehat{\beta }}_{1}\) is also identical to the group mean difference \({\overline{Y}}_{*1} - {\overline{Y}}_{*0}\) when a new variable \(Y_{*} = Y - {\widehat{\beta }}_{2}X_{2} - {\widehat{\beta }}_{3}X_{3}\) is created. However, this does not imply that the classical effect size formulas may simply be applied with Y replaced by \(Y_{*}\), since that would ignore an additionally required adjustment for the degrees of freedom, see formula (4) in Groß and Möller (2023). The (biased) effect size estimate then reads

$$\begin{aligned} {\widehat{d}} = \frac{{\widehat{\beta }}_{1}}{\sqrt{{\widehat{\sigma }}^2}} = 0.1456617 \end{aligned}$$
(23)

implying a very small net group effect size when home-to-school travel time and the number of past failures are held constant. As noted by Groß and Möller (2023), this value may also be converted to Cohen’s \(f^2\) as

$$\begin{aligned} {\widehat{f}}^2 = \frac{{{\widehat{d}}}^{2}}{m\, v_{1}^{2}} = 0.003303145\;,\quad m = n -2 -k = 391\;, \end{aligned}$$
(24)

being in line with the indication of a very small effect size. The approximation (16) gives \(c(m)\approx 1.001923\). Then, by Remark 3

$$\begin{aligned} {\widehat{d}}_{u} = 0.1453822, \quad \text {se}({\widehat{d}}_{u}) = 0.1283604\;. \end{aligned}$$
(25)
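These values may be reproduced by extending the fit from above; traveltime and failures are the corresponding column names in the UCI data set, and d_unbiased() refers to the sketch following Remark 3.

```r
fit1  <- lm(G3 ~ x1 + traveltime + failures, data = mat)
m     <- fit1$df.residual                        # n - 2 - k = 391
v1sq  <- vcov(fit1)["x1", "x1"] / sigma(fit1)^2  # element of (X'X)^{-1}
d_hat <- unname(coef(fit1)["x1"] / sigma(fit1))  # 0.1457, cf. (23)
f2    <- d_hat^2 / (m * v1sq)                    # 0.0033, cf. (24)
d_unbiased(d_hat, v1sq, m)                       # cf. (25)
```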

The usual approximate normal \(95\%\) confidence interval \({\widehat{d}}_{u} \pm 1.96 \,\text {se}({\widehat{d}}_{u})\) reads

$$\begin{aligned}{}[-0.1062043, \, 0.3969686]\;. \end{aligned}$$
(26)

To obtain a 95% confidence interval by the inversion principle, the cumulative distribution function \(F(\tau )\) of the \(t(m, \tau )\) distribution with \(m=391\) is evaluated at \({\widehat{d}}/\sqrt{v_{1}^{2}} = 1.136455\). Then \(F(\tau _{1}) = 0.975\) for \(\tau _{1} = -0.8258511\) and \(F(\tau _{2}) = 0.025\) for \(\tau _{2} = 3.0973105\). Remark 4 yields

$$\begin{aligned}{}[-0.1058510, \, 0.3969886] \end{aligned}$$
(27)

as the corresponding \(95\%\) confidence interval for d, quite similar to the above.
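With the helper d_confint() from the sketch at the end of Sect. 2, this interval is obtained directly:

```r
d_confint(d_hat, v1sq, m, alpha = 0.05)  # approx. [-0.1059, 0.3970], cf. (27)
```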

4 Conclusion

As illustrated above, a linear regression model may be applied to obtain the net effect size of a group difference with respect to a response variable of interest when other variables are held constant. The effect sizes associated with the incorporated additional variables themselves, however, naturally remain unaccounted for. Nonetheless, the above remarks show that some classical results on Cohen’s d carry over to a more general regression framework by (a) applying the estimator from (6), (b) replacing the number of degrees of freedom \(n-2\) by \(n-2-k\), with k being the number of additional independent variables, and (c) replacing the quantity \(2/ {\widetilde{n}} = (n_{0} + n_{1})/(n_{0} \cdot n_{1})\) by \(v_{1}^2\), the latter being the diagonal element of the matrix \(({\varvec{X}}^{\prime } {\varvec{X}})^{-1}\) corresponding to the group regression coefficient \(\beta _{1}\). The results fit between the classical effect size measure for the (unconditional) difference between two groups and the effect size measure \(f^2\) usually considered within an even more general regression context.