1 Introduction

When applying statistical hypothesis testing to data, it is often recommended not only to report the corresponding p value, but also to provide a measure of the effect associated with a possible rejection of the null hypothesis, see e.g. Wilkinson [18]. Such a measure may be useful when sample sizes are to be fixed during the planning phase of a study, or when the relevance of an actual rejection is to be assessed in the presence of large sample sizes. Effect size measures are strongly related to power analysis as carried out in the seminal book by Cohen [3].

A widely used measure is the so-called Cohen’s d, see also Hedges [12] and Kraemer [13], which is an effect size measure for the two-sample t test with equal variances. Consider independent samples of sizes \(n_{1}\) and \(n_{2}\) of a statistical variable y in two groups such that y follows a normal distribution with expectation \(\mu _{1}\) and variance \(\sigma ^2\) in group 1 and expectation \(\mu _{2}\) and the same variance \(\sigma ^2\) in group 2. Let t denote the usual two-sample test statistic for the null hypothesis \(H_{0}: \mu _{1} = \mu _{2}\) versus the alternative \(H_{1}: \mu _{1} \not = \mu _{2}\). As a measure for the size of an effect, Cohen [3, p. 66ff] considers the absolute value of

$$\begin{aligned} d = \frac{{\overline{y}}_{1}-{\overline{y}}_{2}}{\sqrt{\frac{s_{1}^2 + s_{2}^2}{n_{1} + n_{2} -2}}}\; , \end{aligned}$$
(1)

where \({\overline{y}}_{j}\) is the sample mean in group j, \(j=1,2\), and \(s_{j}^2 =\sum _{i} (y_{i} - {\overline{y}}_{j})^2\), with summation carried out over all observations from group j. The effect size d is related to the test statistic t by the formula

$$\begin{aligned} d = t \sqrt{\frac{n_{1}+ n_{2}}{n_{1} n_{2}}}\; , \end{aligned}$$
(2)

see (2.5.3) in Cohen [3]. According to Cohen, values \(|d|=0.2\), \(|d|=0.5\) and \(|d|=0.8\) indicate a small, medium and large effect, respectively.
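
As an illustration, the following minimal R sketch computes d according to (1) for two simulated samples and confirms relation (2) via the built-in equal-variance two-sample t test; the data and sample sizes are purely illustrative.

```r
## Minimal sketch of (1) and (2) with simulated data (illustrative only)
set.seed(1)
y1 <- rnorm(30, mean = 0.5)   # group 1
y2 <- rnorm(40, mean = 0.0)   # group 2
n1 <- length(y1); n2 <- length(y2)

## Cohen's d according to (1): s1sq and s2sq are sums of squared deviations
s1sq <- sum((y1 - mean(y1))^2)
s2sq <- sum((y2 - mean(y2))^2)
d <- (mean(y1) - mean(y2)) / sqrt((s1sq + s2sq) / (n1 + n2 - 2))

## Relation (2): d equals t times sqrt((n1 + n2)/(n1 * n2))
t_stat <- t.test(y1, y2, var.equal = TRUE)$statistic
all.equal(d, unname(t_stat) * sqrt((n1 + n2) / (n1 * n2)))   # TRUE
```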

It may also be of interest to have a corresponding measure when the variable y depends on further independent variables. In his Chapter 9, Cohen [3] deals with such a multiple regression situation and discusses the effect size measure \(f^2\) at length, as will be further explicated in Section 4.

However, a measure analogous to d is rarely found; see Wilson [19, Sect. 3.14] and Lipsey and Wilson [14] for such a proposal. Nonetheless, it may be of particular interest to have comparable effect size measures for the very same grouped variable y under different sets of additional independent variables. This is exemplified in Section 5. In the following, we introduce such a measure as a generalization of d by considering the linear regression model

$$\begin{aligned} y = \beta _{0}+ \beta _{1} z + \beta _{2} x_{1} + \cdots + \beta _{w+1} x_{w} +\varepsilon \; , \end{aligned}$$
(3)

where z takes the value \(z_{i} = 0\) if the corresponding observation \(y_{i}\) of the dependent variable y belongs to group 1 and \(z_{i} = 1\) if \(y_{i}\) belongs to group 2, \(i= 1,\ldots , n_{1} + n_{2}\). It is assumed that there are w independent variables \(x_{1}, \ldots , x_{w}\). The error variable \(\varepsilon \) is assumed to follow a normal distribution with expectation 0 and variance \(\sigma ^2\).

As will be shown in the following Sections 2, 3, and 4, a natural generalization of Cohen’s d is given by

$$\begin{aligned} d_{*} = \frac{\overline{y_{*}}_{1}-\overline{y_{*}}_{2}}{\sqrt{\frac{s_{*1}^2 + s_{*2}^2}{n_{1} + n_{2} - 2 - w}}}, \quad s_{*j}^2 = \sum _{i} (y_{*i} - \overline{y_{*}}_{j})^2, \quad j=1,2\; , \end{aligned}$$
(4)

where \(y_{*} = y - {\widehat{\beta }}_{2} x_{1} - \cdots - {\widehat{\beta }}_{w+1} x_{w}\) is the dependent variable adjusted for the independent variables. The \({\widehat{\beta }}_{k}\) are the ordinary least squares estimates of the regression coefficients \(\beta _{k}\), \(k=2,\ldots , w+1\) in model (3). In case \(w=0\), the adjusted \(y_{*}\) coincides with the original y, so that (4) reduces to (1) and therefore can be seen as a natural generalization of Cohen’s d.
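
In R, \(d_{*}\) may be computed along the following lines; this is a minimal sketch assuming simulated data with a group dummy z and \(w=2\) covariates, where all variable names, sample sizes and coefficients are illustrative.

```r
## Sketch of (4) with simulated data and w = 2 covariates (illustrative only)
set.seed(2)
n1 <- 50; n2 <- 60; n <- n1 + n2; w <- 2
z  <- rep(0:1, times = c(n1, n2))
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.5 * z + 0.3 * x1 - 0.2 * x2 + rnorm(n)

## fit model (3) and adjust y for the covariates only
fit    <- lm(y ~ z + x1 + x2)
y_star <- y - coef(fit)["x1"] * x1 - coef(fit)["x2"] * x2

## d_star according to (4)
m1  <- mean(y_star[z == 0]); m2 <- mean(y_star[z == 1])
ss1 <- sum((y_star[z == 0] - m1)^2)
ss2 <- sum((y_star[z == 1] - m2)^2)
d_star <- (m1 - m2) / sqrt((ss1 + ss2) / (n - 2 - w))
d_star
```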

2 Partitioned Linear Regression

Let \(n= n_{1} + n_{2}\) be the total sample size. The above model (3) may also be written in vector-matrix notation as

$$\begin{aligned} y = X_{1} \delta _{1} + X_{2} \delta _{2} + \varepsilon \; , \end{aligned}$$
(5)

where now y represents the \(n \times 1\) vector of observations of the dependent variable. Without loss of generality, it is assumed that the first \(n_{1}\) observations belong to group 1, while the last \(n_{2}\) observations belong to group 2. By introducing the notation \(1_{m}\) for an \(m\times 1\) vector of ones, the \(n\times 2\) matrix \(X_{1}\) and the corresponding \(2\times 1\) parameter vector \(\delta _{1}\) may be written as

$$\begin{aligned} X_{1} = \begin{pmatrix} 1_{n_{1}} & 0\\ 1_{n_{2}} & 1_{n_{2}} \end{pmatrix} \quad \text {and}\quad \delta _{1} = \begin{pmatrix} \beta _{0}\\ \beta _{1} \end{pmatrix}\; . \end{aligned}$$
(6)

The \(n\times w\) matrix \(X_{2}\) contains the observations of the independent variables with corresponding regression coefficients \(\delta _{2}^{T} = (\beta _{2}, \ldots , \beta _{w+1})\), where the T superscript denotes transposition. The \(n\times 1\) random vector \(\varepsilon \) is assumed to follow a multivariate normal distribution with expectation vector 0 and variance-covariance matrix \(\sigma ^2 I_{n}\), where \(I_{n}\) stands for the \(n\times n\) identity matrix. It is assumed that the \(n\times (2+w)\) model matrix \((X_{1}, X_{2})\) has full column rank \(2 +w\). Equation (5) represents a partitioned linear regression model as considered e.g. by Fiebig et al. [7]. Generalizations and further properties are investigated by Puntanen [16], Groß and Puntanen [10, 11], and Ding [5], among others.

Under model (5), the ordinary least squares estimator for the parameter vector \((\delta _{1}^{T}, \delta _{2}^{T})\) is given by

$$\begin{aligned} \begin{pmatrix} {\widehat{\delta }}_{1}\\ {\widehat{\delta }}_{2} \end{pmatrix} = (X^{T} X)^{-1} X^{T} y, \quad X=(X_{1}, X_{2})\; . \end{aligned}$$
(7)

The Frisch–Waugh–Lovell theorem, see Fiebig et al. [7], Lovell [15], and Frisch and Waugh [8], states that

$$\begin{aligned} {\widehat{\delta }}_{2} = (X_{2}^{T} M_{1} X_{2})^{-1} X_{2}^{T} M_{1} y,\quad M_{1} = I_{n}- X_{1} (X_{1}^{T} X_{1})^{-1} X_{1}^{T}\; . \end{aligned}$$
(8)
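
A small numerical check of (8) may be carried out in R; the following sketch uses simulated data and compares \({\widehat{\delta }}_{2}\) from (8) with the corresponding part of the full least squares solution (7).

```r
## Numerical check of the Frisch-Waugh-Lovell result (8), simulated data
set.seed(3)
n1 <- 40; n2 <- 50; n <- n1 + n2
X1 <- cbind(1, rep(0:1, times = c(n1, n2)))   # intercept and dummy as in (6)
X2 <- cbind(rnorm(n), rnorm(n))               # w = 2 covariates
y  <- X1 %*% c(1, 0.5) + X2 %*% c(0.3, -0.2) + rnorm(n)

M1     <- diag(n) - X1 %*% solve(t(X1) %*% X1) %*% t(X1)
delta2 <- solve(t(X2) %*% M1 %*% X2, t(X2) %*% M1 %*% y)   # right-hand side of (8)

X    <- cbind(X1, X2)
full <- solve(t(X) %*% X, t(X) %*% y)                      # full solution (7)
all.equal(c(delta2), c(full[3:4]))                         # TRUE
```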

For the specific choice (6), the matrix \(M_{1}\) becomes

$$\begin{aligned} M_{1} = \begin{pmatrix} C_{1} & 0\\ 0 & C_{2} \end{pmatrix},\quad C_{j} = I_{n_{j}} - n_{j}^{-1} 1_{n_{j}} 1_{n_{j}}^{T},\quad j=1,2\; . \end{aligned}$$
(9)

The following result is not restricted to the case (6) but remains valid in situations where the matrix \(X_{1}\) corresponds to an arbitrary set of v independent variables such that the assumptions of (5) are satisfied.

Theorem 1

Under the partitioned linear regression model (5),

$$\begin{aligned} {\widehat{\delta }}_{1} = (X_{1}^{T}X_{1})^{-1} X_{1}^{T} (y - X_{2} {\widehat{\delta }}_{2}) \end{aligned}$$
(10)

is the ordinary least squares estimator of \(\delta _{1}\).

A proof is given in the appendix. Theorem 1 means that if \({\widehat{\delta }}_{2}\) is known (e.g. computed by (8)), then the remaining parameters \(\delta _{1}\) can be estimated by regressing the adjusted

$$\begin{aligned} y_{*} = y - X_{2} {\widehat{\delta }}_{2} \end{aligned}$$
(11)

on the remaining \(X_{1}\); this procedure yields exactly the same estimate of \(\delta _{1}\) as in (7).

Theorem 2

Under the partitioned linear regression model (5) and (6),

$$\begin{aligned} {\widehat{\sigma }}^2 &= (y - X_{2} {\widehat{\delta }}_{2})^{T} M_{1} (y - X_{2} {\widehat{\delta }}_{2})/(n-2-w) \end{aligned}$$
(12)
$$\begin{aligned} &= (s_{*1}^{2} + s_{*2}^{2})/(n-2-w) \end{aligned}$$
(13)

is an unbiased estimator for \(\sigma ^2\).

As a matter of fact, \({\widehat{\sigma }}^2\) coincides with the usual estimator for \(\sigma ^2\) in model (5). Identity (13) follows immediately from (9) when the above \(y_{*}\) is partitioned into two vectors of length \(n_{1}\) and \(n_{2}\), respectively.
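
Both theorems are easily verified numerically. The following R sketch (simulated data, illustrative names) checks that regressing the adjusted \(y_{*}\) on the group dummy alone reproduces the estimates of \(\beta _{0}\) and \(\beta _{1}\) from the full fit, and that (13) reproduces the residual variance estimate reported by lm.

```r
## Numerical check of Theorems 1 and 2 with simulated data (illustrative only)
set.seed(4)
n1 <- 40; n2 <- 50; n <- n1 + n2; w <- 2
z  <- rep(0:1, times = c(n1, n2))
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.5 * z + 0.3 * x1 - 0.2 * x2 + rnorm(n)

full   <- lm(y ~ z + x1 + x2)
y_star <- y - coef(full)["x1"] * x1 - coef(full)["x2"] * x2   # adjusted y, cf. (11)

## Theorem 1: regression of y_star on the dummy alone gives the same delta_1
part <- lm(y_star ~ z)
all.equal(unname(coef(part)), unname(coef(full)[c("(Intercept)", "z")]))   # TRUE

## Theorem 2: within-group sums of squares of y_star, divided by n - 2 - w,
## reproduce the residual variance estimate of the full fit, cf. (13)
ss <- tapply(y_star, z, function(v) sum((v - mean(v))^2))
all.equal(sum(ss) / (n - 2 - w), summary(full)$sigma^2)                    # TRUE
```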

3 Testing for a group effect

From Theorem 1 with \(X_{1}\) from (6), it follows that

$$\begin{aligned} {\widehat{\delta }}_{1} = \begin{pmatrix} {\widehat{\beta }}_{0}\\ {\widehat{\beta }}_{1} \end{pmatrix}= \begin{pmatrix} n_{1}^{-1} 1_{n_{1}}^{T} & 0\\ - n_{1}^{-1} 1_{n_{1}}^{T} & n_{2}^{-1} 1_{n_{2}}^{T} \end{pmatrix} y_{*} = \begin{pmatrix} \overline{y_{*}}_{1}\\ \overline{y_{*}}_{2} - \overline{y_{*}}_{1} \end{pmatrix}\; . \end{aligned}$$
(14)

Hence, it is seen that \(|d_{*}|\) from (4) is identical to

$$\begin{aligned} |d_{*}|= \frac{|{\widehat{\beta }}_{1}|}{{\widehat{\sigma }}} \end{aligned}$$
(15)

with \({\widehat{\sigma }}\) being the square root of \({\widehat{\sigma }}^2\) from Theorem 2. The statistic \(d_{*}\) is closely related to the test statistic \(t_{*}\) (defined below) for the null hypothesis \(H_{0}: \beta _{1} = 0\) in model (5).

Theorem 3

Under the partitioned linear regression model (5) and (6) let \(M_{2} = I_{n} - P_{2}\), \(P_{2} = X_{2} (X_{2}^{T} X_{2})^{-1} X_{2}^{T}\), and let \(\gamma \) be the lower-right element of the \(2\times 2\) matrix \((X_{1}^{T} M_{2} X_{1})^{-1}\). Then, the statistic

$$\begin{aligned} t_{*} = d_{*} /\sqrt{\gamma } \end{aligned}$$
(16)

follows a central t distribution with \(n - 2-w\) degrees of freedom, provided \(\beta _{1} =0\).

In the above theorem, \(\gamma \) is the scaled variance of \({\widehat{\beta }}_{1}\), i.e. \(\text {Var}({\widehat{\beta }}_{1}) = \sigma ^2 \gamma \). See the proof of Theorem 3 in the appendix. The standard error of \({\widehat{\beta }}_{1}\) is thus \(\text {se}({\widehat{\beta }}_{1}) = {\widehat{\sigma }} \sqrt{\gamma }\) with \({\widehat{\sigma }}\) being the square root of \({\widehat{\sigma }}^2\) from Theorem 2.
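
The quantities appearing in Theorem 3 are readily computed in R; the following sketch (simulated data as before) obtains \(\gamma \) as the lower-right element of \((X_{1}^{T} M_{2} X_{1})^{-1}\) and confirms that \(|t_{*}| = |d_{*}|/\sqrt{\gamma }\) coincides with the absolute t value reported by summary(lm) for the dummy, and that \({\widehat{\sigma }}\sqrt{\gamma }\) is its standard error.

```r
## Numerical check of Theorem 3 with simulated data (illustrative only)
set.seed(5)
n1 <- 40; n2 <- 50; n <- n1 + n2; w <- 2
z  <- rep(0:1, times = c(n1, n2))
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.5 * z + 0.3 * x1 - 0.2 * x2 + rnorm(n)

fit    <- lm(y ~ z + x1 + x2)
y_star <- y - coef(fit)["x1"] * x1 - coef(fit)["x2"] * x2
ss     <- tapply(y_star, z, function(v) sum((v - mean(v))^2))
d_star <- (mean(y_star[z == 0]) - mean(y_star[z == 1])) / sqrt(sum(ss) / (n - 2 - w))

## gamma: lower-right element of (X1' M2 X1)^{-1}
X1 <- cbind(1, z); X2 <- cbind(x1, x2)
M2 <- diag(n) - X2 %*% solve(t(X2) %*% X2) %*% t(X2)
gamma <- solve(t(X1) %*% M2 %*% X1)[2, 2]

## |t_star| equals the |t value| of the dummy (signs differ because d_star
## takes group 1 minus group 2, while the dummy codes group 2)
t_star <- d_star / sqrt(gamma)
all.equal(abs(t_star), abs(summary(fit)$coefficients["z", "t value"]))     # TRUE
all.equal(summary(fit)$sigma * sqrt(gamma),
          summary(fit)$coefficients["z", "Std. Error"])                    # TRUE
```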

Note that if \(w = 0\), then \(M_{2}=I_{n}\), and one gets

$$\begin{aligned} (X_{1}^{T} X_{1})^{-1} = \begin{pmatrix} n_{1}^{-1} & - n_{1}^{-1}\\ - n_{1}^{-1} & \gamma \end{pmatrix}, \quad \gamma = \frac{n_{1} + n_{2}}{n_{1} n_{2}}\; , \end{aligned}$$
(17)

and hence

$$\begin{aligned} d_{*} = t_{*} \sqrt{\frac{n_{1} + n_{2}}{n_{1} n_{2}}}\; , \end{aligned}$$
(18)

which is just a reformulation of (2). These considerations show that \(d_{*}\) is a natural extension of Cohen’s d in the context of additional independent variables.

4 Effect Size in Multiple Regression

In his Chapter 9, Cohen [3] discusses the effect size measure \(f^{2}\) based on the F test of a linear hypothesis. It may be applied when \(X_{1}\) comprises not only an intercept and one dummy as under model (5), but a total of u independent variables of arbitrary type. Then, it might be of interest to measure the effect size of the set of variables in \(X_{1}\) given the set in \(X_{2}\), which is Cohen’s case 1. Cohen suggests values \(f^2=0.02\), \(f^2=0.15\) and \(f^2=0.35\) for a small, medium and large effect, respectively. Since the measure \(d_{*}\) refers to one dummy (\(u=1\)), one might expect a relationship between \(d_{*}\) and the corresponding \(f^2\). As noted in the Remark below, such a relationship can indeed be specified.

The measure \(f^2\) for Cohen’s case 1 is given by

$$\begin{aligned} f^2 = F \frac{u}{v},\quad v = n-u-w- 1\; , \end{aligned}$$
(19)

where under model (5) F is the F statistic for testing the null hypothesis \(H_{0}: \beta _{1}=0\). From (9.2.3) in Cohen [3],

$$\begin{aligned} f^2 = \frac{R^2 - R_{0}^2}{1 -R^2}\; , \end{aligned}$$
(20)

where \(R^2\) is the coefficient of determination from model (5) and \(R_{0}^2\) is the coefficient of determination in the reduced model with \(\beta _{1}=0\), admitting model matrix \(X_{0}=(1_{n}, X_{2})\). If P denotes the orthogonal projector onto the column space of the model matrix of a regression model with intercept, the coefficient of determination is given by

$$\begin{aligned} R^2 = 1 - \frac{y^{T} (I_{n} - P) y}{y^{T} C y} \end{aligned}$$
(21)

with \(C= I_{n} - n^{-1} 1_{n} 1_{n}^{T}\) being the so-called centering matrix, e.g. see [9, Sect. 6.2]. From this, (20) becomes

$$\begin{aligned} f^2 = \frac{y^{T}(P - P_{0}) y}{y^{T} (I_{n} - P) y} \end{aligned}$$
(22)

with \(P=X(X^{T} X)^{-1} X^{T}\), \(X=(X_{1}, X_{2})\), and \(P_{0} = X_{0} (X_{0}^{T} X_{0})^{-1} X_{0}^{T}\). In view of \(\text {rank}(P-P_{0}) =1\) and \(\text {rank}(I_{n} - P) = n - (2 + w)\), the corresponding F statistic reads

$$\begin{aligned} F = \frac{y^{T}(P - P_{0}) y/\text {rank}(P-P_{0})}{y^{T} (I_{n} - P) y/\text {rank}(I_{n} - P)}\; . \end{aligned}$$
(23)

Then, from Theorem 3.2.1 (ii) in Christensen [2], F follows a central F distribution with 1 and \(n-2-w\) degrees of freedom, provided \(\beta _{1}=0\).

Now, it is well known and readily verified that the squared t statistic for the null hypothesis \(H_{0}: \beta _{1}=0\) is identical to the test statistic of the F test for the very same hypothesis. Thus, by combining (16) and (19), the following is true.

Remark

The identity

$$\begin{aligned} f^2 = \frac{d_{*}^2 /\gamma }{n-2-w} \end{aligned}$$
(24)

specifies the exact relationship between the effect size measures \(f^2\) and \(d_{*}\) from above.
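
The identity is easily confirmed numerically. The following R sketch (simulated data, illustrative names) verifies that the squared t value of the dummy equals the F statistic for dropping it, and that dividing by \(n-2-w\) yields \(f^2\) as defined in (20), in line with (24).

```r
## Numerical check of (24) with simulated data (illustrative only)
set.seed(6)
n1 <- 40; n2 <- 50; n <- n1 + n2; w <- 2
z  <- rep(0:1, times = c(n1, n2))
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.5 * z + 0.3 * x1 - 0.2 * x2 + rnorm(n)

fit  <- lm(y ~ z + x1 + x2)    # full model (5)
fit0 <- lm(y ~ x1 + x2)        # reduced model with beta_1 = 0

t_z <- summary(fit)$coefficients["z", "t value"]
F_z <- anova(fit0, fit)$F[2]                      # F test of H0: beta_1 = 0
all.equal(t_z^2, F_z)                             # squared t equals F

f2 <- (summary(fit)$r.squared - summary(fit0)$r.squared) /
      (1 - summary(fit)$r.squared)                # f^2 from (20)
all.equal(f2, F_z / (n - 2 - w))                  # equals (d_star^2/gamma)/(n-2-w), cf. (24)
```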

Note that in case \(w=0\), the above identity reads

$$\begin{aligned} f^2 = d^{2} \frac{n_{1} n_{2}}{n (n-2)}\; , \end{aligned}$$
(25)

which slightly differs from formula (9.3.5) in Cohen [3] and lacks some of its elegance. Since (25) should coincide with (9.3.5), this reveals a flaw in the latter formula. Formula (25) may also be verified independently by directly assuming model (5) without any additional independent variables \(X_{2}\). In most cases, actual computations with the two formulas in question differ only in a later digit after the decimal point (say the fifth or sixth), so the difference usually has no practical relevance. The correctness of (24) and (25) is additionally confirmed by applications to real data.

5 Data Example

To give a possible outline for applications and to illustrate the previous formulas, we employ a data set available from the UCI machine learning repository, see Dua and Graff [6]. It contains data on student achievement in secondary education at two Portuguese schools, see Cortez and Silva [4]. In the following, computations are carried out with the statistical software R [17].

Fig. 1 Frequency of final results (variable G3) in Portuguese language of \(n=649\) students

As the dependent variable y, we consider the final grade in Portuguese language (variable G3), with integer values ranging between 0 and 20, of \(n_{1} = 383\) female and \(n_{2} = 266\) male students. The dummy variable z takes the value 0 for female and 1 for male students. As also indicated by Figure 1, female students perform better, with an average of \({\overline{y}}_{1} = 12.25326\) compared to \({\overline{y}}_{2} = 11.40602\) for male students. The corresponding equal-variance two-sample test statistic is \(t=3.310938\) with p value 0.0009815287. Although this indicates strong significance, the corresponding effect size from (2) is \(d = 0.264261\), indicating only a slightly more than small effect. This value may also be obtained with the function cohens_d from the R package effectsize, see Ben-Shachar et al. [1].
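
The computations above can be reproduced along the following lines, assuming the Portuguese-language data have been downloaded from the repository as student-por.csv (semicolon-separated, with columns sex and G3 as in the original files); file and column names refer to the repository version and are stated here as assumptions.

```r
## Reproducing d for the data example; assumes the UCI file student-por.csv
## (semicolon-separated) in the working directory
por <- read.table("student-por.csv", sep = ";", header = TRUE)
y   <- por$G3                          # final grade in Portuguese language
z   <- as.numeric(por$sex == "M")      # dummy: 0 = female, 1 = male
n1  <- sum(z == 0); n2 <- sum(z == 1)

tt <- t.test(y[z == 0], y[z == 1], var.equal = TRUE)      # equal-variance t test
d  <- unname(tt$statistic) * sqrt((n1 + n2) / (n1 * n2))  # relation (2)

## the same value via the effectsize package [1]
# library(effectsize)
# cohens_d(y[z == 0], y[z == 1])
```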

As additional independent variables, we consider the father's education (Fedu) as \(x_{1}\) and the travel time from home to school (traveltime) as \(x_{2}\). Both variables are measured on an ordinal scale, with integer values ranging from 0 to 4 and from 1 to 4, respectively, and are included as quantitative variables in our regression approach, implying \(w=2\).

Table 1 Extract from the R output obtained by fitting the complete model (5) with the function lm, where the variable sex is included as a factor, so that ZM represents the dummy z

Least squares estimates of the coefficients with corresponding t statistic values are given in Table 1. The intercept estimate \({\widehat{\beta }}_{0}=11.41385\) is the average of the adjusted final grade \(y_{*}\) of female students, while the dummy variable estimate \({\widehat{\beta }}_{1}=-0.9406209\) is the difference between the average of \(y_{*}\) in the male group and that in the female group, as given in (14). It is also seen that higher father's education is associated with better grades (positive \({\widehat{\beta }}_{2}\)), while longer travel times to school are associated with lower grades (negative \({\widehat{\beta }}_{3}\)), the other variables being held constant. The coefficient of determination from this model is \(R^2 = 0.07238847\), while for the reduced model (omitting sex) it is \(R_{0}^2 = 0.05207054\). Then, \(f^2\) from (20) is \(f^2 = 0.0219035\), implying a slightly more than small effect concerning the difference between female and male final grades, ceteris paribus.
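
Table 1 and the reported coefficients of determination may be obtained as follows; the renaming of sex to Z is an assumption made only to match the label ZM in Table 1, and the values quoted in the comments are those reported in the text.

```r
## Fitting the complete model (5) and computing f^2 from (20)
por   <- read.table("student-por.csv", sep = ";", header = TRUE)
por$Z <- factor(por$sex, levels = c("F", "M"))   # reference level F, dummy "ZM"

fit  <- lm(G3 ~ Z + Fedu + traveltime, data = por)   # complete model (5)
fit0 <- lm(G3 ~ Fedu + traveltime, data = por)       # reduced model without sex
summary(fit)$coefficients                            # cf. Table 1

R2  <- summary(fit)$r.squared     # reported: 0.07238847
R02 <- summary(fit0)$r.squared    # reported: 0.05207054
f2  <- (R2 - R02) / (1 - R2)      # reported: 0.0219035, cf. (20)
```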

The relationship (24) may also be used to infer \(d_{*}\) from \(f^2\). For this, we compute

$$\begin{aligned} (X_{1}^{T} M_{2} X_{1})^{-1} = \begin{pmatrix} 0.019062813 & -0.001591796\\ -0.001591796 & 0.006438624 \end{pmatrix}\; , \end{aligned}$$
(26)

the lower-right element being the scaled variance \(\gamma \) of \({\widehat{\beta }}_{1}\). With \({\widehat{\sigma }} = 3.118756\), it follows that \(d_{*}= 0.3016013\). This indicates a slightly stronger effect when the variables \(x_{1}\) and \(x_{2}\) are held constant than the value \(d = 0.264261\) obtained before without considering any additional independent variables. Alternatively, \(d_{*}\) may be computed by either of the formulas (15) or (4), yielding the very same absolute value.
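
These last steps may be sketched in R as follows, again assuming the file and column names of the repository version; the values quoted in the comments are those reported above.

```r
## gamma, sigma-hat and d_star for the data example
por <- read.table("student-por.csv", sep = ";", header = TRUE)
y   <- por$G3
z   <- as.numeric(por$sex == "M")
X1  <- cbind(1, z)
X2  <- cbind(por$Fedu, por$traveltime)
n   <- length(y); w <- ncol(X2)

M2 <- diag(n) - X2 %*% solve(t(X2) %*% X2) %*% t(X2)
solve(t(X1) %*% M2 %*% X1)                       # matrix (26)
gamma <- solve(t(X1) %*% M2 %*% X1)[2, 2]        # scaled variance of beta_1-hat

fit       <- lm(y ~ z + X2)
sigma_hat <- summary(fit)$sigma                  # reported: 3.118756
d_star    <- unname(abs(coef(fit)["z"])) / sigma_hat   # |d_star| via (15), reported: 0.3016013

## the same value from f^2 via (24)
f2 <- 0.0219035                                  # value reported above
sqrt(f2 * gamma * (n - 2 - w))
```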