Introduction

Researchers in psychology nowadays are encouraged to report effect-sizes, replicate studies, and think meta-analytically (e.g., American Psychological Association, 2010, p. 34; Cooper & Patall, 2009; Cumming, 2013; Smithson, 2003, pp. 12-16). These reforms are laudable and long overdue. Nonetheless, they open up some largely undebated issues regarding moderation (and replication) of effect-sizes across samples and across studies. If these issues are ignored, then researchers may fall prey to difficulties in establishing when an effect has been moderated in a single study, when two or more studies can be said to be “replications” of one another, or whether a collection of studies’ effect-sizes is heterogeneous or not.

These difficulties arise because under commonly-occurring conditions, popular alternative effect-size measures in ANOVA and multiple regression are moderated differently across independent samples. Effects may appear to be unmoderated according to one effect-size measure but not according to another, or may even be moderated in opposite directions. Moderator effects are bread-and-butter in many areas of psychology, so differential effect-size moderation has important ramifications for research practice, reporting, replication, and meta-analysis. In this paper we address the following questions:

  1. Under what conditions are alternative appropriate effect-size measures moderated differently?

  2. How can we detect such differences?

  3. How can we interpret differential effect-size moderation?

We begin by observing that when means are compared between two independent samples, Cohen’s d is the conventional effect-size employed, but in principle, either Cohen’s d or η 2 may be used. For a constant total sample size, η 2 is partly determined by the inequality between cell sizes whereas Cohen’s d is not. We describe the conditions, and provide examples, where one measure is moderated but the other is not, and where they are moderated in opposite directions.

We then turn to a well-documented but often ignored distinction between moderation of the “degree” and moderation of the “form” of the relationship between an independent variable (IV) and a dependent variable (DV) (Arnold, 1982; DeShon & Alexander, 1996; Zedeck, 1971). In regression, “degree” moderation refers to moderation of association (i.e., correlation) and “form” moderation to moderation of slopes. In ANOVA, “degree” can refer to Cohen’s d and “form” to the differences between means. In another terminology, degree and form correspond to the moderation of standardized (scale-independent) and unstandardized (scale-based) effect-size measures, respectively.

Researchers may legitimately be interested only in moderation of degree or in moderation of form, or both. Arnold (1982) refers to debates among industrial and educational psychologists regarding “test fairness,” where “fairness” has at least two meanings. In one sense, a test is fair for all subpopulations (e.g., males and females) if its validity (the correlation between the test score and a criterion variable) is the same for these subpopulations. In another sense, a test is fair if a unit change in the criterion yields the same expected change in test score for all subpopulations. The first sense refers to degree, and the second to form. Arnold’s point is that each of these notions of fairness addresses a different kind of moderation.

However, the default assumption by researchers is that degree and form are moderated in the same way. In linear regression (and ANOVA), homoscedasticity (or homogeneity of variance) guarantees that this will be true. However, heteroscedasticity (or heterogeneity of variance, HeV) forces the two kinds of moderator effects to differ from one another. Importantly, HeV in the independent variable (IV) can do this as well as HeV in the dependent variable (DV). Researchers seldom test for HeV in the IV, so most probably are unaware of this manifestation of the phenomenon. We describe the conditions under which form is moderated when degree is not (and vice-versa), and when they are moderated in opposite directions. We also discuss the important but often overlooked role of moderated scale reliability in generating HeV. This part of our paper overlaps with Smithson’s (2012) treatment. However, that paper restricted its discussion to simple regression and ANOVA, i.e., the moderation of the effect of just one predictor. Our treatment extends the scope to include multiway ANOVA and multiple regression.

In multiway ANOVA and multiple regression, another important consideration about moderation effects needs to be taken into account, namely when a moderator variable affects more than one relationship between variables. We show that three popular alternative effect-size measures, semi-partial η 2 (a.k.a. semi-partial R 2), partial η 2 (a.k.a. partial R 2), and the standardized regression coefficient, may be moderated differently when other moderator effects are present. Importantly, the relevant moderator effects are not limited to moderations of the relationships between other predictors and the DV. Instead, they also include moderation of the relationships between other predictors and the IV whose relationship with the DV is under consideration, i.e., moderation of that IV’s tolerance. Again, most researchers seem unaware or heedless of these phenomena. Indeed, to our knowledge, no systematic or comprehensive account of this issue exists in the published literature.

We then briefly review techniques for detecting and dealing with differential moderation of alternative effect-size measures. We reprise Smithson’s (2012) approach to dealing with heteroscedasticity effects, and we review methods for evaluating moderation of tolerance and multiple R 2 in regression. Finally, we discuss implications for research practice, reporting, replication, and meta-analysis.

Effect-sizes in ANOVA and multiple regression

ANOVA and linear multiple regression offer researchers a choice among effect-size measures. The most popular effect-size measures in ANOVA are differences between means (in the scale of the raw data), Cohen’s d, partial η 2, and semi-partial η 2. The most popular effect-size measures in regression are semi-partial and partial correlations, and unstandardized and standardized regression coefficients. We briefly review these alternative measures and their interrelationships before proceeding to discuss moderator effects.

ANOVA

When means are compared between two independent samples, Cohen’s d is the conventional “scale-free” effect-size employed. However, in principle, either Cohen’s d or η 2 may be used. The formula linking the two measures can be written as

$$ {\eta}^2=\frac{t^2}{t^2+N-2}=\frac{d^2}{d^2+4\left(\overline{n}-1\right)/\tilde{n}}, $$
(1)

where t is the t-statistic, N is the total sample size, n 1 and n 2 are the number of observations in each sample, \( \overline{n} \) is the arithmetic mean of n 1 and n 2 , and ñ = 2/(1/n 1 + 1/n 2) is their harmonic mean. Alternative forms of this equation are presented by McGrath and Meyer (2006), in their informative discussion of the differences between the correlation coefficient and d. We have chosen this version because of the role played by the ratio between the arithmetic and harmonic means in the right-hand part of Eq. 1.

In ANOVA for multi-way designs, two popular effect-size measures for main effects are semi-partial and partial η 2 (η s 2 and η p 2, respectively). These are identical to squared semi-partial and partial correlations in linear regression (see below). Useful formulas for the η 2 measures are as follows:

$$ \begin{array}{c} {\eta}_s^2=\dfrac{SS_j}{SS_e+{\displaystyle \sum_i SS_i}}=\dfrac{SS_j}{SS_T} \\[8pt] {\eta}_p^2=\dfrac{SS_j}{SS_e+SS_j}=\dfrac{SS_j}{SS_T-{\displaystyle \sum_{i\ne j}SS_i}} \end{array} $$
(2)

The SS j term is the sum of squares for the j th effect, SS T is the total sum of squares, and SS e is the error sum of squares. The pairs of formulas suggest two ways of understanding the difference between η s 2 and η p 2.

The middle pair of expressions in Eq. 2 is the “ANOVA” view, in which η s 2 measures SS j against the sums of squares for all effects plus SS e , whereas η p 2 measures SS j against itself plus SS e . Some methodologists (e.g., Tabachnick & Fidell, 2013, pp. 54-55) claim that η s 2 is “flawed” because the j th effect-size may appear smaller in more complex designs with more effects. In any case, it is best to consider η s 2 and η p 2 as addressing different questions about effects.

The right-hand pair of expressions in Eq. 2, with the SS T terms, is what might be called the “regression” view (e.g., Tabachnick & Fidell, 2013, p. 145). Here, η s 2 is viewed as the unique proportion of total variance explained by the j th effect, while η p 2 is the proportion of the variance left over, after the other effects have contributed their shares, explained by the j th effect. A convenient summary of this distinction is

$$ {\eta}_p^2=\frac{\eta_s^2}{1-{\eta}_{s(j)}^2}, $$
(3)

where η 2 s(j) denotes the proportion of variance explained by all of the effects except for the j th effect. Equation 3 also makes clear the well-known inequality that η s 2 ≤ η p 2, with equality only when the other effects explain no variance.
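
As a concrete illustration of Eqs. 2 and 3, the following R sketch computes η s 2 and η p 2 from a set of hypothetical sums of squares (the numbers and object names are ours, chosen only to display the computation).

```r
# Hypothetical sums of squares for a two-way design: effects A, B, A*B, plus error
SS <- c(A = 30, B = 10, AB = 5, error = 55)
SS_T <- sum(SS)                                          # total sum of squares

eta2_s <- SS[c("A", "B", "AB")] / SS_T                   # semi-partial eta-squared (Eq. 2)
eta2_p <- SS[c("A", "B", "AB")] /
          (SS["error"] + SS[c("A", "B", "AB")])          # partial eta-squared (Eq. 2)

# Equivalent route via Eq. 3: eta2_p = eta2_s / (1 - eta2_s(j)),
# where eta2_s(j) is the variance explained by all effects other than j
eta2_sj <- (sum(SS[c("A", "B", "AB")]) - SS[c("A", "B", "AB")]) / SS_T
all.equal(unname(eta2_p), unname(eta2_s / (1 - eta2_sj)))   # TRUE
```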

Multiple regression

In multiple regression, in addition to squared semi-partial and partial correlations, standardized regression coefficients often are used as indicators of the relative importance of predictors. This practice has long attracted criticism (e.g., Budescu, 1993). However, we include it in our discussion here, both because of its popularity and because it can be interpreted as an effect-size measure directly related to semi-partial correlations (e.g., Darlington, 1990, p.58).

To fix ideas, we need some alternative and additional notation. We will consider the effect of a predictor, X, on a dependent variable, Y, in the j th sample (for j = 1, 2, …, J). Let A denote the collection of all predictors other than X in the regression model. Let β xj , R sxj and R pxj be the standardized regression coefficient, semi-partial and partial correlation (respectively) for X in the j th sample. Finally, let R Ayj denote the multiple correlation coefficient for the linear regression model predicting Y and containing all of the predictors in set A in the j th sample, and R Axj denote the multiple correlation coefficient for the linear regression model predicting X from all of the predictors in set A in the j th sample.

The relationships among the standardized regression coefficient, semi-partial correlation, and partial correlation may be expressed as follows. First, we may rewrite Eq. 3 as

$$ {R}_{pxj}=\frac{R_{sxj}}{\sqrt{1-{R}_{Ayj}^2}}. $$
(4)

It is also pertinent that the semi-partial correlation for a predictor is the correlation between the dependent variable and the residual of the predictor from a regression model predicting it from the other predictors in the model. The partial correlation, on the other hand, is the correlation between the residuals of both the dependent variable and the predictor, i.e., with the other predictors partialled out from both variables. In some statistical packages (e.g., SPSS), the F-test of significance, which is based on the squared partial correlation, is confusingly paired with output that reports the squared semi-partial correlation.

Second, the standardized regression coefficient is a function of the semi-partial correlation and “tolerance”, 1 − R 2 Axj , i.e.:

$$ {\beta}_{xj}=\frac{R_{sxj}}{\sqrt{1-{R}_{Axj}^2}}. $$
(5)

Thus, as suggested earlier, the standardized regression coefficient also is an effect-size measure. It compares the semi-partial correlation for a predictor against the variation in that predictor that remains unexplained by the other predictors. The appropriate substitution from Eq. 4 yields the following relationship between the standardized regression coefficient and partial correlation:

$$ {\beta}_{xj}={R}_{pxj}\frac{\sqrt{1-{R}_{Ayj}^2}}{\sqrt{1-{R}_{Axj}^2}}. $$
(6)

As in the preceding material on ANOVA, it should be clear that these three alternative effect-sizes measure “effect-size” in ways that address different research questions.
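
The algebra in Eqs. 4–6 is easy to package as a small helper. The sketch below is our own illustration (the function name and input values are hypothetical): given a semi-partial correlation together with R Ay 2 and R Ax 2, it returns the corresponding partial correlation and standardized coefficient, and checks that Eq. 6 holds as an identity.

```r
# Convert a semi-partial correlation into the partial correlation and the
# standardized regression coefficient, given the two auxiliary R-squareds
effect_sizes <- function(R_sx, R2_Ay, R2_Ax) {
  R_px <- R_sx / sqrt(1 - R2_Ay)    # Eq. 4
  beta <- R_sx / sqrt(1 - R2_Ax)    # Eq. 5
  # Eq. 6 should hold as an identity
  stopifnot(isTRUE(all.equal(beta, R_px * sqrt(1 - R2_Ay) / sqrt(1 - R2_Ax))))
  c(semi_partial = R_sx, partial = R_px, beta = beta)
}

effect_sizes(R_sx = .30, R2_Ay = .40, R2_Ax = .25)   # hypothetical values
```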

Moderation of alternative effect-sizes

Cohen’s d versus partial η 2

Let us first examine the impact of unequal cell sizes on moderation of Cohen’s d versus η p 2. For two independent samples, suppose that d is identical for both (i.e., unmoderated). When will the same be true of η p 2? Equation 1 can be rewritten as

$$ d=\frac{\eta_p}{\sqrt{1-{\eta}_p^2}}\times \sqrt{\frac{4\left(\overline{n}-1\right)}{\tilde{n}}}. $$
(7)

Thus, non-identical moderation of these two effect-size measures occurs when the \( \left(\overline{n}-1\right)/\tilde{n} \) ratio differs between the samples. For example, suppose that sample 1 has n 11 = n 12 = 100 whereas sample 2 has n 21 = 185 and n 22 = 15. Then if d 1 = d 2 = 0.9, for sample 1 η p1 = .412 whereas for sample 2 η p2 = .232, so that the "effect-size" is moderated if we use partial eta but not if we use Cohen’s d. Also, Eq. 7 implies that for constant d, the magnitude of η p covaries negatively with the \( \left(\overline{n}-1\right)/\tilde{n} \) ratio. Given constant total sample size, this ratio increases as sample sizes become more unequal.
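
A quick way to see this numerically is to convert d to η p via Eq. 1. The short R sketch below (our own illustration) reproduces the example just given.

```r
# Partial eta from Cohen's d and the two cell sizes (Eq. 1)
eta_from_d <- function(d, n1, n2) {
  n_bar   <- (n1 + n2) / 2             # arithmetic mean cell size
  n_tilde <- 2 / (1 / n1 + 1 / n2)     # harmonic mean cell size
  sqrt(d^2 / (d^2 + 4 * (n_bar - 1) / n_tilde))
}

eta_from_d(0.9, 100, 100)   # 0.412: equal cell sizes
eta_from_d(0.9, 185,  15)   # 0.232: same d, very unequal cell sizes
```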

McGrath and Meyer (2006) discuss the difference between the correlation and d from a somewhat different standpoint, characterizing unequal sample sizes as differing "base rates." Their conclusions parallel ours, although they do not discuss moderation per se. As they point out, base-rate sensitivity implies that, for a given d, power is influenced by inequality in sample sizes, whereas for a given η p it is not. Equation 7 is consistent with the observation made by Rosnow, Rosenthal, and Rubin (2000) that power is inversely related to the \( \left(\overline{n}-1\right)/\tilde{n} \) ratio.

Can Cohen’s d and η p be moderated in opposite directions? Let \( \left(\overline{n}-1\right)/\tilde{n} \) be denoted by Q, and suppose that for sample 1 this ratio is Q, while for sample 2 the ratio is kQ, where k > 1. Now suppose that for sample 1 partial η 2 is η 2 p1 , whereas for sample 2 it is η 2 p2  = cη 2 p1 , where c < 1 so that η p2 < η p1. Then a straightforward algebraic argument shows that d 2 > d 1 iff (kc − 1)/(kc − c) > η 2 p1 , which in turn requires that kc > 1. These conditions are by no means bizarre. For instance, suppose that sample 1 has n 11 = n 12 = 25 whereas sample 2 has n 21 = 40 and n 22 = 10, so that Q 1 = 0.96 and Q 2 = 1.5. Suppose also that for sample 1 η 2 p1 =.33 whereas for sample 2 η 2 p2 =.25. Then it follows that k = 1.563 and c = 0.758, so (kc − 1)/(kc − c) = 0.431 > η 2 p1 , and therefore d 1 = 1.375 whereas d 2 = 1.414. Thus, Cohen’s d and η p 2 are moderated in opposite directions.
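
The reversal can be checked directly with Eq. 7; the following lines (again only a numerical check, with variable names that are ours) reproduce the example.

```r
# Reversal check for the example: Q1 = 0.96, Q2 = 1.5, eta2_p1 = .33, eta2_p2 = .25
d_from_eta2 <- function(eta2, Q) sqrt(eta2 / (1 - eta2) * 4 * Q)   # Eq. 7, Q = (n_bar - 1)/n_tilde

k_ratio <- 1.5 / 0.96      # ratio of the Q values, 1.563
c_ratio <- .25 / .33       # ratio of the squared partial etas, 0.758
(k_ratio * c_ratio - 1) / (k_ratio * c_ratio - c_ratio)   # 0.431, which exceeds eta2_p1 = .33 ...
d_from_eta2(.33, 0.96)     # ... so d1 = 1.375
d_from_eta2(.25, 1.50)     # is smaller than d2 = 1.414
```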

Form versus degree moderation

Suppose a linear relationship between two continuous random variables X and Y is moderated by a third variable, Z. The extent to which the correlation ρ is moderated by Z (moderation of degree) is equivalent to the extent to which the regression coefficients b y and b x (the slopes from regressing Y on X and X on Y, respectively) are moderated by Z (moderation of form) if the variance ratio σ 2 y /σ 2 x is constant over the range or states of Z. The same holds for moderation of the difference between means versus moderation of Cohen’s d. Otherwise, moderation of slopes and of correlations (or of mean differences and Cohen’s d) must diverge. Most of the literature on this issue focuses on tests for heterogeneity of variance (HeV) in Y, despite the fact that HeV in X also can render that variance ratio non-constant.

Consider the simplest case, where Z is a binary variable. For the i th category of Z,

$$ {b}_{yi}={\rho}_i\frac{\sigma_{yi}}{\sigma_{xi}}. $$
(8)

A straightforward argument shows that if the σ yi /σ xi ratio is not constant for i = 1 and i = 2 then b 1 = b 2 ⇒ ρ 1 ≠ ρ 2, and likewise ρ 1 = ρ 2 ⇒ b 1 ≠ b 2. More generally,

$$ \frac{\sigma_{y1}{\sigma}_{x2}}{\sigma_{x1}{\sigma}_{y2}}>\left(<\right)1\iff \left|\frac{b_1}{b_2}\right|>\left(<\right)\left|\frac{\rho_1}{\rho_2}\right|. $$
(9)

The condition for correlations and slopes to be moderated in opposite directions follows immediately: We have b 1 > b 2 whereas ρ 2 > ρ 1 if, when ρ 2 > ρ 1, it is also true that

$$ \frac{\sigma_{y1}{\sigma}_{x2}}{\sigma_{x1}{\sigma}_{y2}}>\frac{\rho_2}{\rho_1}. $$
(10)

The same implication holds if the inequalities are changed from > to <. Smithson (2012) argues that this condition is not unusual or extreme, and of course violations of homoscedasticity frequently occur in real data.
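
To make the condition in Eqs. 9 and 10 concrete, the following sketch uses made-up within-group standard deviations and correlations to show slopes and correlations moderated in opposite directions.

```r
# Hypothetical within-group SDs and correlations for a binary moderator Z
sd_y <- c(10, 6);  sd_x <- c(4, 5)     # group 1, group 2
rho  <- c(0.30, 0.40)                  # correlations: rho2 > rho1

b <- rho * sd_y / sd_x                 # slopes of Y on X (Eq. 8)
b                                      # 0.75 vs 0.48: b1 > b2, opposite to the correlations

# Eq. 10: the SD ratio exceeds rho2/rho1, which is what permits the reversal
(sd_y[1] * sd_x[2]) / (sd_x[1] * sd_y[2]) > rho[2] / rho[1]   # TRUE
```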

These results generalize to multiple regression, so that standardized and unstandardized regression coefficients may be moderated differently when the σ yi /σ xi ratio is not constant, because Eq. 8 becomes

$$ {b}_{yi}={\beta}_i\frac{\sigma_{yi}}{\sigma_{xi}}, $$
(11)

where β i is the standardized regression coefficient.

Moderation of reliability

It is common knowledge that the value of a sample correlation is influenced not only by the true population correlation value but also by the reliability of the scales measuring the correlated constructs. Hunter and Schmidt’s (1990) treatment of meta-analysis using correlation coefficients highlights this issue, but it is routinely ignored when researchers consider moderator effects. It is plausible that under many circumstances, scale reliability may be moderated. If so, then that may introduce artefacts into the assessment of moderator effects on correlation coefficients and other effect-size measures that are functions of correlations, such as Cohen’s d and regression coefficients.

The observed squared correlation, \( {\tilde{\rho}}^2, \) is the product of the true squared correlation and the reliabilities of the scales being correlated:

$$ {\tilde{\rho}}^2={\rho}^2{\rho}_x{\rho}_y. $$
(12)

Clearly, identical correlations in two samples may appear to be moderated because the reliabilities of one or both scales differ between the samples. It also is possible for the true correlation to be moderated in the opposite direction to the observed correlation. Letting C = ρ x ρ y , if we have C 2 = kC 1, for k > 1, and \( {\rho}_2^2=c{\rho}_1^2 \), for c < 1, then \( {\tilde{\rho}}_2^2>{\tilde{\rho}}_1^2 \) iff kc > 1. Suppose, for instance, that \( {\rho}_1^2=.33 \) for sample 1 and \( {\rho}_2^2=.25 \) for sample 2, whereas the reliabilities for the scales in sample 1 both are .7 and in sample 2 both are .9. Then c = 0.758 and k = 1.653, so kc = 1.252 and thus \( {\tilde{\rho}}_1^2=.162<{\tilde{\rho}}_2^2=.203, \) i.e., moderation in the opposite direction to that for the true correlations. We note in passing that researchers typically use Cronbach’s alpha as a lower bound estimate of population reliability, despite the fact that other reliability estimates are arguably more accurate and useful than alpha (Dunn, Baguley, & Brunsden, 2014; Sijtsma, 2009).
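
The attenuation arithmetic in Eq. 12, and the example above, can be verified in a few lines of R (the reliabilities and correlations are those quoted in the text).

```r
# True squared correlations and scale reliabilities for the two samples
rho2  <- c(.33, .25)                 # true squared correlations (sample 1 > sample 2)
relxy <- c(.7 * .7, .9 * .9)         # product of the two reliabilities, C = rho_x * rho_y

obs_rho2 <- rho2 * relxy             # observed squared correlations (Eq. 12)
obs_rho2                             # .162 < .203: observed moderation runs the other way

k <- relxy[2] / relxy[1];  cc <- rho2[2] / rho2[1]
k * cc > 1                           # TRUE: the condition for the reversal
```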

Semi-partial versus partial correlations versus standardized regression coefficient

We now turn to the three effect-size measures available in regression, two of which are also employed in ANOVA. We first need to establish when these effect-size measures have been moderated identically. It should be evident from Eqs. 3, 4, and 5 that the best way to assess moderation of these parameters between independent samples is via their ratios rather than their differences. From Eq. 3 we have

$$ {\eta}_{p1}-{\eta}_{p2}=\frac{\eta_{s1}}{\sqrt{1-{\eta}_{s(j)1}^2}}-\frac{\eta_{s2}}{\sqrt{1-{\eta}_{s(j)2}^2}}. $$
(13)

Even if η 2 s(j)1 = η 2 s(j)2 , provided they are not 0, it still is the case that η p1 − η p2 ≠ η s1 − η s2 unless η p1 − η p2 = 0. On the other hand, from Eq. 3 if η 2 s(j)1 = η 2 s(j)2 then η s1/η s2 = η p1/η p2. Equivalently, from Eq. 4 if R Ay1 = R Ay2 then R sx1/R sx2 = R px1/R px2. Finally, from Eq. 5, if R Ax1 = R Ax2 then R sx1/R sx2 = β x1/β x2. In this paper, we therefore operationalize “identical moderation” of two effect-size measures across two samples as equal ratios for both parameters. Thus, for instance, R sx1/R sx2 = β x1/β x2 is taken to mean that the semi-partial correlation and standardized regression coefficient have been identically moderated across samples 1 and 2.

Ratio comparisons provide practical guidelines for judging when effect-sizes of these kinds have been moderated identically (or replicated) between studies. For the moment, suppose we have an agreed-upon criterion for deciding when each of these effect-size measures has been moderated or not (be it a traditional significance test for their difference, an appropriate Bayes factor, or some other alternative). Then the following three propositions hold.

  1. If the multiple correlations R Ay1 and R Ay2 are unmoderated (R Ay1 = R Ay2) then partial and semi-partial correlations are moderated identically, whereas the corresponding standardized regression coefficients may be moderated differently.

  2. If the multiple correlations R Ax1 and R Ax2 are unmoderated (R Ax1 = R Ax2) then semi-partial correlations and standardized regression coefficients will be moderated identically, but partial correlations may be moderated differently.

  3. If both pairs of multiple correlations are moderated, all three effect-size measures are moderated differently from one another.

How likely is differential moderation of these alternative effect-size measures? Partial and semi-partial η 2 are very likely to be moderated differently across samples. Equations 2 and 3 reveal that, for any two experiments with identical designs, if η 2 s1  = η 2 s2 then η 2 p1  = η 2 p2 if and only if the proportion of total variance explained by all of the effects is the same in both experiments, i.e., \( {\displaystyle \sum_iS{S}_{i1}}/S{S}_{T1}={\displaystyle \sum_iS{S}_{i2}}/S{S}_{T2} \), and vice-versa. This strong constraint is seldom likely to be realized in research, even in carefully controlled experiments. A similar argument follows regarding the differential moderation of partial and semi-partial correlations for non-experimental studies involving multiple covariates.

Somewhat ironically, the magnitude of the differential moderation of alternative effect-sizes may increase with better multivariate models. That is, the larger the squared multiple correlation coefficients, the larger the discrepancy between effect-size measures. In two studies with the same multiple regression model, suppose that one predictor has R sx1 = R sx2 = .1 in both studies, so that the semi-partial correlations are perfect replicates, i.e., unmoderated. Suppose that for the other predictors in the model, R Ay1 2 = .2 and R Ay2 2 = .5. Then R px1 = .112 and R px2 = .141, so R px2/R px1 = 1.265, which indicates moderation of R px . But now suppose R Ay1 2 = .5 and R Ay2 2 = .8, so that the difference between them is the same as before but the model fits the data much better. Then R px1 = .141 and R px2 = .224, so R px2/R px1 = 1.581, a greater degree of moderation of the partial correlations. Finally, suppose that the ratio, R Ay1 2/R Ay2 2, remains the same, with R Ay1 2 = .32 and R Ay2 2 = .8. Then R px1 = .121 and R px2 = .224, so R px2/R px1 = 1.844, an even greater moderator effect.
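
The same computation can be scripted directly from Eq. 4; the sketch below (with a hypothetical helper name) reproduces the three scenarios just described.

```r
# Partial correlation from a semi-partial correlation and R^2_Ay (Eq. 4)
partial_from_semipartial <- function(R_sx, R2_Ay) R_sx / sqrt(1 - R2_Ay)

R_sx <- 0.1   # the unmoderated semi-partial correlation in both studies

# Scenario 1: R2_Ay = .2 vs .5
p1 <- partial_from_semipartial(R_sx, .2);  p2 <- partial_from_semipartial(R_sx, .5)
c(p1, p2, ratio = p2 / p1)      # .112, .141, ratio 1.265

# Scenario 2: R2_Ay = .5 vs .8 (same difference, better-fitting models)
p1 <- partial_from_semipartial(R_sx, .5);  p2 <- partial_from_semipartial(R_sx, .8)
c(p1, p2, ratio = p2 / p1)      # .141, .224, ratio 1.581

# Scenario 3: same R2_Ay ratio as scenario 1 (.32 vs .8)
p1 <- partial_from_semipartial(R_sx, .32); p2 <- partial_from_semipartial(R_sx, .8)
c(p1, p2, ratio = p2 / p1)      # .121, .224, ratio 1.844
```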

As we have seen earlier in comparisons between alternative effect-size measures, it is possible for moderation to run in opposite directions for these alternative measures. Suppose that for two independent samples, η s1 < η s2 so that η s1/η s2 = w < 1. Then from Eq. 3, if η s1 and η s2 have the same sign, η 2 p1  > η 2 p2 when \( \sqrt{1-{\eta}_{s(j)1}^2}/\sqrt{1-{\eta}_{s(j)2}^2}<w. \) Similarly, from Eq. 5 it is clear that the semi-partial correlation and standardized regression coefficients can be moderated in opposite directions. Suppose that we have two independent samples with multiple regression models containing the same predictors, and β x1/β x2 = w < 1, so that β x1 < β x2. Then if β x1 and β x2 have the same sign, R sx1 > R sx2 when \( \sqrt{1-{R}_{Ax2}^2}/\sqrt{1-{R}_{Ax1}^2}<w \). Both of these reversals require that something else in the model is moderated in a compensating way. In the first case, where η s1 < η s2, we require that η 2 s(j)2  < η 2 s(j)1 , i.e., the other effects on the DV are moderated in the opposite direction from the focal effect. In the second case, where β x1 < β x2, we require R 2 Ax1  < R 2 Ax2 , i.e., the tolerance of X must be lower in the sample with the larger standardized coefficient. Neither requirement is outlandish, although instances of the first case probably are rarer than instances of the second (which involves two different dependent variables). However, the second case is less likely to be investigated by researchers for the same reason that, as Smithson (2012) observes, researchers seldom concern themselves with heteroscedasticity in a predictor.

The take-home lesson is that in a multiple linear regression model, moderation of alternative effect-size measures for any single predictor is partly determined by what else is being (un)moderated in the model. Replication or moderation of one effect-size measure across samples is no guarantee of replication or identical moderation of an alternative effect-size measure across the same samples.

Detecting and dealing with differential moderator effects

Cohen’s d and partial η 2

If the \( \left(\overline{n}-1\right)/\tilde{n} \) ratio varies across samples, then there are differentially unequal sample sizes, but unfortunately the converse does not hold. For example, two independent samples with cell sizes of {40, 10} and {10, 40} will yield a significant chi-square test for unequal proportions (χ 2(1) = 36.00, p < .0005), but identical ratios, \( \left(\overline{n}-1\right)/\tilde{n}=1.5 \). A reasonable procedure is to first test for unequal proportions across studies, and then “align” the highest and lowest cell frequencies and re-test for unequal proportions.

In our earlier example, sample 1 had n 11 = n 12 = 25 whereas sample 2 had n 21 = 40 and n 22 = 10. Here, there is no need to align the highest and lowest cell frequencies because sample 1’s cell frequencies already are equal (the test for equal proportions gives χ 2(1) = 9.890, p = .0017). Suppose instead that the first sample had cell sizes n 11 = 20 and n 12 = 30. Now the chi-square test yields χ 2(1) = 16.667 (p < .0005). If we align the cells so that we have {30, 20} and {40, 10}, then the chi-square test yields χ 2(1) = 4.762 (p = .0291), still significant at the .05 level but reduced due to the alignment of the larger and smaller cell frequencies. Note that we still observe differential moderation of d and η. As before, η 2 p1 =.33 whereas η 2 p2 =.25, and now d 1 = 1.404, nearly equal to d 2 = 1.414.
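
The alignment procedure is straightforward to script. The following R lines reproduce the chi-square values just quoted (chisq.test without the continuity correction gives the Pearson statistic).

```r
# Test for equal cell proportions across two samples, before and after "aligning"
# each sample's larger cell with the other sample's larger cell
unaligned <- rbind(sample1 = c(20, 30), sample2 = c(40, 10))
chisq.test(unaligned, correct = FALSE)        # X-squared = 16.667, df = 1

aligned <- rbind(sample1 = c(30, 20), sample2 = c(40, 10))
chisq.test(aligned, correct = FALSE)          # X-squared = 4.762, df = 1, p = .029
```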

Can this kind of discrepancy identified in Eq. 7 occur in a collection of studies? Table 1 shows effect-sizes from Feingold’s (1994) meta-analysis of studies comparing male and female samples’ means on personality measures, in this case the subset comparing them on measures of assertiveness. Six studies (1, 9, 10, 12, 14, 15) have very unequal sample sizes (n 1 = number of males and n 2 = number of females). Applying the procedure described above, a chi-square test for equal proportions across studies yields χ 2(14) = 809.61 (p < .0005) and a chi-square test for “aligned” pairs of sample sizes still is very large, with χ 2(14) = 457.25 (p < .0005). We may conclude that the \( \left(\overline{n}-1\right)/\tilde{n} \) ratios vary across studies, with Study 1 a clear outlier in this regard.

Table 1 Studies with male-female comparisons on assertiveness

Consequently, the unequal sample sizes in these studies produce differential moderation of d and η across the studies. Study 1 has d = 0.26, twice that of studies 14 and 15; but the three corresponding η values are almost identical (.069, .060, and .062, respectively). Study 9 also has d = 0.26, identical to study 1, but its η = .117, much larger than study 1. Studies 4 and 5, both with d less than d for study 1, have η values greater than η for study 1, thus showing moderation of the two effect-size measures in opposite directions.

Differential form versus degree moderation

Smithson (2012) presents a parametric test for the between-sample equality of the variance ratio σ 2 y /σ 2 x (EVR) based on the log-likelihood of a bivariate normal distribution for X and Y conditional on a categorical moderator Z, employing submodels for the standard deviations using a log link. He reports evidence supporting the Type I error-rate accuracy of this test and reasonable power for moderate departures from normality in X and Y. He also extends this test to the case where Z is a continuous moderator, along with simulation studies examining its Type I error-rates and power. Scripts for maximum likelihood estimation in R, SPSS and SAS are available via the link provided in Smithson (2012).

For a categorical moderator, Smithson (2012) discusses incorporating the EVR test in a structural equations model (SEM) approach that enables researchers to test simultaneously for EVR, HeV in the IV and DV, homogeneity of error variance, moderation of correlations, and moderation of slopes. He provides examples in two SEM packages that can fit these models: lavaan (Rosseel, 2012) and MPlus (Muthén & Muthén, 2010). Readers may consult Smithson (2012) for further details, examples, and a link to worked examples in both environments.

Multiple regression and ANOVA: Comparing squared multiple correlations

Because detecting differential moderation of alternative effect-size measures in multiple regression and multi-way ANOVA hinges on detecting the moderation of squared multiple correlations, we require methods for estimating confidence intervals (CIs) around differences between squared multiple correlations. We survey five methods: Asymptotic, “modified asymptotic,” transformations to normality, bootstrapping, and estimation via structural equations models.

Olkin and Finn (1995) describe asymptotic methods for constructing CIs for the difference between two squared multiple correlations. Briefly,

\( \left({R}_1^2-{R}_2^2\right)-\left({\rho}_1^2-{\rho}_2^2\right)\sim N\left(0,{\sigma}^2\right) \), where \( {\sigma}^2=\left(4/{n}_1\right){R}_1^2{\left(1-{R}_1^2\right)}^2+\left(4/{n}_2\right){R}_2^2{\left(1-{R}_2^2\right)}^2 \), with n j denoting the sample sizes. However, as Algina and Keselman (1999) observe, this approach does not work well unless sample sizes are very large, and so we do not consider it further here.

Zou (2007) presents a “modified asymptotic” approach to constructing CIs for the difference between two correlations or between two squared correlations. For two independent squared multiple correlations, R 1 2 and R 2 2, his procedure is as follows. First, use a scaled noncentral F approximation to the distribution of the squared multiple correlation to obtain CIs around each of them, [l 1, u 1] and [l 2, u 2], respectively. Then, compute the lower and upper limits of the CI around R 1 2R 2 2 by these formulas:

$$ \begin{array}{c}\hfill L={R}_1^2-{R}_2^2-\sqrt{{\left({R}_1^2-{l}_1\right)}^2+{\left({u}_2-{R}_2^2\right)}^2}\hfill \\ {}\hfill U={R}_1^2-{R}_2^2+\sqrt{{\left({R}_2^2-{l}_2\right)}^2+{\left({u}_1-{R}_1^2\right)}^2}\hfill \end{array}. $$
(14)

Zou demonstrates that this approach outperforms asymptotic methods in the accuracy of CI coverage-rates for moderate sample sizes. However, a major limitation of this method is that it does not generalize to more than two samples.
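
A minimal sketch of the combination step in Eq. 14 is given below, assuming the component CIs [l 1, u 1] and [l 2, u 2] for each R 2 have already been obtained (e.g., from a noncentral-F-based routine); the function name and the illustrative inputs are ours.

```r
# Zou's (2007) modified asymptotic CI for the difference R1^2 - R2^2 (Eq. 14),
# given each sample's R^2 and its own confidence limits [l, u]
zou_diff_ci <- function(R2_1, l1, u1, R2_2, l2, u2) {
  diff <- R2_1 - R2_2
  L <- diff - sqrt((R2_1 - l1)^2 + (u2 - R2_2)^2)
  U <- diff + sqrt((R2_2 - l2)^2 + (u1 - R2_1)^2)
  c(lower = L, upper = U)
}

# Hypothetical inputs, purely to show the call
zou_diff_ci(R2_1 = .40, l1 = .25, u1 = .55, R2_2 = .20, l2 = .08, u2 = .35)
```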

Algina and Keselman (1999) investigated a variance-stabilizing transformation of the squared multiple correlation to normality proposed by Olkin and Finn, reporting minimum sample sizes required for adequately accurate CI coverage-rates under a variety of conditions. The transformation is

$$ z= \log \left(\frac{1+\sqrt{R^2}}{1-\sqrt{R^2}}\right) $$
(15)

with asymptotic variance 4/n. Thus, a CI around the difference between z 1 and z 2 is approximated by \( {z}_1-{z}_2\pm {t}_{\alpha /2}\sqrt{4/{n}_1+4/{n}_2} \). This is not the only such transformation (see, e.g., Hodgson, 1968), but in simulations it performs as well as or better than the other proposals (details are available from the first author), so we do not consider the others here.
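
In R, the transformation in Eq. 15 and the resulting CI for a difference can be sketched as follows; we use the standard normal quantile, which the t quantile in the expression above approaches at the sample sizes for which the approximation is reasonable, and the input values are hypothetical.

```r
# Olkin-Finn variance-stabilizing transformation (Eq. 15) and an approximate
# CI for the difference between two independent squared multiple correlations
of_z <- function(R2) log((1 + sqrt(R2)) / (1 - sqrt(R2)))

of_diff_ci <- function(R2_1, n1, R2_2, n2, conf = .95) {
  z1 <- of_z(R2_1); z2 <- of_z(R2_2)
  se <- sqrt(4 / n1 + 4 / n2)            # asymptotic variance of each z is 4/n
  crit <- qnorm(1 - (1 - conf) / 2)      # normal quantile in place of t
  c(diff = z1 - z2, lower = z1 - z2 - crit * se, upper = z1 - z2 + crit * se)
}

of_diff_ci(R2_1 = .40, n1 = 120, R2_2 = .20, n2 = 150)   # hypothetical values
```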

An advantage of the transformation in Eq. 15 is that its approximation to the normal distribution allows a generalization to comparisons among more than two squared multiple correlations. An overall measure of the heterogeneity of K squared multiple correlations is obtained via the standard chi-square statistic:

$$ V={\displaystyle \sum_{i=1}^K{\left({z}_i-{z}^{+}\right)}^2/{\sigma}_i^2}, $$
(16)

where z + is the weighted mean of the z i , with weights defined as 1/σ 2 i , and σ 2 i  = 4/n i . Asymptotically, z i  − z + ∼ N(0, σ 2 i ), so that when the null hypothesis is true, V ∼ χ 2 K − 1 . Otherwise, for a fixed-effects model, V has a noncentral chi-square distribution. Its noncentrality parameter is the sum of squared standardized effects (Smithson, 2003, p. 43), and it can be converted to a squared partial correlation coefficient that can be used as an effect-size measure in this context. A CI around the noncentrality parameter therefore can be transformed to a CI around this effect-size. Denoting the noncentrality parameter by ν, the transformation to a squared partial correlation is

$$ {\eta}^2=\frac{\nu }{\nu +N-1}, $$
(17)

where N is the sum of the sample sizes.
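
A sketch of the heterogeneity computation in Eqs. 16 and 17, using the transformation in Eq. 15 and made-up inputs (our own illustration, not code from the sources cited):

```r
# Heterogeneity of K squared multiple correlations on the Olkin-Finn z scale
of_z <- function(R2) log((1 + sqrt(R2)) / (1 - sqrt(R2)))

R2 <- c(.15, .30, .42)            # hypothetical squared multiple correlations
n  <- c(100, 80, 150)             # corresponding sample sizes

z <- of_z(R2)
w <- n / 4                        # weights 1/sigma_i^2, since sigma_i^2 = 4/n_i
z_plus <- sum(w * z) / sum(w)     # weighted mean of the z_i

V <- sum(w * (z - z_plus)^2)      # Eq. 16; V ~ chi-square(K - 1) under homogeneity
pchisq(V, df = length(R2) - 1, lower.tail = FALSE)   # p-value for heterogeneity

# Converting a noncentrality parameter nu to a squared partial correlation (Eq. 17)
nu <- 12                          # e.g., a limit of a CI around the noncentrality parameter
nu / (nu + sum(n) - 1)
```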

Chan (2009) presents a bootstrap method for comparing two squared multiple correlation coefficients. Let X be a vector of predictors of Y, and suppose there are two independent samples of these variables, S 1 and S 2 with sizes N 1 and N 2 , from populations whose squared multiple correlation coefficients are ρ 1 2 and ρ 2 2 , respectively. Chan’s bootstrap procedure is as follows.

Now suppose we take B bootstrap samples. For b = 1, 2,…, B:

  1. Randomly select N 1 and N 2 cases, (x i1b , y i1b ) and (x j2b , y j2b ) respectively for i = 1, 2, …, N 1 and j = 1, 2, …, N 2, with replacement from S 1 and S 2 .

  2. Compute the predicted values ŷ i1b and ŷ j2b ; from these and their sample means compute the sample squared multiple correlations R 1b 2 and R 2b 2 , and then obtain d b  = R 1b 2 − R 2b 2 .

The bootstrap standard error (BSE) is then

$$ {\widehat{\sigma}}_B=\sqrt{\frac{{\displaystyle {\sum}_{b=1}^B{\left({d}_b-\overline{d}\right)}^2}}{B-1}}. $$
(18)

The bootstrap CI (BCI) then is

$$ \left[\overline{d}-{\widehat{\sigma}}_B{z}_{1-\alpha /2},\overline{d}+{\widehat{\sigma}}_B{z}_{1-\alpha /2}\right], $$
(19)

and the bootstrap percentile interval is the appropriate percentiles of the bootstrap cumulative distribution of the rank-ordered d b .
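
A compact R implementation of Chan’s procedure, as we read the steps above (a sketch: dat1 and dat2 are assumed to be data frames containing the outcome and the predictors, and the same model formula is fitted in both samples).

```r
# Bootstrap CI for the difference between two independent squared multiple
# correlations, following the steps described above
boot_R2_diff <- function(formula, dat1, dat2, B = 2000, conf = .95) {
  r2 <- function(d) summary(lm(formula, data = d))$r.squared
  d_b <- replicate(B, {
    b1 <- dat1[sample(nrow(dat1), replace = TRUE), , drop = FALSE]
    b2 <- dat2[sample(nrow(dat2), replace = TRUE), , drop = FALSE]
    r2(b1) - r2(b2)
  })
  alpha <- 1 - conf
  list(
    diff      = r2(dat1) - r2(dat2),                              # observed difference
    boot_se   = sd(d_b),                                          # bootstrap standard error
    normal_ci = mean(d_b) + c(-1, 1) * qnorm(1 - alpha / 2) * sd(d_b),
    perc_ci   = unname(quantile(d_b, c(alpha / 2, 1 - alpha / 2)))
  )
}

# Example call (hypothetical data frames with columns y, x1, x2):
# boot_R2_diff(y ~ x1 + x2, males, females, B = 5000)
```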

Finally, Kwan and Chan (2014) propose a two-stage structural equations model (SEM) approach for comparing squared multiple correlations across groups. Unlike an earlier “phantom variable” SEM method for comparing squared multiple correlations (Cheung, 2009), their approach is not limited to comparing two groups. In the first stage, the original multi-group model is transformed into a model such that the squared multiple correlation coefficient becomes a free model parameter in the transformed model. In the second stage, the squared multiple correlations in the groups are compared by imposing linear between-group constraints on the parameters of interest in the transformed SEM, and model comparisons (e.g., between a null-hypothesis model where the squared correlations are identical versus the alternative model in which they differ) are performed via likelihood ratio tests.

Examples

For illustrative purposes, we present two examples, one using ANOVA and another with multiple regression. For simplicity, we restrict this presentation to three techniques: The Olkin-Finn transformation to normality, Zou’s CIs, and Chan’s bootstrap. We also do not illustrate form versus degree moderation; for illustrations thereof we refer the reader to Smithson (2012).

Our first example is an artificial 2 × 2 × 2 between-subjects factorial experimental design, with factors A, B, and C, and 20 observations in each cell (data and details of analyses are available from the first author). Table 2 shows the sample sums of squares, partial η 2, 95 % CIs for partial η 2, and semi-partial η 2 values. There is a moderate main effect for factor A, a strong main effect for C, a strong A*C interaction effect, and a strong 3-way interaction effect.

Table 2 Three-way ANOVA example

Suppose that we wish to interpret the interaction effects by using factor C as a stratifying moderator and computing the resulting simple effects for each panel of C. Table 3 displays the results of these analyses. The ratios of η p and η s for factor A are similar, 1.939 and 2.147, respectively, and their ratio is 1.107. However, the ratios for factor B are 0.177 and 0.143, respectively, giving a ratio of 1.238. Likewise, the ratios of η p and η s for the A*B effect are 1.011 and 0.830 giving a ratio of 1.218. The ratio of the \( \sqrt{1-{\eta}_{s(j)}^2} \) terms for B is 0.806 and the ratio of the \( \sqrt{1-{\eta}_{s(j)}^2} \) terms for A*B is 0.821. This latter ratio is smaller than the corresponding η s ratio (0.830), so that η p and η s are moderated in opposite directions.

Table 3 Three-way ANOVA simple effects

The 95 % CIs around the differences between the η 2 s(j) terms suggest that the semi-partial and partial correlations are moderated differently for the A*B effect. The Chan 95 % BCa bootstrap intervals are [−0.102, 0.340] for factor A, [−0.023, 0.388] for factor B, and [0.047, 0.499] for A*B. The Zou 95% intervals are [−0.098, 0.391] for factor A, [−0.033, 0.385] for factor B, and [0.120, 0.548] for A*B, reasonably similar to the bootstrap results. The Olkin−Finn technique agrees qualitatively with these assessments, yielding 95 % CIs of [−0.374, 1.072] for factor A, [−0.155, 1.291] for factor B, and [0.104, 1.550] for A*B. We may conclude that the partial and semi-partial correlations for the A*B effect are moderated differently from each other, with the semi-partial correlations being moderated more strongly and (slightly) in the opposite direction. The squared partial correlations for A*B are quite similar, at .449 and .439 for levels 1 and 2 on factor C, whereas the squared semi-partial correlations differ substantially, at .274 and .398.

Our final example is a multiple regression model with data from a study by Shin (2014), which focuses on risk-taking and psychological resilience. The dependent variable (Y) is the score on a risk-taking disposition scale (Blais & Weber, 2006), with predictors consisting of participants’ gender (G) and two covariates, a measure of psychological resilience (X 1, Smith, et al., 2008) and a measure of ruminative thinking (X 2, Brinker & Dozois, 2009). The model is

$$ {Y}_i^{\hbox{'}}={\beta}_0+{\beta}_1{X}_1+{\beta}_2{X}_2+{\beta}_3G+{\beta}_4{X}_2G, $$
(20)

so G takes the role of moderating the effect of ruminative thinking on risk-taking disposition.

The top part of Table 4 displays the unstandardized regression coefficient estimates and standard errors, and the standardized coefficients for this model. The remaining two parts of Table 4 show the simple-effects regression models for males and females. We now consider whether the partial correlations, semi-partial correlations, or standardized regression coefficients for X 2 have been moderated differently by gender. From Table 4, the standardized regression coefficients are .421 for the males and .193 for the females, and their ratio is 2.186. The corresponding partial correlations turn out to be .417 and .165 and their ratio is 2.529, while the semi-partial correlations are .402 and .160 and their ratio is 2.511, so the moderation effect appears to be stronger for both kinds of correlations than for the standardized regression coefficient.

Table 4 Regression model

Table 5 shows that the three methods of evaluating the differences between the relevant R Axj 2 pair and between the R Ayj 2 pair agree qualitatively. The CIs for the difference between R Ax1 2 and R Ax2 2 contain only positive values, suggesting that the semi-partial correlation and the standardized regression coefficient are moderated differently. The CI for the difference between R Ay1 2 and R Ay2 2 contains 0, so it is not clear whether the partial and semi-partial correlations are moderated differently from each other (although their ratios are very similar, so they probably are moderated similarly).

Table 5 The 95 % CIs for differences between R Axj 2 and R Ayj 2 pairs

A systematic comparison of alternative methods for detecting differences between squared multiple correlations has yet to be done, and this is an active topic of research. Nevertheless, the state of the art indicates that we have some serviceable methods for this purpose.

Conclusions and recommendations

The conditions under which differential moderation of alternative effect-size measures can occur are quite likely to crop up in multivariate research. Differential moderation of alternative effect-size measures poses a problem for both meta-analysis and the interpretation of moderator effects within a study. A simple solution would be for all researchers to use just one effect-size measure and ignore the others (the partial correlation in preference to the semi-partial correlation, for example). However, Smithson’s (2012) review of the scattered literature on differential moderation of simple slopes and correlations identified contradictory published advice regarding whether tests of simple slopes should be preferred over tests of correlations or vice-versa. Smithson concludes that a superior approach would be to model both parameters, and the relevant variance ratios, and ascertain when and how these are moderated differently. McGrath and Meyer (2006, p.398) provide a similar recommendation regarding the choice between η and Cohen’s d (also see our summary discussion below).

Likewise, here we argue that a more adaptive response is to recognize that alternative effect-size measures can be moderated differently and to take this into account when addressing questions about moderator effects and/or replications of studies. The keys to doing this reside in recognizing that alternative effect-size measures convey different information about effects, bearing in mind that replication or moderation outcomes depend on the choice of an effect-size measure, undertaking to model more than one effect-size measure, and taking reliability into account where possible. The factors driving divergent moderation and replication outcomes for alternative effect-size measures are unequal sample sizes (or base-rates), moderated scale reliability, heterogeneity of variance, and multiple moderator effects involving the dependent variable and/or its predictors. We will conclude by briefly discussing the implications of each of these for research practice and reporting.

The discrepancy between moderation of d and η is driven by moderation of the ratio \( \left(\overline{n}-1\right)/\tilde{n} \). As established by McGrath and Meyer (2006), the choice between d and η revolves around the issue of whether the researcher’s purposes are best served by a base-rate sensitive measure (η) or a base-rate insensitive measure (d). If the moderation of \( \left(\overline{n}-1\right)/\tilde{n} \) reflects a relevant phenomenon (e.g., different rates of a psychological disorder across subpopulations) then η might be preferred over d, whereas the converse would hold if moderation of \( \left(\overline{n}-1\right)/\tilde{n} \) is due to an irrelevant happenstance. Where there are no clear-cut reasons for preferring one statistic over the other, reporting both and assessing the moderation of sample sizes would be prudent.

The moderation of scale reliability can affect moderation of both d and η. It therefore stands as a potential explanatory factor for heterogeneity among effect-sizes in meta-analyses as well as among independent samples in the one study. Differential reliability across samples or studies clearly is important, both because of its implications regarding moderation and replication and because it is directly related to issues of measurement invariance.

Heterogeneity of variance drives the discrepancy between the moderation of unstandardized and standardized regression coefficients (or the special case of the simple regression coefficient versus correlation). We will not review the long-running debates regarding unstandardized vs standardized regression coefficients, but note that heterogeneity of variance is an additional factor for researchers to consider where moderation or replication is concerned. Above all, researchers should be aware that both are unlikely to be moderated identically, so a test for one is not a test for the other, and ideally they should examine variance heterogeneity in predictors as well as in the dependent variable. Unlike base-rate sensitivity of d vs η, it is not the case that one statistic is sensitive to variance heterogeneity whereas the other is not; instead both are differentially affected by it.

Finally, in multivariate studies, multiple moderator effects may cause discrepancies between the moderation of partial correlation, semi-partial correlation, and standardized regression coefficients. This is the case for moderator effects on the predictor under consideration as well as the dependent variable. Again, we will not enter debates such as whether to prefer partial over semi-partial correlations, but simply note that if researchers are going to choose just one of them then they should provide a clear rationale for doing so. Ideally, they should also report moderation of the relevant alternative measures when assessing moderator effects. If partial correlations are preferred, they are a function of the semi-partial correlation and R Ayj , so it is wise to consider reporting moderator effects on those two statistics as well. Likewise, if standardized regression coefficients are preferred, then moderator effects on the semi-partial correlation and R Axj would be relevant to report.

At the very least, researchers will be wise to exercise caution regarding claims about effect-size homogeneity or moderation in multivariate studies and meta-analyses, especially where questions of replication arise. Researchers who elect one effect-size measure should provide a rationale for that choice, and make it clear when claims about moderation or replication pertain only to that measure and not to alternative measures. It is essential to avoid the trap of believing that a test for moderation of one measure is a test for all. Ideally, future meta-analyses of multivariate studies should incorporate the techniques described in this paper for identifying and modeling differential moderation of alternative effect-size measures.