Mediation analysis is widely considered a promising method for providing an account of the mechanism through which an intervention has an effect on the targeted outcome (e.g., Gottfredson et al., 2015; MacKinnon, 2008). The total intervention effect is decomposed into a direct effect of the treatment and an indirect effect component that describes the effect of the treatment on the outcome via one or more mediating variables. In general, the indirect effect serves as a tentative explanation as to why the treatment effect occurs, whereas the direct effect can be interpreted as a general measure of all effects not explained by the mediator. The analysis of mediation processes has been largely dominated by Baron and Kenny’s (1986) stepwise approach, and the majority of research on mediation analysis has focused on the robustness and power of tests of mediation (see, e.g., Fritz & MacKinnon, 2007; Hayes & Scharkow, 2013; Judd & Kenny, 2010; MacKinnon, Lockwood, Hoffman, West, & Sheets, 2002; Shrout & Bolger, 2002). However, the recent reconceptualization of total, direct, and indirect effects that uses the counterfactual framework of causation (Imai, Keele, & Tingley, 2010; Imai, Keele, & Yamamoto, 2010; Keele, 2015; Pearl, 2001, 2012; VanderWeele, 2015) has provided a new framework for understanding the exact conditions under which mediation effects can be endowed with a causal interpretation. Due to this reconceptualization, it is now well understood that even when the intervention (the predictor) is under experimental control, unconfoundedness assumptions (similar to those made for purely observational data) must be imposed on the mediator–outcome relation. This observation is not entirely new, because similar statements can already be found in Judd and Kenny’s (1981) original exposition.

Blockage or enhancement designs have been proposed (see, e.g., Imai, Keele, Tingley, & Yamamoto, 2011; Pirlott & MacKinnon, 2016) that enable researchers to experimentally control the mediator in addition to the predictor. However, these designs also require strong assumptions. For example, only the mediator of interest can be affected by the manipulation, and no other potential mediating variable should be affected (e.g., Bullock, Green, & Ha, 2010). From a purely statistical perspective, alternative approaches to traditional mediation analysis have been proposed that may serve as a remedy when the mediator–outcome path is prone to confounding. For example, Zheng, Atkins, Zhou, and Rhew (2015) focused on the rank-preserving model (RPM; see also Small, 2012; Ten Have et al., 2007), which allows consistent estimation of mediation effects in the presence of confounders. However, the RPM also rests on unique assumptions. For example, one key requirement is that at least one covariate exists that has a strong interaction effect with the intervention predicting the mediator; that is, at least one moderator of the causal effect of the intervention on the mediator is needed. Alternatively, instrumental variables (IVs; see, e.g., Angrist & Krueger, 2001; Angrist & Pischke, 2009) can be used to generate consistent parameter estimates under confounding. In essence, IVs are used to isolate that part of the variation in the explanatory variable that is uninfluenced by the confounder. For an IV to be reliable, two conditions must be met (cf. Pearl, 2009): The IV must be (1) independent of the error term of the model (representing exogenous factors that affect the outcome when the explanatory variable under study is held constant; this is known as the exclusion restriction) and (2) not independent of the explanatory variable (often called the “strength” of an IV). Both conditions are crucial for valid results. 
Bound, Jaeger, and Baker (1995) showed that “weak” IVs (i.e., IVs that explain little of the variation in the explanatory variable) lead to biased effect estimates. The exclusion restriction cannot be tested directly in just-identified models (i.e., models with as many IVs as explanatory variables), and thus a strong substantive rationale is usually needed to justify the status of a variable as a reliable IV.

When these alternatives are not feasible, researchers are advised to use sensitivity analysis (e.g., Cox, Kisbu-Sakarya, Miočević, & MacKinnon, 2013; Imai, Keele, & Yamamoto, 2010; Imai et al., 2011; Mauro, 1990) or significance tests to evaluate the unconfoundedness assumption (in econometrics, these are often referred to as tests of exogeneity; see, e.g., Blundell & Horowitz, 2007; Caetano, 2015; de Luna & Johansson, 2014; Donald, Hsu, & Lieli, 2014; Hausman, 1978). Although sensitivity analysis is useful to assess the robustness of the empirical conclusions drawn from a mediation model to potential confounding, in the present study we focus on significance tests to detect whether influential confounding is present in an estimated mediation model. Common tests of unconfoundedness, again, require the availability of IVs (e.g., Blundell & Horowitz, 2007; de Luna & Johansson, 2014; Donald, Hsu, & Lieli, 2014; Hausman, 1978; Wooldridge, 2015). Caetano (2015) proposed a discontinuity test of exogeneity for a single predictor in a multivariate model that does not require an IV. However, this procedure depends on the continuity of the causal effect of the explanatory variable on the outcome, and it detects the presence of confounders by means of discontinuities in the expected outcomes conditional on all variables. Thus, the test depends on data situations in which discontinuities can unambiguously be attributed to the existence of confounders.

The present study focused on testing the unconfoundedness assumption without requiring IVs or discontinuities in the expected outcome. Instead, the proposed method makes use of higher-than-second moments of variables. In other words, the approach presented here assumes that variables are nonnormally distributed. Higher-than-second moment information of variables has been used in the past in the development of causal discovery algorithms (Mooij, Peters, Janzing, Zscheischler, & Schölkopf, 2016; Shimizu, Hoyer, Hyvärinen, & Kerminen, 2006; Shimizu et al., 2011), confirmatory methods to test the direction of dependence in linear models (Wiedermann & Li, 2018; Wiedermann & von Eye, 2015), estimation algorithms in independent component analysis (Hyvärinen, Karhunen, & Oja, 2001), and search algorithms for covariate selection in linear models (Entner, Hoyer, & Spirtes, 2012). In the present study, we discuss similar principles for the development of unconfoundedness tests in mediation analysis and evaluate their performance in detecting potential confounders in mediator–outcome relations with randomized treatment.

The remainder of this article is structured as follows: First, we review the assumptions about mediation models that need to be made when endowing indirect-effect parameters with causal meaning, and discuss the consequences of violated unconfoundedness assumptions concerning the mediator–outcome relation. Second, we show that, in the presence of a confounder, nonindependence of a mediator and regression errors can be detected when the latter is nonnormally distributed, and we propose a simple, two-step regression approach to evaluating whether unconfoundedness holds for the mediator–outcome component of a mediation model. Third, we introduce the Hilbert–Schmidt independence criterion (Gretton et al., 2008) as a kernel-based measure of independence that is able to detect nonindependence in linearly uncorrelated variables, and then discuss related asymptotic and resampling-based significance tests. Fourth, results from an extensive Monte Carlo simulation experiment are presented that (1) quantify the magnitude of bias of the indirect effect that can be expected due to confounding, and (2) evaluate the performance (i.e., size and statistical power) of independence tests when continuous or categorical confounders are present. Fifth, a real-world data example is presented that demonstrates how the proposed approach can be used to minimize the risk of erroneous conclusions concerning mediation processes due to confounding. We close the article with a discussion of data requirements to guarantee best-practice applications, and we provide a Monte Carlo–based power analysis tool for Type II error control of unconfoundedness tests.

Confounders in mediation models

Many previous studies have described the issue of biased parameter estimates in mediation models when confounders are present (see, e.g., Bullock et al., 2010; Fritz, Kenny, & MacKinnon, 2016; Greenland & Morgenstern, 2001; Imai, Keele, & Tingley, 2010; Imai, Keele, & Yamamoto, 2010; Judd & Kenny, 1981, 2010; MacKinnon, 2008; MacKinnon, Krull, & Lockwood, 2000; MacKinnon & Pirlott, 2015; VanderWeele, 2010, 2015). Assuming that the mediation mechanism can validly be characterized by a linear model and, without loss of generality, that continuous variables are standardized to exhibit zero means and unit variances, the standard (simple) mediation model with a randomized exposure (x; e.g., 0 = control, 1 = treatment), a continuous mediator (m), and a continuous outcome (y) is

$$ {\displaystyle \begin{array}{l}m= ax+{e}_m,\\ {}y= cx+ bm+{e}_y,\end{array}} $$
(1)

where c is the direct effect of x on y, ab defines the indirect effect of x on y through m, and c + ab denotes the total effect of x on y. Furthermore, em and ey are mutually independent error terms with zero means and variances \( {\upsigma}_{e_m}^2 \) and \( {\upsigma}_{e_y}^2 \). Parameter estimates \( \widehat{a} \), \( \widehat{b} \), and \( \widehat{c} \) are usually obtained using ordinary least squares (OLS) regression or structural equation models (SEMs) under a maximum likelihood or weighted least squares loss function (see, e.g., Iacobucci, Saldanha, & Deng, 2007; MacKinnon, 2008). The following assumptions are necessary to ensure that \( \widehat{c} \) and \( \widehat{a}\widehat{b} \) constitute unbiased estimates of the direct and indirect effects (cf., e.g., Imai, Keele, Tingley, & Yamamoto, 2014; Loeys, Talloen, Goubert, Moerkerke, & Vansteelandt, 2016; Pearl, 2014):

  • (A1) no unmeasured confounder of the relation between x and m,

  • (A2) no unmeasured confounder of the relation between x and y,

  • (A3) no unmeasured confounder of the relation between m and y, and

  • (A4) no common causes of m and y that are affected by x.

Although assumptions A1 and A2 can be expected to hold when the predictor x is randomized, assumptions A3 and A4 are never guaranteed to be satisfied, even when x is under experimental control. In particular, A3 requires the absence of unmeasured common causes of m and y, which implies that the covariance of em and ey is zero, cov(em, ey) = 0—that is, the unobserved causal influences of m are uncorrelated with the unobserved causal influences of y.
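To make Eq. (1) concrete, the two regressions and the resulting effect decomposition can be sketched as follows. This is a minimal Python illustration with assumed effect sizes (a = 0.5, b = 0.4, c = 0.3); numpy's least-squares routine stands in for a dedicated SEM estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
a, b, c = 0.5, 0.4, 0.3                   # assumed true paths for this sketch

x = rng.integers(0, 2, n).astype(float)   # randomized binary treatment
m = a * x + rng.normal(size=n)            # mediator equation of Eq. (1)
y = c * x + b * m + rng.normal(size=n)    # outcome equation of Eq. (1)

def ols(preds, z):
    """OLS coefficients of z on an intercept plus the given predictors."""
    X = np.column_stack([np.ones(len(z))] + list(preds))
    return np.linalg.lstsq(X, z, rcond=None)[0]

a_hat = ols([x], m)[1]                    # estimate of a
c_hat, b_hat = ols([x, m], y)[1:3]        # estimates of c and b
indirect = a_hat * b_hat                  # estimate of the indirect effect ab
total = c_hat + indirect                  # estimate of the total effect c + ab
```

With assumptions A1–A4 satisfied (no confounder in this data-generating process), the product \( \widehat{a}\widehat{b} \) recovers the population indirect effect ab = 0.2.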

In contrast, when an unobserved confounder (u) is present, Model (1) extends to

$$ {\displaystyle \begin{array}{l}m= ax+{d}_mu+{e}_m,\\ {}y= cx+ bm+{d}_yu+{e}_y,\end{array}} $$
(2)

where dm and dy quantify the magnitude of the confounding effects. Erroneously using Model (1) to obtain OLS estimates for c and b leads to the biased estimates (see Appendix A of Loeys et al., 2016, for a proof)

$$ {\displaystyle \begin{array}{l}\widehat{c}=c-a{d}_m{d}_y\frac{\upsigma_u^2{\upsigma}_x^2}{\upsigma_x^2{\upsigma}_m^2-\operatorname{cov}{\left(x,m\right)}^2},\\ {}\widehat{b}=b+{d}_m{d}_y\frac{\upsigma_u^2{\upsigma}_x^2}{\upsigma_x^2{\upsigma}_m^2-\operatorname{cov}{\left(x,m\right)}^2}.\end{array}} $$
(3)

From the equations above, it follows that the magnitude of the bias increases with the magnitude of the confounding effects. In other words, even under randomization of x, spurious effects in the m–y relation can occur (Fritz et al., 2016; Judd & Kenny, 1981; Loeys et al., 2016; MacKinnon, 2008; MacKinnon & Pirlott, 2015). In the following section, we show that, by making use of variable information beyond means, variances, and covariances (i.e., information that becomes accessible when variables deviate from the normal distribution), the unconfoundedness assumption of the mediation model becomes testable.
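The bias formula in Eq. (3) can be checked by simulation. The Python sketch below (with assumed confounder effects dm = dy = 0.6) compares the naive OLS estimate \( \widehat{b} \), which ignores u, with the value predicted by Eq. (3):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
a, b, c = 0.5, 0.4, 0.3
dm, dy = 0.6, 0.6                         # assumed confounder effects

x = rng.integers(0, 2, n).astype(float)
u = rng.normal(size=n)                    # unobserved confounder
m = a * x + dm * u + rng.normal(size=n)   # mediator equation of Eq. (2)
y = c * x + b * m + dy * u + rng.normal(size=n)  # outcome equation of Eq. (2)

# Naive OLS of y on (1, x, m), erroneously ignoring u (i.e., fitting Model (1))
X = np.column_stack([np.ones(n), x, m])
c_hat, b_hat = np.linalg.lstsq(X, y, rcond=None)[0][1:]

# Bias of b predicted by Eq. (3), using empirical moments
bias_b = dm * dy * np.var(u) * np.var(x) / (
    np.var(x) * np.var(m) - np.cov(x, m)[0, 1] ** 2)
```

Here \( \widehat{b} \) exceeds the true b by approximately the predicted amount, and \( \widehat{c} \) is pulled below the true c, in line with Eq. (3).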

Detection of confounders under nonnormality

In the present study, detecting influential confounders under nonnormality essentially relies on properties of (in)dependence of sums of random variables summarized in the Darmois–Skitovich (DS) theorem (Darmois, 1953; Skitovich, 1953). This theorem states that if two linear functions v1 and v2 of the same independent continuous random variables wi (i = 1, . . . , k, with k ≥ 2), \( {v}_1={\sum}_{i=1}^k{\upalpha}_i{w}_i \) and \( {v}_2={\sum}_{i=1}^k{\upbeta}_i{w}_i \), with αi and βi being constants, are independent, then all component variables wi for which αiβi ≠ 0 follow a normal distribution. The reverse corollary, however, implies that if a common wi with αiβi ≠ 0 is nonnormal, then v1 and v2 must be nonindependent. It is easy to show that this reverse corollary applies in the context of the mediator–outcome relation whenever the confounder u or the error term em deviates from normality.
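A small numerical illustration of the reverse corollary (a Python sketch; the squared-value correlation used here is merely one convenient nonlinear check, not part of the theorem itself): the linear combinations v1 = w1 + w2 and v2 = w1 − w2 are uncorrelated whenever the components have equal variances, yet they are dependent as soon as a shared component is nonnormal.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

def shared_component_corr(w1, w2):
    """v1 and v2 share w1 and w2 with weights (1, 1) and (1, -1), so
    cov(v1, v2) = var(w1) - var(w2) = 0 for equal variances, yet v1 and v2
    are dependent whenever a shared component is nonnormal (DS theorem)."""
    v1, v2 = w1 + w2, w1 - w2
    lin = np.corrcoef(v1, v2)[0, 1]           # linear correlation (about 0)
    nonlin = np.corrcoef(v1**2, v2**2)[0, 1]  # one simple nonlinear check
    return lin, nonlin

# Normal components: uncorrelated AND independent
lin_n, nonlin_n = shared_component_corr(rng.normal(size=n), rng.normal(size=n))

# Centered exponential components (skewness 2): uncorrelated but dependent
e1 = rng.exponential(size=n) - 1.0
e2 = rng.exponential(size=n) - 1.0
lin_e, nonlin_e = shared_component_corr(e1, e2)
```

In the normal case both checks are near zero; in the skewed case the linear correlation stays near zero while the squared-value correlation is clearly positive.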

We start by partialing out the effect of the randomized intervention x, which, in the present study, is assumed to be represented by a binary variable. However, the same approach can be used when the treatment variable is polytomous (here, j – 1 dummy variables are used to partial out the effect of j study groups) or continuous. This can be done by extracting the estimated residuals of two auxiliary regressions in which m and y are regressed on x; that is, m′ = m − a′x and y′ = y − c′x can be conceptualized as “purified” measures of m and y, with a′ being the regression coefficient when regressing m on x, and c′ being the regression coefficient when regressing y on x. The mediator–outcome part of Model (2) can then be expressed as

$$ \begin{array}{c}{m}^{\prime }={d}_mu+{e}_m,\\ {}{y}^{\prime }=b{m}^{\prime }+{d}_yu+{e}_y,\\ {}=\left(b{d}_m+{d}_y\right)u+b{e}_m+{e}_y.\end{array} $$
(4)

According to the regression anatomy formula, first described by Frisch and Waugh (1933; see Angrist & Pischke, 2009, for further details), it follows that the parameters b, dm, and dy, and the regression errors in Eq. (4) are identical to the ones obtained through Model (2). In other words, the outcome variation that can be explained by independent variables used in a two-step process is identical to the variation explained by the multiple regression model that considers all independent variables simultaneously. Therefore, it follows that the residual terms (i.e., the unexplained variation in the outcome) will also be identical for the two approaches, which implies that either set of residuals conveys exactly the same information about the unknown population errors (cf. Lovell, 2008).
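The regression anatomy result can be verified directly. In the Python sketch below (with assumed effect sizes), the two-step slope and residuals coincide with those of the simultaneous multiple regression up to numerical precision:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000
x = rng.integers(0, 2, n).astype(float)
u = rng.normal(size=n)                      # confounder
m = 0.5 * x + 0.6 * u + rng.normal(size=n)
y = 0.3 * x + 0.4 * m + 0.6 * u + rng.normal(size=n)

def ols(preds, z):
    """Design matrix and OLS coefficients of z on an intercept plus predictors."""
    X = np.column_stack([np.ones(len(z))] + list(preds))
    return X, np.linalg.lstsq(X, z, rcond=None)[0]

# Simultaneous multiple regression: y on x and m
X_full, coef_full = ols([x, m], y)
b_full = coef_full[2]
resid_full = y - X_full @ coef_full

# Two-step approach: partial x out of m and y, then regress residual on residual
X_m, coef_m = ols([x], m)
X_y, coef_y = ols([x], y)
m_p = m - X_m @ coef_m                      # "purified" mediator m'
y_p = y - X_y @ coef_y                      # "purified" outcome y'
_, coef_p = ols([m_p], y_p)
b_two_step = coef_p[1]
resid_two = y_p - coef_p[0] - b_two_step * m_p
```

The slope on m and the residual vector agree across the two approaches, which is exactly the property exploited in the remainder of this section.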

Assuming that the confounder u is erroneously ignored, the model \( {y}^{\prime }={b}^{\prime }{m}^{\prime }+{e}_y^{\prime } \) gives the biased OLS estimate of the mediator–outcome relation (for simplicity, now denoted b′) described in Eq. (3). Furthermore, the error term \( {e}_y^{\prime } \) associated with the misspecified partial regression model can be rewritten as

$$ {\displaystyle \begin{array}{c}{e}_y^{\prime }={y}^{\prime }-{b}^{\prime }{m}^{\prime}\\ {}={bm}^{\prime }+{d}_yu+{e}_y-{b}^{\prime}\left({d}_mu+{e}_m\right)\\ {}=\left[\left(b-{b}^{\prime}\right){d}_m+{d}_y\right]u+\left(b-{b}^{\prime}\right){e}_m+{e}_y.\end{array}} $$
(5)

Because (b − b′) ≠ 0 will hold when the mediator–outcome path is confounded (i.e., when dmdy ≠ 0), it follows from Eq. (5) that m′ and \( {e}_y^{\prime } \) consist of the same independent (continuously distributed) random variables u and em when a confounder is present. Thus, assuming that at least one of the two common components deviates from normality, \( {e}_y^{\prime } \) and m′ will be nonindependent according to the DS theorem. In contrast, if the mediator–outcome relation is unconfounded (i.e., if either dm or dy is zero), (b − b′) is zero, and Eq. (5) reduces to \( {e}_y^{\prime } \) = ey. In other words, in the confounder-free case, \( {e}_y^{\prime } \) and m′ will be independent due to the independence of u, em, and ey. The magnitude of the dependence of \( {e}_y^{\prime } \) and m′ increases with the sizes of dm and dy and the magnitude of nonnormality of the involved component variables. Furthermore, when the reverse corollary of the DS theorem holds, nonnormality of ey can also be expected to affect the dependence of \( {e}_y^{\prime } \) and m′ through its impact on the distribution of \( {e}_y^{\prime } \) (cf. Eq. (5)). Statistical inference methods that evaluate nonindependence beyond linear uncorrelatedness can thus be used to test for the presence of confounders. In the following section, we introduce such a method, the Hilbert–Schmidt independence criterion (HSIC; Gretton et al., 2008).

Testing for the presence of confounders

As we showed in the previous section, under nonnormality of errors, confounders can be detected through evaluating the independence of linearly uncorrelated residuals (\( {r}_y^{\prime } \)) and “purified” mediator scores (m′). Detecting nonindependence structures in linearly uncorrelated data is extensively discussed in the area of blind source separation (Hyvärinen et al., 2001). Formally, stochastic independence of two variables, v1 and v2, is defined as E[f(v1)g(v2)] − E[f(v1)]E[g(v2)] = 0 (with E being the expected value operator) for any absolutely integrable functions f and g. Two immediate consequences emerge from this definition: First, uncorrelatedness can be considered a special case of stochastic independence (obtained when using the identity functions f(v1) = v1 and g(v2) = v2). Although stochastic independence implies uncorrelatedness, the reverse statement does not hold; that is, uncorrelatedness does not necessarily imply independence. Second, tests of stochastic independence can, in principle, be constructed by inserting specific functions f and g and testing whether cov(f(v1), g(v2)) = 0 holds. Of course, inserting all possible functions is not feasible in practice, which implies that such nonlinear correlation tests induce additional Type II errors (Wiedermann, Artner, & von Eye, 2017). In the present study, we thus focus on the HSIC, a kernel-based measure of independence that can be shown to be an omnibus measure for detecting any form of dependence in the large-sample limit (Gretton et al., 2008).
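A standard illustration of the second consequence (a Python sketch): for v1 standard normal and v2 = v1², the identity functions yield a covariance of essentially zero even though v2 is a deterministic function of v1; choosing f(v1) = v1² exposes the dependence.

```python
import numpy as np

rng = np.random.default_rng(5)
v1 = rng.normal(size=100_000)
v2 = v1**2                       # fully determined by v1, yet uncorrelated with it

# Identity functions f(v1) = v1, g(v2) = v2: the plain covariance is about 0
ident = np.mean(v1 * v2) - np.mean(v1) * np.mean(v2)

# A different choice, f(v1) = v1**2, g(v2) = v2, exposes the dependence
square = np.mean(v1**2 * v2) - np.mean(v1**2) * np.mean(v2)
```

The first quantity is near zero (E[v1³] = 0 for the normal distribution), whereas the second is clearly positive, illustrating why a single (linear) correlation check cannot establish independence.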

For notational simplicity, we present the HSIC for the two original variables, v1 and v2, with sample size n. Let \( H=I-{n}^{-1}\mathbf{1}{\mathbf{1}}^{\mathrm{T}} \), with I being an identity matrix of order n, 1 being an n × 1 vector of ones, and 1T being the transpose of 1. Furthermore, let K and L be n × n matrices with cell entries kij = k(v1(i), v1(j)) and lij = l(v2(i), v2(j)), where k and l define Gaussian kernels of the form \( k\left({v}_{1(i)},{v}_{1(j)}\right)=\exp \left(-{\upsigma}^{-2}{\left\Vert {v}_{1(i)}-{v}_{1(j)}\right\Vert}^2\right) \), with \( {\left\Vert {v}_{1(i)}-{v}_{1(j)}\right\Vert}^2 \) denoting the squared Euclidean distance (l follows the same definition, replacing v1 with v2; cf. Sen & Sen, 2014). Here, σ represents a bandwidth parameter. It is well known that the performance of kernel-based methods depends on the calibration of the bandwidth parameter (Schölkopf & Smola, 2002). Although σ is often set to 1 (cf. Sen & Sen, 2014), the so-called median heuristic—that is, using the median of all pairwise Euclidean distances (Sriperumbudur, Fukumizu, Gretton, Lanckriet, & Schölkopf, 2009)—constitutes a popular and powerful alternative (see, e.g., Garreau, 2017).

The HSIC is defined as

$$ HSIC=n\bullet {\widehat{T}}_n, $$
(6)

where \( {\widehat{T}}_n \) is based on the trace of the matrix product KHLH, or, more specifically,

$$ {\widehat{T}}_n=1/{n}^2\mathrm{trace}(KHLH). $$
(7)

When v1 and v2 are stochastically independent, \( {\widehat{T}}_n \) approximates zero. If the HSIC significantly deviates from zero, the null hypothesis of independence of v1 and v2 can be rejected. Gretton et al. (2008) recommended approximating the null distribution of the test statistic by a two-parameter gamma distribution (from now on denoted the gHSIC test). In the context of OLS regression, one has to make use of the estimated residuals instead of the “true” (unobservable) errors. Sen and Sen (2014) showed that replacing “true” errors with estimated residuals alters the limiting distribution of the test statistic and suggested a bootstrap alternative for approximating the distribution of \( n\bullet {\widehat{T}}_n \) under the null hypothesis (from now on denoted the bHSIC test).
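The statistic in Eqs. (6) and (7) is straightforward to compute. The Python sketch below implements it with Gaussian kernels and, as a simple stand-in for the gamma and bootstrap approximations discussed above, attaches a permutation-based p-value:

```python
import numpy as np

def hsic_stat(v1, v2, sigma=1.0):
    """HSIC statistic n * T_n with Gaussian kernels (Eqs. 6 and 7):
    T_n = trace(KHLH) / n^2, with centering matrix H = I - (1/n) 1 1'."""
    v1, v2 = np.asarray(v1, float), np.asarray(v2, float)
    n = len(v1)
    def gram(v):
        d2 = (v[:, None] - v[None, :]) ** 2   # squared Euclidean distances
        return np.exp(-d2 / sigma**2)         # Gaussian kernel, bandwidth sigma
    K, L = gram(v1), gram(v2)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n        # equals n * T_n

def hsic_perm_test(v1, v2, n_perm=200, seed=0):
    """Permutation p-value: a simple stand-in for the gamma (gHSIC) and
    bootstrap (bHSIC) approximations described in the text."""
    rng = np.random.default_rng(seed)
    obs = hsic_stat(v1, v2)
    null = [hsic_stat(v1, rng.permutation(v2)) for _ in range(n_perm)]
    return obs, float(np.mean([s >= obs for s in null]))
```

The unit bandwidth is used for simplicity; the median heuristic can be mimicked by passing the median of the pairwise distances as `sigma`.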

In the context of testing unconfoundedness of the mediator–outcome path of a mediation model, we suggest the following stepwise approach (the corresponding hypotheses, statistical decisions, and implications for data analysis are also summarized as a flowchart in Fig. 1):

  1. Regress the mediator m and the outcome y on the treatment indicator x, and use the estimated residuals of the two models (i.e., \( {m}^{\prime }=m-{\widehat{a}}^{\prime }x \) and \( {y}^{\prime }=y-{\widehat{c}}^{\prime }x \)) as treatment-adjusted (“purified”) measures of m and y.

  2. Regress the “purified” outcome y′ on the “purified” mediator m′ and extract the corresponding regression residuals \( {r}_y^{\prime }={y}^{\prime }-{\widehat{b}}^{\prime }{m}^{\prime } \).

  3. Evaluate the independence of \( {r}_y^{\prime } \) and m′ using the HSIC. If the HSIC test rejects the null hypothesis of independence, confounders of the m–y path are likely to be present (cf. Fig. 1).
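Under assumed effect sizes, the three steps can be sketched as follows (a Python illustration; the squared-residual correlation in Step 3 is only a crude stand-in for the HSIC tests, used here to make the logic of the procedure transparent):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000
a, b, c, dm, dy = 0.5, 0.4, 0.3, 0.6, 0.6   # assumed effect sizes

x = rng.integers(0, 2, n).astype(float)     # randomized treatment
u = rng.exponential(size=n) - 1.0           # skewed unobserved confounder
m = a * x + dm * u + rng.normal(size=n)
y = c * x + b * m + dy * u + rng.normal(size=n)

def resid(z, *preds):
    """OLS residuals of z on an intercept and the given predictors."""
    X = np.column_stack([np.ones(len(z))] + list(preds))
    return z - X @ np.linalg.lstsq(X, z, rcond=None)[0]

m_p = resid(m, x)          # Step 1: "purified" mediator m'
y_p = resid(y, x)          #         "purified" outcome y'
r_y = resid(y_p, m_p)      # Step 2: residuals r'_y = y' - b'm'

# Step 3 (crude stand-in for the HSIC tests): r'_y and m' are uncorrelated
# by construction, but their squares correlate when a confounder is present.
lin = np.corrcoef(m_p, r_y)[0, 1]
nonlin = np.corrcoef(m_p**2, r_y**2)[0, 1]
```

The linear correlation is zero up to numerical precision regardless of confounding, whereas the nonlinear check is clearly positive here because m′ and \( {r}_y^{\prime } \) share the skewed component u.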

Fig. 1 Hypothesis, statistical decisions, and implications for the proposed confounder detection approach

In Step 3, either the gHSIC or the bHSIC test can be applied. Although the bHSIC test has the advantage that the empirical size of the test can be expected to be close to the nominal significance level when using estimated residuals instead of the (unknown) “true” errors, the gHSIC test is computationally less demanding. Because a direct comparison of the performance of the two HSIC procedures is still missing in the literature, we considered both approaches in the present study. In addition, we studied the impact of bandwidth selection on the statistical performance of the gHSIC test.

Monte Carlo simulation study

To quantify the bias of the indirect-effect estimates due to confounding and to evaluate the performance of the HSIC tests in detecting confoundedness under nonnormality, a simulation experiment was performed using the R statistical programming environment (R Core Team, 2019). Data were generated according to the confounded mediation model given in Eq. (2)—that is, x is a binary variable reflecting control (x = 0) and treatment assignment (x = 1; the study is restricted to equal group sizes), m is a continuous mediator, and y is a continuous outcome. The mediator–outcome relation is affected by the confounder u. Model intercepts were fixed at zero, and the regression coefficients dm and dy were selected to account for zero, small (2% of the variance of the dependent variable), medium (13% of the variance), and large confounding effects (26% of the variance; Cohen, 1988, pp. 412–414). The regression coefficients of the mediation model (c, a, and b) were selected to reflect medium effect sizes. The two error terms ey and em were drawn either from the normal distribution or from various skewed (gamma-distributed) populations. Blanca, Arnau, López-Montiel, Bono, and Bendayan (2013) evaluated 693 empirically observed distributions of various psychological variables and observed a skewness range from −2.49 to 2.33. More recently, Cain, Zhang, and Yuan (2017) evaluated 1,567 univariate distributions and reported skewness estimates for the empirically observed 1st and 95th percentiles of −2.08 and 2.77 across all distributions. Because the sign of the skewness has no impact on the performance of the approach, we considered positively skewed populations with skewnesses of 0.75, 1.5, and 2.25, which can be considered representative of nonnormal variables observed in psychological research.

Two different types of confounders were considered. In one half of the simulation study, the confounder was continuous with skewnesses γu = 0 (standard normal), 0.75, 1.5, and 2.25. In the other half, the confounder was binary with group proportions P = .5, .32, .20, and .13, which is in line with skewnesses of a Bernoulli variable of \( {\upgamma}_u=\left(1-2P\right)/\sqrt{\left(1-P\right)P} \) = 0, 0.75, 1.5, and 2.25. The sample sizes were 400 and 800, which correspond to minimum detectable effect sizes (MDES) of 0.249 and 0.176 for the total effect xy of a randomized controlled trial with individual random assignment (assuming equal group sizes, a nominal significance level of 5%, and statistical power of 80%; cf. Dong & Maynard, 2013). The simulation factors were fully crossed and 500 samples were generated for each of the 4 (effect size of dm) × 4 (effect size of dy) × 4 (distribution of u) × 4 (distribution of em) × 4 (distribution of ey) × 2 (sample size) × 2 (type of confounder) = 4,096 simulation conditions.
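The skewed error terms can be generated from standardized gamma distributions: for Gamma(shape = k, scale = 1), the skewness is 2/√k, so a target skewness γ implies k = (2/γ)². A Python sketch of this data-generating step:

```python
import numpy as np

def skewed_error(gamma, size, rng):
    """Standardized draws with a target skewness.
    For Gamma(shape=k, scale=1), skewness = 2/sqrt(k), so k = (2/gamma)^2;
    centering by the mean k and scaling by the SD sqrt(k) yields zero mean
    and unit variance. gamma = 0 falls back to the standard normal."""
    if gamma == 0:
        return rng.normal(size=size)
    k = (2.0 / gamma) ** 2
    g = rng.gamma(shape=k, scale=1.0, size=size)
    return (g - k) / np.sqrt(k)

rng = np.random.default_rng(8)
e = skewed_error(1.5, 1_000_000, rng)   # error term with skewness 1.5
```

Because the draws are standardized, the sample skewness can be checked directly as the third moment of `e`.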

For each variable triplet {x, m, y}, we first estimated the (biased) indirect effect ignoring the confounder u. Two outcome measures were computed for the indirect effect: the mean bias \( \widehat{\theta}-\theta \) and the mean percent bias \( \left(\widehat{\theta}-\theta \right)/\theta \) × 100, with \( \widehat{\theta} \) = \( \widehat{a}\widehat{b} \) being the OLS estimate and θ the true indirect effect ab. In line with previous studies (see, e.g., Ames, 2013; Collins, Schafer, & Kam, 2001; Wiedermann, Merkle, & von Eye, 2018), absolute biases ≥ 40% were considered significant. Next, the effect of x was partialed out of the outcome and the mediator, the “purified” outcome (y′) was regressed on the “purified” mediator (m′), and the independence of m′ and the extracted residuals \( {r}_y^{\prime }={y}^{\prime }-{\widehat{b}}^{\prime }{m}^{\prime } \) was evaluated using the HSIC tests. Three different versions of the HSIC test were applied: (1) the bHSIC test with 200 resamples and (following Sen and Sen’s, 2014, recommendation) unit bandwidths, (2) the asymptotic gHSIC test with unit bandwidths (gHSIC1), and (3) the asymptotic gHSIC test in which bandwidths were selected using the median heuristic (gHSICMd). All significance tests were applied under a nominal significance level of 5%.

It is important to note that, in the present case, Type I error scenarios (i.e., rejecting the true null hypothesis of independence of predictors and residuals) only exist when all variables are normally distributed. The reason for this is that uncorrelatedness implies independence only in the multivariate normal case (cf. Hyvärinen et al., 2001). To quantify the Type I error robustness of the three HSIC tests, we used Bradley’s (1978) liberal robustness criterion—that is, a test is considered robust if the empirical Type I error rates fall within the interval 2.5%–7.5%.

Magnitude of bias

In general, the bias of indirect-effect estimates is not affected by the magnitude of asymmetry of the confounder and the error terms. Thus, we only report the results for data scenarios in which all variables are normal. Table 1 gives the mean bias and mean percent bias for the indirect effect in the presence of a (continuous or binary) confounder as a function of sample size and magnitude of the confounder effects dm and dy. In general, the measurement level of the confounder does not affect the magnitude of bias. Thus, when discussing the results, we focus on the case of a continuous confounder. As expected, no biases (i.e., mean percent biases ranging from – 1.24% to 2.35%) occur when at least one of the paths involving the confounder is zero. Furthermore, when at least one of the confounder effects is small, observed biases are still within a tolerable range of 4.71% to 21.86%. In general, biases systematically increase with the magnitude of confounding effects. Large biases (i.e., absolute biases larger than 40%) only occur for medium and large confounder effects (i.e., in six out of 32 conditions). Here, biases range from 43.96% to 66.61%. Within the parameter space of Cohen’s (1988) effect size conventions, we may conclude that biases of the indirect effect tend to be small for a broad range of effect size scenarios.

Table 1 Mean bias and mean percent bias of indirect effect estimates as a function of sample size and magnitude of the confounding effects for continuous and binary confounders

Type I error results

Next, we focus on the Type I error robustness of the HSIC tests—that is, data scenarios in which both error terms, ey and em, follow a normal distribution and the confounder is continuous and normal. In these cases, the HSIC tests should not be able to detect the presence of a confounder, and rates of rejecting the null hypothesis should be close to the nominal significance level of .05. The upper panel of Table 2 gives the empirical Type I error rates for the tests as a function of sample size and magnitude of the confounder effects. In general, the gHSIC1 test is overly conservative; that is, in all cases, the empirical Type I error rate falls below Bradley’s (1978) liberal robustness interval of .025–.075. Conservatism is also observed for the gHSICMd test, although to a lesser extent: in ten out of 32 conditions, its empirical Type I error rates fall below Bradley’s liberal robustness interval. Of course, overly conservative decisions under H0 do not invalidate the results of the gHSIC tests per se—conservative Type I error rates usually lead to lower statistical power. In contrast, the Type I error rates of the bHSIC test are close to 5%, independent of the sample size and the magnitude of the confounding effects.

Table 2 Empirical rates of rejecting the null hypothesis of independence of the gHSIC and bHSIC tests, as a function of sample size and magnitude of the confounder effects dm and dy

The lower panel of Table 2 gives the observed rates of rejecting the null hypothesis of independence when the confounder is binary with equal group sizes (i.e., P = .5). Although these cases also imply a skewness of zero, the observed rates do not, technically, correspond to Type I errors, because the confounder (which is, at the same time, a common component of m′ and \( {r}_y^{\prime } \)) deviates from the normal distribution. Thus, nonindependence between m′ and \( {r}_y^{\prime } \) exists, according to the reverse corollary of the DS theorem. However, when P = .5, the behavior of the independence tests matches their behavior in the normal case. That is, the proportion of rejected null hypotheses is often close to zero for the gHSIC1 and gHSICMd tests, and close to the nominal significance level for the bHSIC test. Cases with large confounder effects constitute the only exception; here, empirical rejection rates are slightly elevated for all three tests.

Power results

To quantify the power of the HSIC tests, we next focus on nonnormal data scenarios (i.e., cases in which γu, \( {\upgamma}_{e_m} \), and \( {\upgamma}_{e_y} \) are nonzero). Figure 2 gives the empirical power of the gHSIC1, gHSICMd, and bHSIC tests for continuous and binary confounders based on the main effects of the simulation study (i.e., aggregating across all levels of the remaining simulation factors). Specifically, power curves are given for the skewness of u, em, and ey, the magnitude of the confounding effects (dm and dy), and the sample size. As expected, the power of all tests increases with the skewness of the error terms, the magnitude of the confounding effects, and the sample size. The gHSIC1 test is less powerful than the two other procedures. The test is slightly more powerful when the confounders are continuous in nature. In addition, the bHSIC test for binary confounders has approximately the same power as the gHSIC1 test for continuous confounders. The gHSICMd and bHSIC tests are almost indistinguishable with respect to detecting continuous confounders, except for highly asymmetric outcome-related error terms (i.e., \( {\upgamma}_{e_y} \) = 2.25). There, the gHSICMd test outperforms the bHSIC test. Test performance is largely unaffected by the measurement level of the confounder, except for large effects of dm and slightly unbalanced binary confounders (i.e., P = .32, which implies γu = 0.75). In both cases, a slight power advantage of the gHSICMd test is observed when the confounder is binary. Most importantly, the skewness of a continuous confounder has virtually no impact on the statistical power of the tests. In other words, both the distributional shape and the measurement level of the unobserved confounder are largely negligible when testing the unconfoundedness assumption.

Fig. 2

Observed power of gHSIC and bHSIC tests to detect binary and continuous confounders for the main effects of the skewness of the confounder (γu), the skewnesses of the error terms (\( {\upgamma}_{e_m} \) and \( {\upgamma}_{e_y} \)), the magnitude of the confounding effects (dm and dy), and the sample size (n)

Next, we focus on the performance of the tests with respect to Cohen’s (1988) widely used 80% power criterion. Figures 3, 4, and 5 give the statistical power of the tests as a function of those factors that have the largest impact on test performance—that is, the skewness of the error terms and the magnitude of the confounding effects (across all levels of γu). Because the gHSIC1 test showed the lowest power, we focus only on the gHSICMd and bHSIC tests. Because, for n = 400, highly skewed error terms (i.e., skewnesses of 2.25) and large confounder effects are needed in order to achieve sufficient power, we focus on selected results for n = 800 (all remaining results are given in an online supplement). In general, observed power rates ≥ 80% were most prevalent for the gHSICMd test in the presence of a binary confounder (i.e., in 20 out of the 3 (\( {\upgamma}_{e_y} \)) × 3 (\( {\upgamma}_{e_m} \)) × 3 (dy) × 3 (dm) = 81 data scenarios—i.e., 24.7%; cf. Fig. 3). In particular, focusing on data scenarios in which the percent biases due to confounding can be expected to be large (i.e., when both confounder effects are large, or at least one confounder effect is large and the other one is medium-sized; see Table 1), confounders can be detected with sufficient power when the skewnesses of the error terms are ≥ 1.5. When ey is highly skewed (i.e., \( {\upgamma}_{e_y} \) = 2.25), confounders can be detected with power larger than 80% even when the skewness of em is small (\( {\upgamma}_{e_m} \) = 0.75). Figures 4 and 5 give the power of the gHSICMd and bHSIC tests when the unobserved confounder is continuous. Again, in the presence of influential confounding, the tests are able to detect unobserved confounders when both error terms are highly skewed or when at least one error term is highly skewed and the other is moderately skewed. Furthermore, for large confounder effects, the bHSIC test shows power values ≥ 80% even when em is moderately skewed (i.e., \( {\upgamma}_{e_m} \) = 1.5) and ey is only slightly asymmetric (i.e., \( {\upgamma}_{e_y} \) = 0.75).

Fig. 3

Statistical power of the gHSICMd test to detect a binary confounder for n = 800. Black squares correspond to an empirical power ≥ 80%

Fig. 4

Statistical power of the gHSICMd test to detect a continuous confounder for n = 800. Black squares correspond to an empirical power ≥ 80%

Fig. 5

Statistical power of the bHSIC test to detect a continuous confounder for n = 800. Black squares correspond to an empirical power ≥ 80%

Empirical example

To illustrate how the proposed method can minimize the risk of obtaining biased estimates in a real-world data example, we use data from the Job Search Intervention Study (JOBS II; cf. Vinokur, Price, & Schul, 1995). JOBS II is a randomized field study that evaluates the efficacy of a job-training intervention for unemployed workers. The subjects in the treatment condition participated in job search skills seminars, whereas the subjects in the control condition received a booklet describing job search tips. Vinokur et al. (1995) and Vinokur and Schul (1997) reported that the subjects in the treatment condition showed better employment and mental health outcomes, due to the subjects’ enhanced confidence in their job-searching skills. In a reanalysis, Imai, Keele, and Tingley (2010) found a small, albeit significant, negative mediation effect of the intervention on workers’ depressive symptoms through workers’ job search self-efficacy. The program increased perceived job search efficacy, which, in turn, decreased depressive symptoms. In the present study, we used the same data (n = 1,193; 373 in the control condition and 820 in the treatment condition) to evaluate the assumption of unconfoundedness inherent to the mediator (job search efficacy)–outcome (depression) part of the model. Both variables are measured on continuous scales. The job search self-efficacy composite measure is based on six items (each ranging from 1 = not at all confident to 5 = a great deal confident). The composite measure of depressive symptoms is based on an 11-item subscale (items ranging from 1 = not at all to 5 = extremely) of the Hopkins Symptom Checklist (Derogatis, Lipman, Rickels, Uhlenhuth, & Covi, 1974).
We considered the covariates sex (52.8% female), age (M = 36.8 years, SD = 10.6), race (19.4% non-White), marital status (31.4% never married, 45.1% married, 3.4% separated, 17.8% divorced, and 2.2% widowed), education (7.0% did not complete high school, 32.1% completed high school, 36.0% completed some college, 14.8% had 4 years of college, and 10.1% had > 4 years of college), income (20.7% < $15,000, 23.6% $15,000–$24,000, 24.1% $25,000–$39,000, 11.7% $40,000–$49,000, 20.0% $50,000+), economic hardship at baseline (M = 3.1, SD = 1.0), and occupation (17.6% professionals, 17.4% managerial, 23.6% clerical/kindred, 7.5% sales workers, 11.1% craftsmen/kindred, 11.6% operatives/kindred, 11.2% laborers/service workers). In addition, the mediation models were adjusted for depressive symptoms at baseline (M = 1.9, SD = 0.6), job-seeking efficacy prior to treatment (M = 3.7, SD = 0.8), level of anxiety prior to treatment (measured using 11 items ranging from 1 to 5, where higher scores indicate more severe anxious symptomatology; M = 1.9, SD = 0.7), and level of self-esteem (based on eight items from Rosenberg’s (1965) self-esteem scale ranging from 1 to 5, where higher scores indicate higher self-esteem; M = 4.1, SD = 0.7).

Overall, three different mediation models were estimated. All three models used treatment status as the predictor, job-seeking self-efficacy after 6 months as the mediator, and depressive symptoms after 6 months as the outcome. Nonparametric bootstrapping with 500 resamples was applied to evaluate the significance of the indirect effect. In Model I, we adjusted for demographic background information (i.e., age, gender, race, marital status, and education). In Model II, we additionally incorporated information related to the subjects’ financial situation (i.e., income, economic hardship at baseline, and occupation). In Model III, we added psychological background variables (i.e., depression at baseline, anxiety prior to treatment, and job-seeking self-efficacy prior to treatment). For each model (I–III), we regressed the treatment-/covariate-purified outcome on the treatment-/covariate-purified mediator and extracted the residuals for further analyses. The gHSICMd and bHSIC tests (based on 500 resamples and applying the median heuristic for bandwidth selection) were used to evaluate the independence of the (treatment-/covariate-purified) mediator and the corresponding residuals. A significant test result would indicate the presence of unmeasured confounders that hamper a causal interpretation of the obtained mediation effect.
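Assuming the purification steps are ordinary least squares regressions, the procedure just described can be sketched in Python. This is an illustrative re-implementation, not the authors' code (the original analyses were run in R); a permutation scheme stands in for the bootstrap resampling described in the text (both approximate the null distribution of the HSIC under independence), and all simulated effect sizes are assumptions.

```python
import numpy as np

def gram(v, bw=None):
    """Gaussian kernel Gram matrix; bw=None applies the median heuristic."""
    v = np.asarray(v, float)
    d2 = (v[:, None] - v[None, :]) ** 2
    if bw is None:
        off = np.sqrt(d2[d2 > 0])
        bw = np.median(off) if off.size else 1.0
    return np.exp(-d2 / (2 * bw ** 2))

def hsic_perm_test(x, y, n_perm=200, seed=1):
    """Biased HSIC estimate plus a permutation p-value for H0: x independent of y."""
    rng = np.random.default_rng(seed)
    n = len(x)
    H = np.eye(n) - 1.0 / n                    # centering matrix
    Kc = H @ gram(x) @ H                       # centered Gram matrix of x
    L = gram(y)
    stat = (Kc * L).sum() / n ** 2             # = trace(K H L H) / n^2
    null = []
    for _ in range(n_perm):                    # permute y to break dependence
        idx = rng.permutation(n)
        null.append((Kc * L[np.ix_(idx, idx)]).sum() / n ** 2)
    p = (1 + sum(s >= stat for s in null)) / (n_perm + 1)
    return stat, p

def residuals(y, X):
    """OLS residuals of y regressed on X (intercept included)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return y - X1 @ beta

# Simulated example (all parameter values are assumptions): a randomized
# treatment x, a skewed unobserved confounder u of the m -> y path, and
# skewed error terms, mirroring the simulation design in the text.
rng = np.random.default_rng(0)
n = 300
x = rng.binomial(1, 0.5, n).astype(float)
u = rng.lognormal(0.0, 1.0, n)                 # unobserved confounder
m = 0.5 * x + 1.0 * u + rng.lognormal(0.0, 1.0, n)
y = 0.4 * m + 1.0 * u + rng.lognormal(0.0, 1.0, n)

m_p = residuals(m, x[:, None])                 # treatment-purified mediator
y_p = residuals(y, x[:, None])                 # treatment-purified outcome
r_y = residuals(y_p, m_p[:, None])             # residuals of y_p on m_p
stat, p = hsic_perm_test(m_p, r_y)             # small p flags confounding
```

By construction, m_p and r_y are linearly uncorrelated, so a small p-value here reflects nonlinear dependence induced by the shared confounder component rather than a nonzero correlation.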

Table 3 gives the estimated regression parameters, indirect-effect estimates, measures of the skewness of the residuals, and the HSIC test results for the three models. In each mediation model, we observed a small, albeit significant, negative mediation effect (ranging from –0.029 to –0.020), which is in line with the previous results of Imai, Keele, and Yamamoto (2010). The estimated model residuals deviated from symmetry, as indicated by D’Agostino’s (1971) z values. The residual skewnesses of the mediator model range from –0.76 to –0.69, and those of the outcome model range from 1.05 to 1.34 (all ps < .001), and thus fulfill the distributional requirement of error nonnormality. The HSIC values systematically decrease from 0.651 (Model I) to 0.206 (Model III), indicating that the magnitude of error dependence decreases with every additional set of covariates. However, even after including demographic and financial background information (Model II), we obtain an HSIC of 0.488, and both HSIC tests reject the null hypothesis of independence. In other words, influential confounders may still be present. This is no longer the case, however, when adjusting for psychological background variables (Model III). Here, the HSIC drops to 0.206, and both HSIC tests retain the null hypothesis of independence. In other words, among the three candidate models, Model III is the one with the lowest risk of a biased indirect-effect estimate.
Although the three indirect-effect estimates are not remarkably different from each other (additionally adjusting for the financial and psychological background characteristics reduces the indirect effect by 0.007 points and leads to a slightly narrower confidence interval), and although all three models suggest that the treatment increased job-seeking self-efficacy, which in turn decreased depression, we now know that adjusting the mediation model for the demographic, financial, and psychological background variables leads to sufficiently independent errors according to the two HSIC tests, which is a fundamental requirement for interpreting the observed indirect effect as causal.

Table 3 Parameter estimates of three different mediation models to evaluate the impact of the intervention on depressive symptoms through job-seeking efficacy (t0 = baseline, t1 = prior to intervention, t2 = 6 months after intervention)

Discussion

In the present study, we introduced nonindependence properties of linearly uncorrelated nonnormal variables. These properties can be used to evaluate the unconfoundedness assumptions imposed on standard mediation models. We reviewed independence tests that detect nonindependence beyond first-order correlations and evaluated their performance in identifying confounding in the mediation setting using an extensive simulation study. It is important to note that the magnitude of confounding, as well as test performance in terms of Type I error and power rates, was evaluated using cutoff criteria that have previously been applied in the field of quantitative psychological research. However, cutoff criteria are not universal; rather, they depend on the research context. For example, statistical power of 80% may be sufficient for some research fields but not for others. Ultimately, only the researcher can judge the practical relevance of the statistical results.

Overall, the bootstrap (bHSIC) test with unit bandwidth and the gamma-approximated (gHSIC) test with a median-heuristic bandwidth perform equally well in terms of statistical power, whereas the former shows better Type I error control. However, because the gamma-approximated test is computationally more efficient than the bootstrap procedure,Footnote 2 the gHSICMd test is a reasonable compromise and performs well in various data scenarios. The gHSIC1 test should generally be avoided, due to distorted Type I error rates and low statistical power.

The proposed approach constitutes a diagnostic tool for the evaluation of the critical assumptions that are necessary to endow parameter estimates with causal meaning. However, both the proposed unconfoundedness tests and existing tests of exogeneity (e.g., Hausman-type procedures) must be applied with circumspection. Specifically, when applying these tests, one is usually not able to control the Type II error risk. In the present context, this implies that rival explanations exist for failing to reject the null hypothesis of independence: The null hypothesis might be retained because (1) one has adjusted for all influential confounders, and data requirements (i.e., the nonnormality of errors) for valid application of the independence tests have been fulfilled; (2) one has adjusted for all influential confounders, and data requirements have not been fulfilled; and (3) one has failed to adjust for all influential confounders, and data requirements have not been fulfilled. Carefully evaluating the necessary data requirements before applying HSIC tests is, thus, indispensable for arriving at meaningful results. Here, two factors are of particular importance for the present approach: the magnitude of the nonnormality of the error terms and the sample size.

Nonnormality of error terms

The proposed method assumes that the errors of the mediation model are nonnormally distributed (i.e., exhibit nonzero skewness and/or excess kurtosis). Although the present study focused on the case of skewed errors, the simulation results suggest that the distributional asymmetry of a continuous confounder does not affect the power of the tests (the asymmetry of a binary confounder slightly reduced the power to detect confounding). In contrast, the asymmetry of the error terms systematically increases the power to detect confounding and thus constitutes the most important prerequisite for valid application of the method. Although the present study focused on asymmetric error distributions, it is important to note that the approach can also be applied when errors are symmetric but nonnormal (i.e., distributions with nonzero excess kurtosis). The reason is that the reverse corollary of the DS theorem (see above) does not make any statements about the type of nonnormality, and nonindependence of the mediator and the outcome-specific error term will also hold for symmetric nonnormal errors. Evaluating the statistical power of the HSIC tests under these distributional scenarios remains a task for future work.

Various previous studies have repeatedly demonstrated that the normality assumption is often violated in real-world data (e.g., Blanca et al., 2013; Micceri, 1989). One theoretical explanation of why variables (and error terms) are likely to deviate from the normal distribution was given, for example, by Beale and Mallows (1959). The error term of any regression model usually captures factors outside the model in addition to measurement error. Even when the error term is defined as a mixture of independent and normally distributed variables with zero means, the resulting error will show elevated kurtosis whenever the involved component variables show unequal variances. Because unequal variances are likely to occur in practice, error terms are also likely to deviate from the normal distribution. Similarly, when normally distributed component variables occur in segments of the domain of the error (so-called constrained mixtures of distributions), the resulting distribution will be skewed when the mixing weights are unequal (cf. Miranda & von Zuben, 2015). Despite these theoretical justifications, it is important to keep in mind that not every form of error nonnormality makes variables eligible for testing unconfoundedness. For example, independence tests are likely to give biased results when error nonnormality emerges due to outliers or ceiling/floor effects. Thus, carefully evaluating the distributional properties of the estimated model residuals must be a part of applying unconfoundedness tests. Here, data visualizations (e.g., Seier & Bonett, 2011), omnibus normality tests (Jarque & Bera, 1980; Shapiro & Wilk, 1965), or more specific tests of skewness (cf. D’Agostino, 1971; Randles, Fligner, Policello, & Wolfe, 1980) and kurtosis (Anscombe & Glynn, 1983) are available to evaluate distributional requirements.
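Assuming a Python/SciPy analysis environment, the residual diagnostics named above can be applied to estimated model residuals as follows; the data-generating values are illustrative assumptions, not quantities from the study.

```python
import numpy as np
from scipy import stats

# Simulate a simple regression with a deliberately skewed error term
# (lognormal; an assumption chosen for illustration only).
rng = np.random.default_rng(0)
n = 800
x = rng.normal(size=n)
e = rng.lognormal(0.0, 0.8, n)               # skewed "true" error
y = 0.5 * x + e

# OLS fit and residual extraction.
X = np.column_stack([np.ones(n), x])         # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Diagnostics corresponding to the tests cited in the text.
skew_z, skew_p = stats.skewtest(resid)       # D'Agostino-type skewness test
kurt_z, kurt_p = stats.kurtosistest(resid)   # Anscombe-Glynn-type kurtosis test
omni_stat, omni_p = stats.normaltest(resid)  # omnibus normality test
jb_stat, jb_p = stats.jarque_bera(resid)     # Jarque-Bera test
sw_stat, sw_p = stats.shapiro(resid)         # Shapiro-Wilk test
```

Small p-values across these tests indicate nonnormal residuals, as required for the unconfoundedness tests; visual inspection (histograms, Q-Q plots) should accompany them to rule out nonnormality driven by outliers or ceiling/floor effects.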

Previous studies on mediation analysis under nonnormality have focused almost exclusively on quantifying potential biases in statistical inference (e.g., Kisbu-Sakarya, MacKinnon, & Miočević, 2014; Ng & Lin, 2016) and on the development of robust mediation models (e.g., Yuan & MacKinnon, 2014; Zhang, 2014; Zu & Yuan, 2010). Although the normality of errors constitutes an assumption that is routinely made in OLS regression modeling (White & MacDonald, 1980), it is important to note that normality is not needed for OLS estimates to be unbiased, consistent, and most efficient among all linear unbiased estimatorsFootnote 3 (Fox, 2008). As a consequence, extracted regression residuals are unbiased estimates of the “true” errors, and significance tests (such as the HSIC procedures) based on these residuals can be expected to give valid results under error nonnormality. Because nonnormality may affect the standard errors of OLS estimates, bootstrapping techniques can be used to guarantee valid statistical inference about the indirect effect (MacKinnon, Lockwood, & Williams, 2004). Thus, error nonnormality should not be prematurely dismissed as a mere source of bias. Instead, nonnormality may carry important information that can be used to gain deeper insight into the data-generating mechanism (cf. Wiedermann & von Eye, 2015).
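A minimal percentile-bootstrap sketch for the indirect effect, in the spirit of the MacKinnon, Lockwood, and Williams (2004) approach cited above; the data-generating values are assumptions for illustration and do not reproduce the JOBS II analysis.

```python
import numpy as np

def ols_coefs(y, X):
    """OLS coefficients (intercept first)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

def boot_indirect_ci(x, m, y, n_boot=500, seed=0):
    """Percentile bootstrap 95% CI for the indirect effect a*b."""
    rng = np.random.default_rng(seed)
    n, ab = len(x), []
    for _ in range(n_boot):
        i = rng.integers(0, n, n)                 # resample cases with replacement
        a = ols_coefs(m[i], x[i][:, None])[1]     # x -> m path
        b = ols_coefs(y[i], np.column_stack([m[i], x[i]]))[1]  # m -> y, adjusting for x
        ab.append(a * b)
    return np.percentile(ab, [2.5, 97.5])

# Simulated mediation data with skewed errors (assumed values);
# the true indirect effect is 0.5 * 0.4 = 0.2.
rng = np.random.default_rng(1)
n = 500
x = rng.binomial(1, 0.5, n).astype(float)
m = 0.5 * x + rng.lognormal(0.0, 0.8, n)
y = 0.4 * m + 0.2 * x + rng.lognormal(0.0, 0.8, n)
lo, hi = boot_indirect_ci(x, m, y)   # a CI excluding 0 supports mediation
```

Because the bootstrap resamples cases rather than assuming a particular error distribution, the resulting interval remains valid under the skewed errors simulated here.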

Required sample size

The present simulation results suggest that large sample sizes are preferable in order to achieve acceptable power to detect confounding. The present simulation study took a perspective that may be predominant in practice—that is, determining the sample size for a randomized controlled trial with a focus on the total effect of the intervention, and using mediation analysis to investigate a secondary hypothesis that evaluates why the intervention effect occurs. For this setup, the gHSICMd and bHSIC tests are reasonably powerful for those scenarios in which indirect-effect biases due to confounding can be considered substantial, as long as the error distributions are moderately or highly asymmetric. In other words, even when the existence of biases due to confounding is assumed or known and one asks questions concerning the extent (not the existence) of confounding, the HSIC tests are able to detect influential confounding.

When the evaluation of potential mediation mechanisms constitutes the primary study goal, larger sample sizes are usually needed. Fritz and MacKinnon (2007), for example, showed that the sample sizes needed for the detection of mediation effects can be extremely large, in particular when the direct effect of a model is close to zero. This result is less surprising when one considers the power characteristics of product coefficients. For example, making use of Cohen, Cohen, West, and Aiken’s (2003, p. 92) approach for computing power for regression coefficients, one concludes that a sample size of 2,636 would be required in order to detect an indirect effect when one of the paths involved in the indirect effect has a small effect size and the other path has a medium-sized effect.Footnote 4 Even when one path shows a large effect, a sample size of 1,153 is required when the effect of the other path is small. The presented simulation results suggest that the HSIC tests can be expected to perform reasonably well in such “large-scale” scenarios.

Power analysis tool

To estimate a priori the statistical power to detect confounding, we provide a Monte Carlo–based power tool (see the online supplement) implemented in R (R Core Team, 2019). For reasons of computational efficiency, power is estimated for the gHSICMd procedure. To estimate the required sample size a priori, one needs estimates of the direct and indirect effects and of the magnitude of the error asymmetry. The magnitude of the confounding effects is usually not known and can be treated as a sensitivity parameter when determining the minimum confounding scenarios for which a given statistical power—for example, ≥ 80%—can be attained. Ideally, parameter values for the direct/indirect effects and error asymmetries should be estimated using a representative sample from a pilot study. If pilot data are not available, parameter values may be retrieved from previous investigations. Whereas the parameter estimates involved in a mediation effect are regularly reported in practice, error skewness estimates may be obtained indirectly from sample descriptive statistics. For example, in the bivariate regression case, the skewness of y can be expressed as \( {\upgamma}_y={\uprho}_{my}^3{\upgamma}_m+{\left(1-{\uprho}_{my}^2\right)}^{3/2}{\upgamma}_{e_y} \), with ρmy being the Pearson correlation of m and y (cf. Dodge & Yadegari, 2010), and an estimate of the error skewness is available through \( {\upgamma}_{e_y}=\left({\upgamma}_y-{\uprho}_{my}^3{\upgamma}_m\right)/{\left(1-{\uprho}_{my}^2\right)}^{3/2} \). Alternatively, the magnitude of error asymmetry can also be treated as a sensitivity parameter. Of course, the power tool can also be used in a post-hoc fashion—that is, by analyzing the power of previous studies, one can obtain important information for planning a future study (cf. Hox, 2010).
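The skewness-recovery formula above is a one-liner in code. The round trip below uses assumed values for ρmy and the skewnesses (not quantities from the study) to check that the forward and inverse expressions are consistent.

```python
def error_skewness(gamma_y, gamma_m, rho):
    """Recover the outcome-error skewness from observable quantities:
    gamma_ey = (gamma_y - rho^3 * gamma_m) / (1 - rho^2)^(3/2)."""
    return (gamma_y - rho ** 3 * gamma_m) / (1 - rho ** 2) ** 1.5

# Round trip with assumed values: mediator skewness 1.0, true error
# skewness 1.5, and a mediator-outcome correlation of .4.
gamma_m, true_gamma_ey, rho = 1.0, 1.5, 0.4
gamma_y = rho ** 3 * gamma_m + (1 - rho ** 2) ** 1.5 * true_gamma_ey  # forward formula
recovered = error_skewness(gamma_y, gamma_m, rho)   # recovers 1.5
```

In practice, gamma_y, gamma_m, and rho would be replaced by sample estimates from a pilot study or from previously reported descriptive statistics.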

Research on mediation analysis has made tremendous progress by way of combining statistical modeling with the counterfactual framework of causation. From this line of research, it is known that, even in randomized experiments, strong assumptions are required in order to derive causal statements from mediation models. The presented method serves as a diagnostic tool to critically evaluate these assumptions and to reduce the risk of erroneous conclusions concerning the mechanisms behind interventions.