1 Introduction

In educational and behavioral research or related areas, researchers often want to answer the scientific question of whether a change in X produces a change in Y (i.e., a causal relationship from X to Y). In this case, X is referred to as the explanatory variable (or independent variable), and Y is referred to as the response variable (dependent variable or outcome variable). In this article, X is a Bernoulli random variable (one or zero), a discrete random variable, or a continuous random variable, and Y is a continuous random variable. The causal relationship is often denoted by \(X \rightarrow Y\), and the direction of the arrow matters. If a researcher controls the change in X, the study is called an experimental study; otherwise, it is called an observational study. Caveats and challenges of causal inference from an observational study have been widely discussed in various disciplinary areas (Glass et al. 2013; Kang 2014; Rohrer 2018; Adams et al. 2019).

In mediation analysis (Baron and Kenny 1986; Hayes 2013), suppose there are two causal paths from X to Y. The first path is \(X \rightarrow M \rightarrow Y\), where M is referred to as the mediating variable (mediator or intermediate variable). The second path is \(X \rightarrow Y\) not through M. This mediation model is graphically illustrated in Fig. 1; it is the simplest mediation model presented by Hayes (2013), a work cited more than 22,000 times as of this writing. The first path is often referred to as the indirect effect, and the second path is often referred to as the direct effect. Hayes (2013) presented more complex mediation models than the one shown in Fig. 1.

Fig. 1 The simplest mediation model presented by Hayes (2013)

Since Hayes (2013) developed the PROCESS macro for the statistical software SPSS, many researchers in the social sciences have applied it to test complex causal paths, and mediation models remain popular (Caniëls 2019; Garcia et al. 2018; Seli et al. 2017; Zhang et al. 2019; Emery et al. 2019; Zhu et al. 2019; Villaluz and Hechanova 2018; Manuti and Giancaspro 2019). Some of these studies were observational, and some were experimental. For instance, Zhang et al. (2019) asked subjects to report their values of two explanatory variables (workplace ostracism and leader-member exchange), whereas Seli et al. (2017) randomly assigned subjects to either an experimental group (manipulating motivation) or a control group (no motivation). The term “mediation” implies a direction of relationship, yet many researchers have used the mediation model (or a more complex mediation model) with data collected in an observational study (i.e., no randomization of X). In such observational studies, the direction of the causal relationship has typically been justified by common sense or an accepted theory.

The purpose of this article is not to discourage the use of a mediation model. The purpose is to remind researchers of the impact of an unobserved “precursor variable” (denoted by W in this article) on the probability of concluding the presence of an indirect effect and/or a direct effect. We use the term “precursor variable” to indicate that W precedes X, M, and Y in a causal relationship, as shown in Fig. 2. The necessity of randomization (i.e., controlling X) has been accepted in scientific communities since it was first advocated by Ronald A. Fisher (Fisher 1925; Hall 2007). However, it has been shown that randomization does not completely remove bias in certain mediation analyses. VanderWeele (2010) and Imai et al. (2010a, 2010b) presented bias formulas under regression models, and various methods of sensitivity analysis have been discussed (Hong et al. 2015, 2018; VanderWeele 2015).

Fig. 2 The simplest mediation model presented by Hayes (2013) with a precursor variable W

In this article, we review the basic regression-based mediation models (Figs. 1, 2; Hayes 2013), present formulas in terms of the regression parameters that quantify the bias in the estimation of the indirect effect and the direct effect when W is omitted, and use simulations to demonstrate the consequence (an inflated Type I error rate). We assume that readers of this article have background knowledge of multiple linear regression models and basic results in mathematical statistics, including \(Cov( X, X ) = V(X)\) and

$$\begin{aligned} Cov \left( \sum _{i=1}^n a_i X_i , \sum _{j=1}^m b_j Y_j \right) = \sum _{i=1}^n \sum _{j=1}^m a_i b_j Cov( X_i, Y_j) \end{aligned}$$

where an uppercase letter denotes a random variable and a lowercase letter denotes a constant real number. In Sect. 2, we consider a simple case when the causal relationship from X to Y does not involve a mediator M, and a formula will demonstrate that randomization of X is enough to remove bias in this simple case. In Sect. 3, we consider a case when the causal relationship involves a mediator M, and another formula will clearly demonstrate that randomization of X is not enough to remove bias in this more complicated case. In Sect. 4, we present simulation results which demonstrate seriously inflated Type I error rates even in an experimental study. In Sect. 5, a numerical example is provided based on the data collected in the United States and previously analyzed by Guber (1999).

2 Causal relationship without a mediator

Let X denote an explanatory variable of interest (observed), Y denote a response variable of interest (observed), and W denote an omitted (unobserved) precursor variable which may affect X and/or Y (see Fig. 3). Suppose a researcher assumes the simple linear model

$$\begin{aligned} Y = b_0 + b_1 X + \epsilon \, , \end{aligned}$$

and suppose the true relationship among the three random variables, W, X, and Y, is given by the two linear models

$$\begin{aligned} X & = \gamma _0 + \gamma _1 W + \epsilon _1^* \, , \\ Y &= \beta _0 + \beta _1 X + \beta _2 W + \epsilon _2^* \, . \end{aligned}$$

Figure 3 graphically illustrates this situation. If we let \(\sigma _W^2 = V(W)\) and \(\sigma _i^2 = V(\epsilon _i^*)\) for \(i=1,2\), then the variances V(X) and V(Y) can be expressed in terms of \(\sigma _W^2\), \(\sigma _1^2\), \(\sigma _2^2\), and the regression parameters.
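For completeness, the explicit expressions are as follows (this restatement is ours and follows directly from the two true models, assuming W, \(\epsilon _1^*\), and \(\epsilon _2^*\) are mutually uncorrelated, as in the rest of the article):

$$\begin{aligned} V(X) = \gamma _1^2 \sigma _W^2 + \sigma _1^2 \, , \qquad V(Y) = ( \beta _1 \gamma _1 + \beta _2 )^2 \sigma _W^2 + \beta _1^2 \sigma _1^2 + \sigma _2^2 \, . \end{aligned}$$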

Fig. 3 The true relationship among W, X, and Y (top) and the assumed relationship without W (bottom)

Here our goal is to express \(b_1\) (the quantity estimated by the researcher) in terms of \(\beta _1\), \(\beta _2\), \(\gamma _1\), \(\sigma _W\), and \(\sigma _1\). Under the simple linear model assumed by the researcher,

$$\begin{aligned} Cov(X, Y) = Cov( X, b_0 + b_1 X + \epsilon ) = b_1 V(X) \, , \end{aligned}$$

so the estimand \(b_1\) can be expressed as

$$\begin{aligned} b_1 = \frac{Cov(X, Y)}{V(X)} \, . \end{aligned}$$

Under the true relationship,

$$\begin{aligned} Cov(X, Y) &= Cov( X, \beta _0 + \beta _1 X + \beta _2 W + \epsilon _2^* ) \\ &= \beta _1 V(X) + \beta _2 Cov( X, W ) \\ & = \beta _1 V(X) + \beta _2 Cov( \gamma _0 + \gamma _1 W + \epsilon _1^*, W ) \\ & = \beta _1 V(X) + \beta _2 \gamma _1 V(W) \, . \end{aligned}$$

Therefore, the researcher eventually estimates a complex quantity

$$\begin{aligned} b_1 = \frac{\beta _1 V(X) + \beta _2 \gamma _1 V(W)}{V(X)} = \beta _1 + \beta _2 \gamma _1 \left( \frac{V(W)}{V(X)} \right) \, , \end{aligned}$$

where \(V(W) = \sigma _W^2\) and \(V(X) = \gamma _1^2 \sigma _W^2 + \sigma _1^2\). The researcher can achieve \(b_1 = \beta _1\) by randomization of X (i.e., forcing \(\gamma _1 = 0\)). In an observational study (i.e., \(\gamma _1 \ne 0\)), the researcher estimates \(b_1 = \beta _1\) only when \(\beta _2 = 0\) (i.e., W does not affect Y), but this condition is out of the researcher’s control.
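The following short simulation (our own sketch, not part of the original analysis; all parameter values are illustrative) checks the omitted-variable bias formula above: the slope of the simple regression of Y on X converges to \(\beta _1 + \beta _2 \gamma _1 \sigma _W^2 / (\gamma _1^2 \sigma _W^2 + \sigma _1^2)\) rather than to \(\beta _1\).

```python
# Minimal sketch: numerical check of b1 = beta1 + beta2*gamma1*V(W)/V(X)
# when the precursor W is omitted. Parameter values are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
beta1, beta2, gamma1 = 1.0, 2.0, 0.5          # true effects (observational study: gamma1 != 0)
sigma_w, sigma1, sigma2 = 1.0, 1.0, 1.0       # standard deviations of W, eps1*, eps2*

w = rng.normal(0.0, sigma_w, n)               # unobserved precursor
x = gamma1 * w + rng.normal(0.0, sigma1, n)   # true model for X
y = beta1 * x + beta2 * w + rng.normal(0.0, sigma2, n)  # true model for Y

b1_hat = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # slope of simple regression of Y on X
b1_formula = beta1 + beta2 * gamma1 * sigma_w**2 / (gamma1**2 * sigma_w**2 + sigma1**2)
print(b1_hat, b1_formula)                         # both close to 1.8, not beta1 = 1.0
```

Setting gamma1 = 0 (randomized X) in the same sketch makes both quantities collapse to \(\beta _1\).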

3 Causal relationship with a mediator

Consider a more complex case when M is a mediating variable (observed) in the causal path from X to Y. Suppose a researcher assumes the mediation model (Hayes 2013) with the following two linear models:

$$\begin{aligned} M & = a_0 + a_1 X + \epsilon _1 \, , \nonumber \\ Y &= b_0 + b_1 X + b_2 M + \epsilon _2 \, . \end{aligned}$$
(1)

Suppose the true relationship among the four random variables, W, X, M, and Y, is given by the three linear models

$$\begin{aligned} X & = \gamma _0 + \gamma _1 W + \epsilon _1^* \, , \nonumber \\ M &= \alpha _0 + \alpha _1 X + \alpha _2 W + \epsilon _2^* \, , \nonumber \\ Y &= \beta _0 + \beta _1 X + \beta _2 M + \beta _3 W + \epsilon _3^* \, . \end{aligned}$$
(2)

Figure 4 graphically illustrates this scenario. Let \(\sigma _W^2 = V(W)\) and \(\sigma _i^2 = V(\epsilon _i^* )\).

Under the assumed model of Fig. 4 (bottom), \(a_1 b_2\) quantifies the indirect effect of X on Y (through M), and \(b_1\) quantifies the direct effect (not through M, but possibly through other mediators). Our goal is to express \(a_1 b_2\) and \(b_1\) in terms of the model parameters in the true relationship among W, X, M, and Y which are denoted by the Greek letters in Fig. 4 (top).

Fig. 4 The true relationship among W, X, M, and Y (top) and the assumed relationship without W (bottom)

3.1 The impact of an unobserved precursor on indirect effect

For the indirect effect \(a_1 b_2\), we first note that \(a_1\) and \(b_2\) quantify the relationships among X, M, and Y as follows:

$$\begin{aligned} a_1 & = \frac{Cov(X,M)}{V(X)} \, , \\ b_2 &= \frac{V(X) Cov(M,Y) - Cov(X,Y) Cov(X,M) }{V(X) V(M) - [Cov(X,M)]^2} \, . \end{aligned}$$

Note that \(a_1\) and \(b_2\) can be expressed as

$$\begin{aligned} a_1 &= \alpha _1 + \alpha _2 \gamma _1 \left( \frac{ \sigma _W^2 }{ \gamma _1^2 \sigma _W^2 + \sigma _1^2 } \right) \, , \\ b_2 &= \beta _2 + \alpha _2 \beta _3 \left( \frac{ \sigma _1^2 \sigma _W^2 }{ \alpha _2^2 \sigma _1^2 \sigma _W^2 + (\gamma _1^2 \sigma _W^2 + \sigma _1^2) \sigma _2^2 } \right) \, . \end{aligned}$$

Consequently, the researcher’s estimand \(a_1 b_2\) for the indirect effect becomes the very complex quantity

$$\begin{aligned} a_1 b_2 = \left( \alpha _1 + \alpha _2 \gamma _1 \left( \frac{ \sigma _W^2 }{ \gamma _1^2 \sigma _W^2 + \sigma _1^2 } \right) \right) \left( \beta _2 + \alpha _2 \beta _3 \left( \frac{ \sigma _1^2 \sigma _W^2 }{ \alpha _2^2 \sigma _1^2 \sigma _W^2 + (\gamma _1^2 \sigma _W^2 + \sigma _1^2) \sigma _2^2 } \right) \right) \, . \end{aligned}$$
(3)

Appendix 1 provides a detailed explanation of the derivation of Eq. (3).

There are two cases in which \(a_1 b_2 = \alpha _1 \beta _2\). The first case is when the precursor W does not affect the mediator M (i.e., \(\alpha _2 = 0\)). The second case is when W affects neither the explanatory variable X nor the response variable Y (i.e., \(\gamma _1 = \beta _3 = 0\)). Randomization of X (i.e., \(\gamma _1 = 0\)) alone guarantees neither case, so it is not sufficient to avoid the bias. In particular, when \(\alpha _1 \beta _2 = 0\) is true with \(\alpha _1 \ne 0\) and \(\beta _2 = 0\), researchers will estimate \(a_1 b_2 \ne 0\), which leads to an inflated Type I error rate.
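The small helper below (our own sketch; the parameter values are illustrative and mirror one of the simulation scenarios in Sect. 4) evaluates Eq. (3) with X randomized. Even with \(\gamma _1 = 0\), the estimand is \(-0.25\) although the true indirect effect \(\alpha _1 \beta _2\) is zero.

```python
# Minimal sketch: plug illustrative parameter values into Eq. (3).
def indirect_estimand(alpha1, alpha2, beta2, beta3, gamma1, sigma_w, sigma1, sigma2):
    """Researcher's estimand a1*b2 in the absence of W (Eq. (3))."""
    vx = gamma1**2 * sigma_w**2 + sigma1**2                      # V(X)
    a1 = alpha1 + alpha2 * gamma1 * sigma_w**2 / vx
    b2 = beta2 + alpha2 * beta3 * sigma1**2 * sigma_w**2 / (
        alpha2**2 * sigma1**2 * sigma_w**2 + vx * sigma2**2)
    return a1 * b2

# alpha1 != 0, beta2 = 0 (no true indirect effect), X randomized (gamma1 = 0),
# but W affects both M and Y (alpha2 and beta3 nonzero):
print(indirect_estimand(alpha1=1, alpha2=-1, beta2=0, beta3=0.5,
                        gamma1=0, sigma_w=5, sigma1=1, sigma2=5))    # -0.25
```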

3.2 The impact on direct effect

For the direct effect \(b_1\), it can be similarly shown that

$$\begin{aligned} b_1 = \frac{V(M) Cov(X,Y) - Cov(M,Y) Cov(X,M) }{V(X) V(M) - [Cov(X,M)]^2} \, . \end{aligned}$$

In the true causal relationship, which starts from the precursor W, it can be shown that

$$\begin{aligned} b_1 = \beta _1 + \beta _3 \left( \frac{ \gamma _1 \sigma _2^2 \sigma _W^2 - \alpha _1 \alpha _2 \sigma _1^2 \sigma _W^2 }{ \alpha _2^2 \sigma _1^2 \sigma _W^2 + ( \gamma _1^2 \sigma _W^2 + \sigma _1^2 ) \sigma _2^2 } \right) \, . \end{aligned}$$
(4)

Appendix 2 provides a detailed explanation of the derivation of Eq. (4).

There are four cases in which \(b_1 = \beta _1\). The first case is when W does not affect Y (i.e., \(\beta _3 = 0\)). The second case is when \(\gamma _1 = \alpha _1 = 0\), and the third case is when \(\gamma _1 = \alpha _2 = 0\). The fourth case is \(\gamma _1 \sigma _2^2 \sigma _W^2 = \alpha _1 \alpha _2 \sigma _1^2 \sigma _W^2\), which is nearly uninterpretable. Again, randomization of X (i.e., \(\gamma _1 = 0\)) alone does not guarantee any of the four cases, so it does not guarantee \(b_1 = \beta _1\). In particular, when \(\beta _1 = 0\) is true, researchers will estimate \(b_1 \ne 0\), which leads to an inflated Type I error rate. Furthermore, if \(\gamma _1 = 0\) and \(\beta _1\), \(\alpha _1\), \(\alpha _2\), and \(\beta _3\) all have the same sign, then depending on their magnitudes (and on the magnitudes of \(\sigma _W\) and \(\sigma _2\)), \(b_1\) and \(\beta _1\) may have opposite signs, which leads to an awkward conclusion.
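A companion helper (again our own illustration, using the same assumed parameter values as the sketch for Eq. (3)) evaluates Eq. (4) under randomization of X: the estimand is \(0.25\) although the true direct effect \(\beta _1\) is zero.

```python
# Minimal sketch: plug illustrative parameter values into Eq. (4).
def direct_estimand(alpha1, alpha2, beta1, beta3, gamma1, sigma_w, sigma1, sigma2):
    """Researcher's estimand b1 in the absence of W (Eq. (4))."""
    vx = gamma1**2 * sigma_w**2 + sigma1**2                      # V(X)
    num = gamma1 * sigma2**2 * sigma_w**2 - alpha1 * alpha2 * sigma1**2 * sigma_w**2
    den = alpha2**2 * sigma1**2 * sigma_w**2 + vx * sigma2**2
    return beta1 + beta3 * num / den

# beta1 = 0 (no true direct effect), X randomized (gamma1 = 0):
print(direct_estimand(alpha1=1, alpha2=-1, beta1=0, beta3=0.5,
                      gamma1=0, sigma_w=5, sigma1=1, sigma2=5))      # 0.25
```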

4 Simulation study

4.1 Simulation designs

To demonstrate the danger of mediation analysis due to an omitted precursor W even with manipulation of X (i.e., \(\gamma _1 = 0\)), a simulation study was conducted. For all simulation scenarios, we fixed \(\beta _1 = 0\), \(\beta _2 = 0\), \(\sigma _W = 5\), \(\sigma _1 = 1\), \(\sigma _2 = 5\), and \(\sigma _3 = 0.5\) and varied \(\alpha _1 = 0, 0.5, 1\), \(\alpha _2 = -1, -0.5, 0\), and \(\beta _3 = -0.5, 0, 0.5\) to create twenty-seven scenarios with \(\beta _1 = 0\) (no direct effect) and \(\alpha _1 \beta _2 = 0\) (no indirect effect). For each scenario, we considered sample sizes \(n = 50, 100, 500, 1000\), and we estimated (1) the probability of concluding the presence of an indirect effect (i.e., \(a_1 b_2 \ne 0\)) and (2) the probability of concluding the presence of a direct effect (i.e., \(b_1 \ne 0\)) based on bias-adjusted 95% confidence intervals (CIs) with 2000 bootstrap samples. This nonparametric bootstrap method is known to be robust and is recommended for mediation analysis in practice (Hayes 2009; Preacher and Hayes 2008; Hair et al. 2014). Each scenario was replicated 1000 times: out of the 1000 replications per scenario, we calculated the proportion of times that a 95% CI for \(a_1 b_2\) excludes zero and the proportion of times that a 95% CI for \(b_1\) excludes zero.
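The sketch below (our own reconstruction of the procedure described above, not the authors’ code) runs one scenario: X is randomized, W is generated but never given to the analyst, the assumed model of Eq. (1) is fitted by ordinary least squares, and bias-corrected percentile bootstrap CIs are formed. Intercepts in the data-generating step are set to zero for simplicity, and the scenario values \(\alpha _1 = 1\), \(\alpha _2 = -1\), \(\beta _3 = 0.5\) are one of the twenty-seven combinations.

```python
# Minimal sketch of one simulation scenario: X randomized (gamma1 = 0), W unobserved.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def simulate_data(n, alpha1=1.0, alpha2=-1.0, beta1=0.0, beta2=0.0, beta3=0.5,
                  sigma_w=5.0, sigma1=1.0, sigma2=5.0, sigma3=0.5):
    """Generate (X, M, Y) from the true model of Eq. (2) with gamma1 = 0 and zero intercepts."""
    w = rng.normal(0.0, sigma_w, n)                       # unobserved precursor
    x = rng.normal(0.0, sigma1, n)                        # randomized X
    m = alpha1 * x + alpha2 * w + rng.normal(0.0, sigma2, n)
    y = beta1 * x + beta2 * m + beta3 * w + rng.normal(0.0, sigma3, n)
    return x, m, y

def fit_mediation(x, m, y):
    """OLS fit of the assumed model in Eq. (1); returns (indirect a1*b2, direct b1)."""
    a1 = np.polyfit(x, m, 1)[0]
    coef, *_ = np.linalg.lstsq(np.column_stack([np.ones_like(x), x, m]), y, rcond=None)
    b1, b2 = coef[1], coef[2]
    return a1 * b2, b1

def bc_bootstrap_ci(x, m, y, n_boot=2000, level=0.95):
    """Bias-corrected percentile bootstrap CIs for (a1*b2, b1)."""
    n = len(x)
    est = np.array(fit_mediation(x, m, y))
    boot = np.empty((n_boot, 2))
    for b in range(n_boot):
        idx = rng.integers(0, n, n)
        boot[b] = fit_mediation(x[idx], m[idx], y[idx])
    z0 = norm.ppf((boot < est).mean(axis=0))              # bias-correction factor
    z = norm.ppf(0.5 + level / 2)
    lo, hi = norm.cdf(2 * z0 - z), norm.cdf(2 * z0 + z)   # adjusted percentile levels
    return [(np.quantile(boot[:, j], lo[j]), np.quantile(boot[:, j], hi[j])) for j in range(2)]

# One replication with n = 500: both true effects are zero, yet the CIs tend to exclude zero.
x, m, y = simulate_data(n=500)
(ind_lo, ind_hi), (dir_lo, dir_hi) = bc_bootstrap_ci(x, m, y)
print(f"indirect effect CI: ({ind_lo:.3f}, {ind_hi:.3f})")   # roughly centered near -0.25
print(f"direct effect CI:   ({dir_lo:.3f}, {dir_hi:.3f})")   # roughly centered near  0.25
```

Repeating this over many replications and counting how often each CI excludes zero gives estimates of the probabilities reported for each scenario.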

4.2 Simulation results

The simulation results are summarized in Table 1. In seven scenarios (10, 12, 15, 19, 21, 22, and 24), the probability of concluding \(b_1 \ne 0\) and the probability of concluding \(a_1 b_2 \ne 0\) were substantially greater than 0.05 when \(n = 50\), and the probabilities increased as n increased. The bias-adjusted method accurately estimated \(b_1\) and \(a_1 b_2\) as given by Eqs. (4) and (3) (see Table 2), so there was a higher chance of rejecting \(b_1 = 0\) and \(a_1 b_2 = 0\) as n increased (i.e., shorter CIs around the values of \(b_1\) and \(a_1 b_2\) implied by Eqs. (4) and (3)). For the other twenty scenarios, where \(b_1 = 0\) and \(a_1 b_2 = 0\) according to Eqs. (4) and (3), the probabilities of concluding \(b_1 \ne 0\) and \(a_1 b_2 \ne 0\) were close to 0.05. In some of these twenty scenarios, the probabilities slightly exceeded 0.05 when \(n = 50\), which implies that the bootstrap method may require a sample size larger than \(n = 50\) in some cases to properly estimate the uncertainty in the interval estimation.

Table 1 The probability of concluding the presence of a direct effect (\(b_1 \ne 0\)) and the presence of an indirect effect (\(a_1 b_2 \ne 0\)) in the absence of W, based on bias-adjusted 95% CIs with 2000 bootstrap samples (1000 replications per scenario)
Table 2 The mean of the sampling distribution of bootstrap estimates of the direct effect (\(b_1\)) and the indirect effect (\(a_1 b_2\)), based on 2000 bootstrap samples (1000 replications per scenario)

5 Example

Guber (1999) discussed the impact of omitting a (confounding) variable in the association between public school expenditures and academic performance (measured by the average SAT score). In the data, an individual is a state (not a student), and the data consist of all 50 states in the United States.

In this section, to demonstrate the role of a potential precursor variable, we turn our focus on the association between the state average annual salary of teachers in public schools (denoted X; in thousands of US dollars) and the state average SAT score (denoted Y). If we consider the simple linear regression \(Y = c_0 + c_1 X + \epsilon\), the estimated slope is \({\hat{c}}_1 = -5.5396\). This result suggests a higher average annual salary is associated with a lower average SAT performance (\(p = 0.001\)).

An important (potential mediating) variable may be the percentage (%) of students taking the SAT in each state (denoted by M). Suppose we model \(M = a_0 + a_1 X + \epsilon _1\) and \(Y = b_0 + b_1 X + b_2 M + \epsilon _2\) as in Eq. (1). The estimated regression parameters are \({\hat{a}}_1 = 2.7783\), \({\hat{b}}_1 = 2.1804\), and \({\hat{b}}_2 = -2.7787\). Conditioning on the potential mediating variable (% SAT takers), it appears that a higher average annual salary is associated with a higher average SAT performance (\(p = 0.039\)). The opposite signs of \({\hat{c}}_1 = -5.54\) and \({\hat{b}}_1 = 2.18\) are due to the fact that \({\hat{c}}_1 = {\hat{b}}_1 + {\hat{a}}_1 {\hat{b}}_2\), where \({\hat{a}}_1 > 0\) and \({\hat{b}}_2 < 0\): more SAT takers tend to lower the state average SAT score (Guber 1999).

Given the adjusted estimate \({\hat{b}}_1 = 2.18\) with the small p value (\(p = 0.039\)), can we conclude that the state average annual salary and the state average SAT performance are positively associated? Let us consider a potential precursor variable, the state expenditure per student in public schools (denoted W; in thousands of US dollars), as W may affect X, M, and Y. Using the three regression models presented in Eq. (2), we can estimate the regression parameters as shown at the bottom of Fig. 5, with \({\hat{\sigma }}_W^2 = 1.8201\), \({\hat{\sigma }}_1^2 = 8.4214\), and \({\hat{\sigma }}_2^2 = 425.7959\). Note that \({\hat{\beta }}_1 = -0.31\) with \(p = 0.853\) suggests no strong evidence for a positive association. The opposite signs of \({\hat{b}}_1 = 2.18\) and \({\hat{\beta }}_1 = -0.31\) can be explained by Eq. (4) with the estimated regression parameters,

$$\begin{aligned} {\hat{b}}_1 = {\hat{\beta }}_1 + {\hat{\beta }}_3 \left( \frac{ {\hat{\gamma }}_1 {\hat{\sigma }}_2^2 {\hat{\sigma }}_W^2 - {\hat{\alpha }}_1 {\hat{\alpha }}_2 {\hat{\sigma }}_1^2 {\hat{\sigma }}_W^2 }{ {\hat{\alpha }}_2^2 {\hat{\sigma }}_1^2 {\hat{\sigma }}_W^2 + ( {\hat{\gamma }}_1^2 {\hat{\sigma }}_W^2 + {\hat{\sigma }}_1^2 ) {\hat{\sigma }}_2^2 } \right) \, . \end{aligned}$$

The relatively large positive estimates \({\hat{\beta }}_3 = 13.33\), \({\hat{\gamma }}_1 = 3.79\), and \({\hat{\sigma }}_2^2 = 425.8\) explain how omitting the precursor variable (state expenditure), which may affect the whole mechanism among the state average annual salary of teachers, the percentage of students taking the SAT, and the state SAT performance, can shift the nearly zero (slightly negative) \({\hat{\beta }}_1 = -0.31\) to the positive \({\hat{b}}_1 = 2.18\) with a small p value. The estimated indirect association \({\hat{\alpha }}_1 {\hat{\beta }}_2 = -5.32\) further suggests that a higher state average salary of teachers does not help improve the average SAT performance.
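For readers who wish to reproduce this three-step analysis, the sketch below outlines the regressions using statsmodels. We assume the Guber (1999) state-level data are available in a CSV file; the file name and the column names `salary`, `takers`, `expend`, and `sat` are our placeholders and may differ in a given copy of the data.

```python
# Hedged sketch of the example's three-step analysis (placeholder file/column names).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("guber_sat.csv")   # 50 states; salary (X), takers (M), expend (W), sat (Y)

# Step 1: simple regression Y = c0 + c1*X (c1 estimated near -5.54)
print(smf.ols("sat ~ salary", data=df).fit().params)

# Step 2: assumed mediation model of Eq. (1), omitting the precursor
print(smf.ols("takers ~ salary", data=df).fit().params)                 # a1
print(smf.ols("sat ~ salary + takers", data=df).fit().params)           # b1, b2

# Step 3: models of Eq. (2) including the precursor W = expenditure per student
print(smf.ols("salary ~ expend", data=df).fit().params)                 # gamma1
print(smf.ols("takers ~ salary + expend", data=df).fit().params)        # alpha1, alpha2
print(smf.ols("sat ~ salary + takers + expend", data=df).fit().params)  # beta1, beta2, beta3
```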

Fig. 5 Changing the direction and magnitude of the association between the state average salary (explanatory) and the state average SAT score (response) by including % SAT takers (mediator) and state expenditure (precursor) in the analysis. (This is a state-level association, not a student-level one.)

6 Discussion

The primary focus of the simulation study was an inflated Type I error rate (i.e., a higher chance of falsely claiming a direct and/or indirect effect) in a mediation analysis due to an unobserved precursor variable. Though the example in Sect. 5 was an observational study, the message was similar. Even for an experimental study, the simulation results alert researchers to the inflated Type I error rate (worse when n is larger), and Eqs. (3) and (4) explain why the bias exists even after randomization of X (i.e., forcing \(\gamma _1 = 0\)) in the presence of a mediator (Fig. 4). The uncomfortable fact is that an experimental design, which sets \(V(X) = \gamma _1^2 \sigma _W^2 + \sigma _1^2 = \sigma _1^2\), cannot remove the bias in particular null cases. For instance, if \(\alpha _1 \ne 0\) and \(\beta _2 = 0\) (i.e., zero indirect effect), according to Eq. (3),

$$\begin{aligned} a_1 b_2 = \alpha _1 \alpha _2 \beta _3 \left( \frac{\sigma _1^2 \sigma _W^2}{\alpha _2^2 \sigma _1^2 \sigma _W^2 + \sigma _1^2 \sigma _2^2} \right) = \alpha _1 \alpha _2 \beta _3 \left( \frac{\sigma _W^2}{\alpha _2^2 \sigma _W^2 + \sigma _2^2} \right) \end{aligned}$$

which does not depend on \(V(X) = \sigma _1^2\). Similarly, if \(\beta _1 = 0\) (i.e., zero direct effect), according to Eq. (4),

$$\begin{aligned} b_1 = \beta _3 \left( \frac{ \gamma _1 \sigma _2^2 \sigma _W^2 - \alpha _1 \alpha _2 \sigma _1^2 \sigma _W^2 }{ \alpha _2^2 \sigma _1^2 \sigma _W^2 + ( \gamma _1^2 \sigma _W^2 + \sigma _1^2 ) \sigma _2^2 } \right) = -\beta _3 \left( \frac{\alpha _1 \alpha _2 \sigma _W^2}{\alpha _2^2 \sigma _W^2 + \sigma _2^2} \right) \end{aligned}$$

which again does not depend on \(V(X) = \sigma _1^2\).

In this article, we focused on one precursor variable; in practice, there can be two or more precursor variables. The take-home messages for reducing bias in the estimation of \(\alpha _1 \beta _2\) (indirect effect) and \(\beta _1\) (direct effect) are clear. First, randomize X (i.e., \(\gamma _1 = 0\)) if possible. Second, during data collection, researchers should record variables that are potentially related to M and Y and adjust for them in the regression analysis. A large sample size can tolerate a mild degree of over-fitting due to adjusting for many W’s in the model. The adjustment will help researchers estimate \(\alpha _1 \beta _2\) and \(\beta _1\) with a small bias and perform hypothesis testing with a reduced inflation of the Type I error rate.