# Regression-based mediation analysis: a formula for the bias due to an unobserved precursor variable

## Abstract

Researchers often want to know whether a change in an explanatory variable X causes a change in a response variable Y. In practice, there can be two causal paths from X to Y: the path through a mediating variable M (the indirect effect) and the path not through M (the direct effect). Parameter estimation and hypothesis testing can be performed with a regression-based mediation model. It is known that randomization of X is not sufficient for unbiased estimation, and the bias due to an unobserved variable has been discussed in the literature but is often overlooked. In this article, we first review the challenge under a simple mediation model, and then provide a formula for the exact bias due to an unobserved precursor variable W, a variable which potentially causes changes in X, M, and/or Y. We present simulation studies to demonstrate the impact of an unobserved precursor variable on hypothesis testing for the indirect effect and the direct effect. The simulation results show that the inflation of the Type I error rate is serious, particularly in a large-sample study. To numerically demonstrate the formula for the exact bias, a popular data set published in a journal of statistics education is revisited, and we quantify why the conclusion of the data analysis differs before and after accounting for the precursor variable. The results underscore the importance of a precursor variable in mediation analysis.

## Introduction

In educational and behavioral research and related areas, researchers often want to answer the scientific question of whether the change in X affects the change in Y (i.e., a causal relationship from X to Y). In this case, X is referred to as the explanatory variable (or independent variable), and Y is referred to as the response variable (dependent variable or outcome variable). In this article, X is a Bernoulli random variable (one or zero), a discrete random variable, or a continuous random variable, and Y is a continuous random variable. The causal relationship is often denoted by $$X \rightarrow Y$$, and the direction of the arrow matters. If a researcher controls the change in X, it is called an experimental study. Otherwise, it is called an observational study. Caveats and challenges of causal inference through an observational study have been widely discussed in various disciplinary areas (Glass et al. 2013; Kang 2014; Rohrer 2018; Adams et al. 2019).

In mediation analysis (Baron and Kenny 1986; Hayes 2013), suppose there are two causal paths from X to Y. The first path is $$X \rightarrow M \rightarrow Y$$, and M is referred to as the mediating variable (mediator or intermediate variable). The second path is $$X \rightarrow Y$$ not through M. This mediation model is graphically illustrated in Fig. 1, and it is the simplest mediation model presented by Hayes (2013) which is highly cited (more than 22,000 as of now) by many researchers. The first path is often referred to as the indirect effect, and the second path is often referred to as the direct effect. Hayes (2013) presented more complex mediation models than the one shown in Fig. 1.

Since Hayes (2013) developed the PROCESS macro in the statistical software SPSS, many researchers in the social sciences have applied it to test complex causal paths, and mediation models remain popular (Caniëls 2019; Garcia et al. 2018; Seli et al. 2017; Zhang et al. 2019; Emery et al. 2019; Zhu et al. 2019; Villaluz and Hechanova 2018; Manuti and Giancaspro 2019). Some of these studies were observational, and some were experimental. For instance, Zhang et al. (2019) asked subjects to report their values of two explanatory variables (workplace ostracism and leader-member exchange), and Seli et al. (2017) randomly assigned subjects to either an experimental group (manipulating motivation) or a control group (no motivation). The term “mediation” has an implicit direction of relationship, and many researchers have used the mediation model (or a more complex mediation model) based on data collected in an observational study (i.e., no randomization of X). In such observational studies, the direction of a causal relationship has been justified by common sense or an accepted theory.

The purpose of this article is not to discourage the use of a mediation model. The purpose is to remind researchers of the impact of an unobserved “precursor variable” (denoted by W in this article) on the probability of concluding the presence of an indirect effect and/or a direct effect. We use the term “precursor variable” to indicate that W precedes X, M, and Y in a causal relationship as shown in Fig. 2. The necessity of randomization (i.e., controlling X) has been accepted in scientific communities since it was first advocated by Ronald A. Fisher (Fisher 1925; Hall 2007). However, it has been shown that randomization does not completely remove bias in certain mediation analyses. VanderWeele (2010) and Imai et al. (2010a, 2010b) presented bias formulas under regression models, and Hong et al. (2018) discussed various methods of sensitivity analysis (Hong et al. 2015, 2018; VanderWeele 2015).

In this article, we review the basic regression-based mediation models (Figs. 1, 2; Hayes 2013), thoroughly present formulas in terms of the regression parameters to quantify the bias in the estimation of indirect effect and direct effect in the absence of W, and use simulations to demonstrate its consequence (inflated Type I error rate). We assume that readers of this article have background knowledge of multiple linear regression models and basic theorems in mathematical statistics including: $$Cov( X, X ) = V(X)$$ and

\begin{aligned} Cov \left( \sum _{i=1}^n a_i X_i , \sum _{j=1}^m b_j Y_j \right) = \sum _{i=1}^n \sum _{j=1}^m a_i b_j Cov( X_i, Y_j) \end{aligned}

where an uppercase letter denotes a random variable and a lowercase letter denotes a constant real number. In Sect. 2, we consider a simple case when the causal relationship from X to Y does not involve a mediator M, and a formula will demonstrate that randomization of X is enough to remove bias in this simple case. In Sect. 3, we consider a case when the causal relationship involves a mediator M, and another formula will clearly demonstrate that randomization of X is not enough to remove bias in this more complicated case. In Sect. 4, we present simulation results which demonstrate seriously inflated Type I error rates even in an experimental study. In Sect. 5, a numerical example is provided based on the data collected in the United States and previously analyzed by Guber (1999).

## Causal relationship without a mediator

Let X denote an explanatory variable of interest (observed), Y denote a response variable of interest (observed), and W denote an omitted (unobserved) precursor variable which may affect X and/or Y (see Fig. 3). Suppose a researcher assumes the simple linear model

\begin{aligned} Y = b_0 + b_1 X + \epsilon \, , \end{aligned}

and suppose the true relationship among the three random variables, W, X, and Y, is given by the two linear models

\begin{aligned} X & = \gamma _0 + \gamma _1 W + \epsilon _1^* \, , \\ Y &= \beta _0 + \beta _1 X + \beta _2 W + \epsilon _2^* \, . \end{aligned}

Figure 3 graphically illustrates this situation. If we let $$\sigma _W^2 = V(W)$$ and $$\sigma _i^2 = V(\epsilon _i^*)$$ for $$i=1,2$$, then the variances V(X) and V(Y) can be expressed in terms of $$\sigma _W^2$$, $$\sigma _1^2$$, $$\sigma _2^2$$, and the regression parameters.

Here our goal is to express $$b_1$$ (the quantity estimated by the researcher) in terms of $$\beta _1$$, $$\beta _2$$, $$\gamma _1$$, $$\sigma _W$$, and $$\sigma _1$$. Under the simple linear model assumed by the researcher,

\begin{aligned} Cov(X, Y) = Cov( X, b_0 + b_1 X + \epsilon ) = b_1 V(X) \, , \end{aligned}

so the estimand $$b_1$$ can be expressed as

\begin{aligned} b_1 = \frac{Cov(X, Y)}{V(X)} \, . \end{aligned}

Under the true relationship,

\begin{aligned} Cov(X, Y) &= Cov( X, \beta _0 + \beta _1 X + \beta _2 W + \epsilon _2^* ) \\ &= \beta _1 V(X) + \beta _2 Cov( X, W ) \\ & = \beta _1 V(X) + \beta _2 Cov( \gamma _0 + \gamma _1 W + \epsilon _1^*, W ) \\ & = \beta _1 V(X) + \beta _2 \gamma _1 V(W) \, . \end{aligned}

Therefore, the researcher eventually estimates a complex quantity

\begin{aligned} b_1 = \frac{\beta _1 V(X) + \beta _2 \gamma _1 V(W)}{V(X)} = \beta _1 + \beta _2 \gamma _1 \left( \frac{V(W)}{V(X)} \right) \, , \end{aligned}

where $$V(W) = \sigma _W^2$$ and $$V(X) = \gamma _1^2 \sigma _W^2 + \sigma _1^2$$. The researcher can ensure $$b_1 = \beta _1$$ by randomizing X (i.e., forcing $$\gamma _1 = 0$$). In an observational study (i.e., $$\gamma _1 \ne 0$$), the researcher still has $$b_1 = \beta _1$$ when $$\beta _2 = 0$$, but this condition is out of the researcher’s control.
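As a quick numerical check of the formula above, the slope of the simple regression of Y on X can be compared with $$\beta _1 + \beta _2 \gamma _1 \, V(W)/V(X)$$ by Monte Carlo simulation. The parameter values below are hypothetical choices for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameters of the true models
# X = g0 + g1*W + e1,  Y = B0 + B1*X + B2*W + e2
g0, g1 = 0.0, 1.5
B0, B1, B2 = 0.0, 2.0, -1.0
sW, s1, s2 = 5.0, 1.0, 5.0

n = 200_000
W = rng.normal(0.0, sW, n)
X = g0 + g1 * W + rng.normal(0.0, s1, n)
Y = B0 + B1 * X + B2 * W + rng.normal(0.0, s2, n)

# Slope estimated by the researcher's simple regression Y = b0 + b1*X + eps
slope = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)

# Closed-form estimand: b1 = B1 + B2*g1 * V(W) / V(X)
VX = g1**2 * sW**2 + s1**2
b1_formula = B1 + B2 * g1 * sW**2 / VX
print(slope, b1_formula)
```

With a large n, the empirical slope agrees with the closed-form estimand, which differs from $$\beta _1 = 2$$ because $$\gamma _1 \ne 0$$ here.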

## Causal relationship with a mediator

Consider a more complex case when M is a mediating variable (observed) in the causal path from X to Y. Suppose a researcher assumes the mediation model (Hayes 2013) with the following two linear models:

\begin{aligned} M & = a_0 + a_1 X + \epsilon _1 \, , \nonumber \\ Y &= b_0 + b_1 X + b_2 M + \epsilon _2 \, . \end{aligned}
(1)

Suppose the true relationship among the four random variables, W, X, M, and Y, is given by the three linear models

\begin{aligned} X & = \gamma _0 + \gamma _1 W + \epsilon _1^* \, , \nonumber \\ M &= \alpha _0 + \alpha _1 X + \alpha _2 W + \epsilon _2^* \, , \nonumber \\ Y &= \beta _0 + \beta _1 X + \beta _2 M + \beta _3 W + \epsilon _3^* \, . \end{aligned}
(2)

Figure 4 graphically illustrates this scenario. Let $$\sigma _W^2 = V(W)$$ and $$\sigma _i^2 = V(\epsilon _i^* )$$.

Under the assumed model of Fig. 4 (bottom), $$a_1 b_2$$ quantifies the indirect effect of X on Y (through M), and $$b_1$$ quantifies the direct effect (not through M, but possibly through other mediators). Our goal is to express $$a_1 b_2$$ and $$b_1$$ in terms of the model parameters in the true relationship among W, X, M, and Y which are denoted by the Greek letters in Fig. 4 (top).

### The impact of an unobserved precursor on indirect effect

For the indirect effect $$a_1 b_2$$, we first note that $$a_1$$ and $$b_2$$ quantify the relationships among X, M, and Y as follows:

\begin{aligned} a_1 & = \frac{Cov(X,M)}{V(X)} \, , \\ b_2 &= \frac{V(X) Cov(M,Y) - Cov(X,Y) Cov(X,M) }{V(X) V(M) - [Cov(X,M)]^2} \, . \end{aligned}

Note that $$a_1$$ and $$b_2$$ can be expressed as

\begin{aligned} a_1 &= \alpha _1 + \alpha _2 \gamma _1 \left( \frac{ \sigma _W^2 }{ \gamma _1^2 \sigma _W^2 + \sigma _1^2 } \right) \, , \\ b_2 &= \beta _2 + \alpha _2 \beta _3 \left( \frac{ \sigma _1^2 \sigma _W^2 }{ \alpha _2^2 \sigma _1^2 \sigma _W^2 + (\gamma _1^2 \sigma _W^2 + \sigma _1^2) \sigma _2^2 } \right) \, . \end{aligned}

Multiplying these, the researcher’s estimand $$a_1 b_2$$ for the indirect effect becomes the complex quantity

\begin{aligned} a_1 b_2 = \left[ \alpha _1 + \alpha _2 \gamma _1 \left( \frac{ \sigma _W^2 }{ \gamma _1^2 \sigma _W^2 + \sigma _1^2 } \right) \right] \left[ \beta _2 + \alpha _2 \beta _3 \left( \frac{ \sigma _1^2 \sigma _W^2 }{ \alpha _2^2 \sigma _1^2 \sigma _W^2 + (\gamma _1^2 \sigma _W^2 + \sigma _1^2) \sigma _2^2 } \right) \right] \, . \end{aligned}
(3)

Appendix 1 provides a detailed explanation of the derivation of Eq. (3).

There are two cases in which $$a_1 b_2 = \alpha _1 \beta _2$$. The first case is when the precursor W does not affect the mediator M (i.e., $$\alpha _2 = 0$$). The second case is when W affects neither the explanatory variable X nor the response variable Y (i.e., $$\gamma _1 = \beta _3 = 0$$). In either case, randomization of X (i.e., $$\gamma _1 = 0$$) is not sufficient to avoid the bias. In particular, when the true indirect effect $$\alpha _1 \beta _2 = 0$$ because $$\beta _2 = 0$$ while $$\alpha _1$$, $$\alpha _2$$, and $$\beta _3$$ are nonzero, researchers will estimate $$a_1 b_2 \ne 0$$, which leads to an inflated Type I error rate.
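A minimal sketch evaluating Eq. (3) makes the point concrete. The parameter values below are borrowed from the simulation design of Sect. 4; the function name and argument spellings (`a1` for $$\alpha_1$$, `g1` for $$\gamma_1$$, etc.) are ours:

```python
def indirect_estimand(a1, a2, b2, b3, g1, sW2, s12, s22):
    # a1 and b2 of the assumed model (Eq. (1)) implied by the true
    # model (Eq. (2)); returns their product, the estimand of Eq. (3)
    VX = g1**2 * sW2 + s12
    a1_hat = a1 + a2 * g1 * sW2 / VX
    b2_hat = b2 + a2 * b3 * s12 * sW2 / (a2**2 * s12 * sW2 + VX * s22)
    return a1_hat * b2_hat

# Randomized X (gamma_1 = 0) and zero true indirect effect (beta_2 = 0),
# yet the estimand is nonzero because alpha_2 and beta_3 are nonzero:
val = indirect_estimand(a1=1.0, a2=-1.0, b2=0.0, b3=0.5, g1=0.0,
                        sW2=25.0, s12=1.0, s22=25.0)
print(val)  # -0.25
```

Even with X randomized, a test of the indirect effect targets $$-0.25$$ rather than the true $$\alpha _1 \beta _2 = 0$$.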

### The impact on direct effect

For the direct effect $$b_1$$, it can be similarly shown that

\begin{aligned} b_1 = \frac{V(M) Cov(X,Y) - Cov(M,Y) Cov(X,M) }{V(X) V(M) - [Cov(X,M)]^2} \, . \end{aligned}

In the true causal relationship, which starts from the precursor W, it can be shown that

\begin{aligned} b_1 = \beta _1 + \beta _3 \left( \frac{ \gamma _1 \sigma _2^2 \sigma _W^2 - \alpha _1 \alpha _2 \sigma _1^2 \sigma _W^2 }{ \alpha _2^2 \sigma _1^2 \sigma _W^2 + ( \gamma _1^2 \sigma _W^2 + \sigma _1^2 ) \sigma _2^2 } \right) \, . \end{aligned}
(4)

Appendix 2 provides a detailed explanation of the derivation of Eq. (4).

There are four cases in which $$b_1 = \beta _1$$. The first case is when W does not affect Y (i.e., $$\beta _3 = 0$$). The second case is when $$\gamma _1 = \alpha _1 = 0$$, and the third case is when $$\gamma _1 = \alpha _2 = 0$$. The fourth case is $$\gamma _1 \sigma _2^2 \sigma _W^2 = \alpha _1 \alpha _2 \sigma _1^2 \sigma _W^2$$, which lacks a simple interpretation. Again, randomization of X alone (i.e., $$\gamma _1 = 0$$) does not guarantee $$b_1 = \beta _1$$. In particular, when $$\beta _1 = 0$$ is true but the bias term in Eq. (4) is nonzero (e.g., $$\gamma _1 = 0$$, $$\beta _3 \ne 0$$, and $$\alpha _1 \alpha _2 \ne 0$$), researchers will estimate $$b_1 \ne 0$$, which leads to an inflated Type I error rate. Furthermore, if $$\gamma _1 = 0$$ and $$\beta _1$$, $$\alpha _1$$, $$\alpha _2$$, and $$\beta _3$$ all have the same sign, then depending on their magnitudes (and the magnitudes of $$\sigma _W$$ and $$\sigma _2$$), $$b_1$$ and $$\beta _1$$ may have opposite signs, which leads to a misleading conclusion.
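Eq. (4) can likewise be evaluated directly. As above, the parameter values are borrowed from the simulation design of Sect. 4 and the helper function is ours:

```python
def direct_estimand(b1, b3, a1, a2, g1, sW2, s12, s22):
    # b1 of the assumed model (Eq. (1)) implied by the true model (Eq. (2))
    num = g1 * s22 * sW2 - a1 * a2 * s12 * sW2
    den = a2**2 * s12 * sW2 + (g1**2 * sW2 + s12) * s22
    return b1 + b3 * num / den

# Randomized X (gamma_1 = 0) and zero true direct effect (beta_1 = 0):
val = direct_estimand(b1=0.0, b3=0.5, a1=1.0, a2=-1.0, g1=0.0,
                      sW2=25.0, s12=1.0, s22=25.0)
print(val)  # 0.25
```

The estimated direct effect targets $$0.25$$ even though $$\beta _1 = 0$$, so the null hypothesis $$b_1 = 0$$ is rejected too often.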

## Simulation study

### Simulation designs

To demonstrate the danger of mediation analysis due to an omitted precursor W even with manipulation of X (i.e., $$\gamma _1 = 0$$), a simulation study was conducted. For all simulation scenarios, we fixed $$\beta _1 = 0$$, $$\beta _2 = 0$$, $$\sigma _W = 5$$, $$\sigma _1 = 1$$, $$\sigma _2 = 5$$, and $$\sigma _3 = 0.5$$ and varied $$\alpha _1 = 0, 0.5, 1$$; $$\alpha _2 = -1, -0.5, 0$$; and $$\beta _3 = -0.5, 0, 0.5$$ to create twenty-seven scenarios in which $$\beta _1 = 0$$ (no direct effect) and $$\alpha _1 \beta _2 = 0$$ (no indirect effect). For each scenario, we considered sample sizes $$n = 50, 100, 500, 1000$$, and we estimated (1) the probability of concluding the presence of an indirect effect (i.e., $$a_1 b_2 \ne 0$$) and (2) the probability of concluding the presence of a direct effect (i.e., $$b_1 \ne 0$$) based on bias-adjusted 95% confidence intervals (CIs) with 2000 bootstrap samples. This nonparametric method is known to be robust and is recommended for mediation analysis in practice (Hayes 2009; Preacher and Hayes 2008; Hair et al. 2014). Each scenario was replicated 1000 times; that is, out of 1000 replications per scenario, we calculated the proportion of times that a 95% CI for $$a_1 b_2$$ excludes zero and the proportion of times that a 95% CI for $$b_1$$ excludes zero.
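One replication of this design can be sketched as follows. As a simplification, the sketch uses a percentile bootstrap CI rather than the bias-adjusted CI used in the study, and the scenario is one of the null settings described above ($$\alpha _1 = 1$$, $$\alpha _2 = -1$$, $$\beta _3 = 0.5$$, $$\beta _1 = \beta _2 = 0$$, $$\gamma _1 = 0$$):

```python
import numpy as np

rng = np.random.default_rng(2)

def ols(Z, y):
    # least-squares coefficients for design matrix Z
    return np.linalg.lstsq(Z, y, rcond=None)[0]

def indirect(X, M, Y):
    # a1*b2 from the assumed models M = a0 + a1*X and Y = b0 + b1*X + b2*M
    one = np.ones_like(X)
    a = ols(np.column_stack([one, X]), M)
    b = ols(np.column_stack([one, X, M]), Y)
    return a[1] * b[2]

# One data set from the null scenario with X randomized (gamma_1 = 0)
n = 500
W = rng.normal(0, 5.0, n)
X = rng.normal(0, 1.0, n)
M = 1.0 * X - 1.0 * W + rng.normal(0, 5.0, n)
Y = 0.5 * W + rng.normal(0, 0.5, n)

boot = np.empty(2000)
for b in range(2000):
    i = rng.integers(0, n, n)
    boot[b] = indirect(X[i], M[i], Y[i])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(lo, hi)  # CI centers near -0.25 (Eq. (3)), not the true alpha1*beta2 = 0
```

The CI typically excludes zero, illustrating the false conclusion of an indirect effect.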

### Simulation results

The simulation results are summarized in Table 1. In seven scenarios (10, 12, 15, 19, 21, 22, and 24), the probability of concluding $$b_1 \ne 0$$ and the probability of concluding $$a_1 b_2 \ne 0$$ were substantially greater than 0.05 when $$n = 50$$, and the probabilities increased as n increased. The bias-adjusted method accurately estimated $$b_1$$ and $$a_1 b_2$$ according to Eqs. (4) and (3) (see Table 2), so there was a higher chance of rejecting $$b_1 = 0$$ and $$a_1 b_2 = 0$$ as n increased (i.e., shorter CIs around the true values of $$b_1$$ and $$a_1 b_2$$). For the other twenty scenarios, where $$b_1 = 0$$ and $$a_1 b_2 = 0$$ according to Eqs. (4) and (3), the probabilities of concluding $$b_1 \ne 0$$ and $$a_1 b_2 \ne 0$$ were close to 0.05. In some of the twenty scenarios, the probabilities slightly exceeded 0.05 when $$n = 50$$; these results imply that the bootstrap method may require a sample size larger than $$n = 50$$ in some cases in order to properly estimate the uncertainty in the interval estimation.

## Example

Guber (1999) discussed the impact of omitting a (confounding) variable in the association between public school expenditures and academic performance (measured by the average SAT score). In the data, an individual is a state (not a student), and the data consist of all 50 states in the United States.

In this section, to demonstrate the role of a potential precursor variable, we turn our focus to the association between the state average annual salary of teachers in public schools (denoted X; in thousands of US dollars) and the state average SAT score (denoted Y). If we consider the simple linear regression $$Y = c_0 + c_1 X + \epsilon$$, the estimated slope is $${\hat{c}}_1 = -5.5396$$. This result suggests that a higher average annual salary is associated with lower average SAT performance ($$p = 0.001$$).

An important (potential mediating) variable may be the percentage (%) of students taking the SAT in each state (denoted by M). Suppose we model $$M = a_0 + a_1 X + \epsilon _1$$ and $$Y = b_0 + b_1 X + b_2 M + \epsilon _2$$ as shown in Eq. (1). The estimated regression parameters are $${\hat{a}}_1 = 2.7783$$, $${\hat{b}}_1 = 2.1804$$, and $${\hat{b}}_2 = -2.7787$$. Conditioning on the potential mediating variable (% SAT takers), it appears that a higher average annual salary is associated with a higher average SAT performance ($$p = 0.039$$). The opposite signs between $${\hat{c}}_1 = -5.54$$ and $${\hat{b}}_1 = 2.18$$ are due to the fact $${\hat{c}}_1 = {\hat{b}}_1 + {\hat{a}}_1 {\hat{b}}_2$$, where $${\hat{a}}_1 > 0$$ and $${\hat{b}}_2 < 0$$. More SAT takers tend to lower the state average SAT score (Guber 1999).
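The decomposition $${\hat{c}}_1 = {\hat{b}}_1 + {\hat{a}}_1 {\hat{b}}_2$$ can be checked directly from the reported estimates:

```python
# Reported estimates from the Guber (1999) reanalysis above
c1 = -5.5396                      # slope of Y on X alone
a1, b1, b2 = 2.7783, 2.1804, -2.7787

# Total association = direct part + indirect part
print(b1 + a1 * b2)               # approximately -5.5397, matching c1
```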

Given the adjusted estimate $${\hat{b}}_1 = 2.18$$ with the small p value ($$p = 0.039$$), can we conclude that the state average annual salary and the state average SAT performance are positively associated? Let us consider a potential precursor variable, the state expenditure per student in public schools (denoted W; in thousands of US dollars), as W may affect X, M, and Y. Using the three regression models presented in Eq. (2), we can estimate the regression parameters as shown at the bottom of Fig. 5 with $${\hat{\sigma }}_W^2 = 1.8201$$, $${\hat{\sigma }}_1^2 = 8.4214$$, and $${\hat{\sigma }}_2^2 = 425.7959$$. Note that $${\hat{\beta }}_1 = -0.31$$ with $$p = 0.853$$ suggests no strong evidence for a positive association. The opposite signs of $${\hat{b}}_1 = 2.18$$ and $${\hat{\beta }}_1 = -0.31$$ can be explained by Eq. (4) with the estimated regression parameters,

\begin{aligned} {\hat{b}}_1 = {\hat{\beta }}_1 + {\hat{\beta }}_3 \left( \frac{ {\hat{\gamma }}_1 {\hat{\sigma }}_2^2 {\hat{\sigma }}_W^2 - {\hat{\alpha }}_1 {\hat{\alpha }}_2 {\hat{\sigma }}_1^2 {\hat{\sigma }}_W^2 }{ {\hat{\alpha }}_2^2 {\hat{\sigma }}_1^2 {\hat{\sigma }}_W^2 + ( {\hat{\gamma }}_1^2 {\hat{\sigma }}_W^2 + {\hat{\sigma }}_1^2 ) {\hat{\sigma }}_2^2 } \right) \, . \end{aligned}

The relatively large positive estimates $${\hat{\beta }}_3 = 13.33$$, $${\hat{\gamma }}_1 = 3.79$$, and $${\hat{\sigma }}_2^2 = 425.8$$ explain how omitting the precursor variable (state expenditure), which might affect the whole mechanism among the state average annual salary of teachers, the percentage of students taking the SAT, and the state SAT performance, could shift the estimate from the negative (or nearly zero) $${\hat{\beta }}_1 = -0.31$$ to the positive $${\hat{b}}_1 = 2.18$$ with a small p value. The estimated indirect association $${\hat{\alpha }}_1 {\hat{\beta }}_2 = -5.32$$ further suggests that a higher state average salary of teachers does not help improve the average SAT performance.

## Discussion

The primary focus of the simulation study was an inflated Type I error rate (i.e., a higher chance of falsely claiming a direct and/or indirect effect) in a mediation analysis due to an unobserved precursor variable. Though the example in Sect. 5 was an observational study, the message was similar. Even for an experimental study, the simulation results alert researchers to the inflated Type I error rate (worse when n is larger), and the expressions (Eqs. (3) and (4)) explain why the bias exists even after randomization of X (i.e., forcing $$\gamma _1 = 0$$) in the presence of a mediator (Fig. 4). The uncomfortable fact is that an experimental design, which forces $$V(X) = \gamma _1^2 \sigma _W^2 + \sigma _1^2 = \sigma _1^2$$, cannot fix the bias in particular null cases. For instance, if $$\alpha _1 \ne 0$$ and $$\beta _2 = 0$$ (i.e., zero indirect effect), according to Eq. (3),

\begin{aligned} a_1 b_2 = \alpha _1 \alpha _2 \beta _3 \left( \frac{\sigma _1^2 \sigma _W^2}{\alpha _2^2 \sigma _1^2 \sigma _W^2 + \sigma _1^2 \sigma _2^2} \right) = \alpha _1 \alpha _2 \beta _3 \left( \frac{\sigma _W^2}{\alpha _2^2 \sigma _W^2 + \sigma _2^2} \right) \end{aligned}

which is independent of $$V(X) = \sigma _1^2$$. For another instance, if $$\beta _1 = 0$$ (i.e., zero direct effect), according to equation (4),

\begin{aligned} b_1 = \beta _3 \left( \frac{ \gamma _1 \sigma _2^2 \sigma _W^2 - \alpha _1 \alpha _2 \sigma _1^2 \sigma _W^2 }{ \alpha _2^2 \sigma _1^2 \sigma _W^2 + ( \gamma _1^2 \sigma _W^2 + \sigma _1^2 ) \sigma _2^2 } \right) = -\beta _3 \left( \frac{\alpha _1 \alpha _2 \sigma _W^2}{\alpha _2^2 \sigma _W^2 + \sigma _2^2} \right) \end{aligned}

which is independent of $$V(X) = \sigma _1^2$$ again.

In this article, we focused on one precursor variable. In practice, there can be two or more precursor variables. The take-home messages for reducing bias in the estimation of $$\alpha _1 \beta _2$$ (indirect effect) and $$\beta _1$$ (direct effect) are clear. First, randomize X (i.e., $$\gamma _1 = 0$$) if possible. Second, during data collection, researchers should record variables that are potentially related to M and Y and adjust for them in the regression analysis. A large sample size can tolerate a mild degree of over-fitting due to many W’s adjusted for in the model. The adjustment will help researchers estimate $$\alpha _1 \beta _2$$ and $$\beta _1$$ with small bias and perform hypothesis testing with reduced inflation of the Type I error rate.

## References

• Adams, R. C., Challenger, A., Bratton, L., Boivin, J., Bott, L., Powell, G., et al. (2019). Claims of causality in health news: A randomised trial. BMC Medicine, 17(1), 91. https://doi.org/10.1186/s12916-019-1324-7.

• Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51(6), 1173–1182.

• Caniëls, M. C. (2019). Proactivity and supervisor support in creative process engagement. European Management Journal, 37(2), 188–197.

• Emery, C., Booth, J. E., Michaelides, G., & Swaab, A. J. (2019). The importance of being psychologically empowered: Buffering the negative effects of employee perceptions of Leader–Member Exchange differentiation. Journal of Occupational and Organizational Psychology, 92(3), 566–592.

• Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver and Boyd.

• Garcia, P. R. J. M., Restubog, S. L. D., Ocampo, A. C., Wang, L., & Tang, R. L. (2018). Role modeling as a socialization mechanism in the transmission of career adaptability across generations. Journal of Vocational Behavior, 111, 39–48.

• Glass, T. A., Goodman, S. N., Hernán, M. A., & Samet, J. M. (2013). Causal inference in public health. Annual Review of Public Health, 34, 61–75.

• Guber, D. L. (1999). Getting what you pay for: The debate over equity in public school expenditures. Journal of Statistics Education,7(2).

• Hair J. F., Hult, G. T. M., Ringle, C., & Sarstedt, M. (2014). A primer on partial least squares structural equation modeling (PLS SEM).

• Hall, N. S. (2007). R. A. Fisher and his advocacy of randomization. Journal of the History of Biology, 40, 295–325.

• Hayes, A. F. (2009). Beyond Baron and Kenny: Statistical mediation analysis in the new millennium. Communication Monographs, 76(4), 408–420.

• Hayes, A. F. (2013). Methodology in the social sciences. Introduction to mediation, moderation, and conditional process analysis: A regression-based approach. New York: Guilford Press.

• Hong, G., Deutsch, J., & Hill, H. D. (2015). Ratio-of-mediator-probability weighting for causal mediation analysis in the presence of treatment-by-mediator interaction. Journal of Educational and Behavioral Statistics, 40(3), 307–340.

• Hong, G., Qin, X., & Yang, F. (2018). Weighting-based sensitivity analysis in causal mediation studies. Journal of Educational and Behavioral Statistics, 43(1), 32–56.

• Imai, K., Keele, L., & Tingley, D. (2010a). A general approach to causal mediation analysis. Psychological Methods, 15(4), 309–334.

• Imai, K., Keele, L., & Yamamoto, T. (2010b). Identification, inference and sensitivity analysis for causal mediation effects. Statistical Science, 25, 51–71.

• Kang, J. (2014). Overview and practice of causal inference in observational studies. Biometrics & Biostatistics International Journal, 1(1), 00002.

• Manuti, A., & Giancaspro, M. (2019). People make the difference: An explorative study on the relationship between organizational practices, employees’ resources, and organizational behavior enhancing the psychology of sustainability and sustainable development. Sustainability, 11(5), 1499.

• Preacher, K. J., & Hayes, A. F. (2008). Asymptotic and resampling strategies for assessing and comparing indirect effects in multiple mediator models. Behavior Research Methods, 40, 879–891.

• Rohrer, J. M. (2018). Thinking clearly about correlations and causation: Graphical causal models for observational data. Advances in Methods and Practices in Psychological Science, 1(1), 27–42.

• Seli, P., Schacter, D. L., Risko, E. F., & Smilek, D. (2017). Increasing participant motivation reduces rates of intentional and unintentional mind wandering. Psychological Research, 83(5), 1057–1069.

• VanderWeele, T. J. (2010). Bias formulas for sensitivity analysis for direct and indirect effects. Epidemiology, 21(4), 540–551.

• VanderWeele, T. J. (2015). Explanation in causal inference: Methods for mediation and interaction. New York: Oxford University Press.

• Villaluz, V., & Hechanova, M. (2018). Ownership and leadership in building an innovation culture. Leadership & Organization Development Journal, 40(2), 138–150.

• Zhang, L., Fan, C., Deng, Y., Lam, C. F., Hu, E., & Wang, L. (2019). Exploring the interpersonal determinants of job embeddedness and voluntary turnover: A conservation of resources perspective. Human Resource Management Journal, 29(3), 413–432.

• Zhu, H., Wong, N., & Huang, M. (2019). Does relationship matter? How social distance influences perceptions of responsibility on anthropomorphized environmental objects and conservation intentions. Journal of Business Research, 95, 62–70.

## Author information


### Corresponding author

Correspondence to Joonghak Lee.


## Appendices

### Appendix 1: Bias in the estimation of indirect effect

Under the assumed model $$M = a_0 + a_1 X + \epsilon _1$$, the covariance between X and M is given by

\begin{aligned} Cov(X, M) = Cov( X, a_1 X) = a_1 V(X) \end{aligned}

so,

\begin{aligned} a_1 = \frac{ Cov(X, M) }{ V(X) } \, . \end{aligned}
(5)

The assumed model $$Y = b_0 + b_1 X + b_2 M + \epsilon _2$$ can be written as $$Y - b_1 X = b_0 + b_2 M + \epsilon _2$$, so

\begin{aligned} b_2 = \frac{ Cov(M, Y - b_1X) }{ V(M) } = \frac{ Cov( Y, M ) - b_1 Cov( X, M ) }{ V(M) } \, . \end{aligned}
(6)

Similarly, it can be written as $$Y - b_2 M = b_0 + b_1 X + \epsilon _2$$, so

\begin{aligned} b_1 = \frac{ Cov(X, Y - b_2 M) }{ V(X) } = \frac{ Cov( X, Y ) - b_2 Cov( X, M ) }{ V(X) } \, . \end{aligned}

Therefore, Eq. (6) can be written as

\begin{aligned} b_2 &= \frac{ Cov( Y, M ) - \left( \frac{ Cov( X, Y ) - b_2 Cov( X, M ) }{ V(X) } \right) Cov( X, M ) }{ V(M) } \\ &= \frac{ V(X) Cov(Y, M) - Cov(X, Y) Cov(X, M) + b_2 [Cov(X, M) ]^2 }{ V(X) V(M) } \, . \end{aligned}

By solving for $$b_2$$,

\begin{aligned} b_2 = \frac{ V(X) Cov(Y, M) - Cov(X, Y) Cov(X, M) }{ V(X) V(M) - [Cov(X, M) ]^2 } \, . \end{aligned}
(7)

Now consider the three true models:

\begin{aligned} X & = \gamma _0 + \gamma _1 W + \epsilon _1^* \, , \\ M & = \alpha _0 + \alpha _1 X + \alpha _2 W + \epsilon _2^* , \\ Y & = \beta _0 + \beta _1 X + \beta _2 M + \beta _3 W + \epsilon _3^* \, . \end{aligned}

From the true relationships among W, X, M, and Y, we have

\begin{aligned} V(X) = V(\gamma _1 W + \epsilon _1^*) = \gamma _1^2 \sigma _W^2 + \sigma _1^2 \end{aligned}
(8)

and

\begin{aligned} Cov(X, M) & = Cov( X, \alpha _1 X + \alpha _2 W ) \nonumber \\ & = \alpha _1 V(X) + \alpha _2 Cov(X, W) \nonumber \\ & = \alpha _1 (\gamma _1^2 \sigma _W^2 + \sigma _1^2) + \alpha _2 \gamma _1 \sigma _W^2 \, . \end{aligned}
(9)

Therefore, Eq. (5) can be expressed as the true model parameters as

\begin{aligned} a_1 = \alpha _1 + \frac{ \alpha _2 \gamma _1 \sigma _W^2 }{ \gamma _1^2 \sigma _W^2 + \sigma _1^2 } \, . \end{aligned}

To express $$b_2$$ in Eq. (7) in terms of the true model parameters, we need to rewrite V(M), Cov(Y, M), and Cov(X, Y) as follows. For V(M), we first express

\begin{aligned} M &= \alpha _0 + \alpha _1 (\gamma _0 + \gamma _1 W + \epsilon _1^*) + \alpha _2 W + \epsilon _2^* \\ & = (\alpha _0 + \alpha _1 \gamma _0) + (\alpha _1 \gamma _1 + \alpha _2)W + \alpha _1 \epsilon _1^* + \epsilon _2^* \, , \end{aligned}

so

\begin{aligned} V(M) = (\alpha _1\gamma _1 + \alpha _2)^2 \sigma _W^2 + \alpha _1^2 \sigma _1^2 + \sigma _2^2 \, . \end{aligned}
(10)

For Cov(Y, M), note that

\begin{aligned} Cov(Y, M) & = Cov(M, Y) \\ &= Cov( M, \beta _1 X + \beta _2 M + \beta _3 W ) \\ &= \beta _1 Cov(M, X) + \beta _2 V(M) + \beta _3 Cov(M, W) \, , \end{aligned}

where we previously wrote

\begin{aligned} Cov( X, M ) &= \alpha _1 \sigma _1^2 + (\alpha _1 \gamma _1 + \alpha _2) \gamma _1 \sigma _W^2 \, , \\ V(M) &= (\alpha _1\gamma _1 + \alpha _2)^2 \sigma _W^2 + \alpha _1^2 \sigma _1^2 + \sigma _2^2 \, . \end{aligned}

Further note that $$Cov(X, W) = Cov(\gamma _1 W, W) = \gamma _1 \sigma _W^2$$, and

\begin{aligned} Cov( M, W ) &= Cov( \alpha _1 X + \alpha _2 W, W ) \\ &= \alpha _1 Cov(X, W) + \alpha _2 \sigma _W^2 \\ &= \alpha _1 \gamma _1 \sigma _W^2 + \alpha _2 \sigma _W^2 \\ & = ( \alpha _1 \gamma _1 + \alpha _2 ) \sigma _W^2 \, . \end{aligned}

Therefore,

\begin{aligned} Cov(Y, M) & = \beta _1 [ \alpha _1 \sigma _1^2 + (\alpha _1 \gamma _1 + \alpha _2) \gamma _1 \sigma _W^2 ] \nonumber \\&+ \beta _2 [ (\alpha _1\gamma _1 + \alpha _2)^2 \sigma _W^2 + \alpha _1^2 \sigma _1^2 + \sigma _2^2 ] \nonumber \\&+ \beta _3 [ ( \alpha _1 \gamma _1 + \alpha _2 ) \sigma _W^2 ] \, . \end{aligned}
(11)

For Cov(X, Y), substituting our previous results,

\begin{aligned} Cov(X, Y) &= Cov( X, \beta _1 X + \beta _2 M + \beta _3 W ) \nonumber \\ &= \beta _1 V(X) + \beta _2 Cov(X, M) + \beta _3 Cov(X, W) \nonumber \\ & = \beta _1 ( \gamma _1^2 \sigma _W^2 + \sigma _1^2 ) + \beta _2 [ \alpha _1 \sigma _1^2 + (\alpha _1 \gamma _1 + \alpha _2) \gamma _1 \sigma _W^2 ] \nonumber \\&+ \beta _3 \gamma _1 \sigma _W^2 \, . \end{aligned}
(12)

After some algebraic work, it can be shown that the denominator of $$b_2$$ in Eq. (7) can be simplified as

\begin{aligned} V(X) V(M) - [Cov(X, M)]^2 = \alpha _2^2 \sigma _1^2 \sigma _W^2 + (\gamma _1^2 \sigma _W^2 + \sigma _1^2) \sigma _2^2 \, , \end{aligned}

and the numerator of $$b_2$$ in Eq. (7) can be expressed as

\begin{aligned} V(X) Cov(Y, M) - Cov(X, Y) Cov(X, M) &= \beta _2 [ \alpha _2 \sigma _1^2 \sigma _W^2 + (\gamma _1^2 \sigma _W^2 + \sigma _1^2) \sigma _2^2 ] \\&+ \beta _3 \alpha _2 \sigma _1^2 \sigma _W^2 \, . \end{aligned}

To this end, we can express $$b_2$$ in Eq. (7) as

\begin{aligned} b_2 = \frac{ \beta _2 [ \alpha _2 \sigma _1^2 \sigma _W^2 + (\gamma _1^2 \sigma _W^2 + \sigma _1^2) \sigma _2^2 ] + \beta _3 \alpha _2 \sigma _1^2 \sigma _W^2 }{ \alpha _2^2 \sigma _1^2 \sigma _W^2 + (\gamma _1^2 \sigma _W^2 + \sigma _1^2) \sigma _2^2 } \end{aligned}

which simplifies as

\begin{aligned} b_2 = \beta _2 + \beta _3 \left( \frac{ \alpha _2 \sigma _1^2 \sigma _W^2 }{ \alpha _2^2 \sigma _1^2 \sigma _W^2 + ( \gamma _1^2 \sigma _W^2 + \sigma _1^2 ) \sigma _2^2 } \right) \, . \end{aligned}

Therefore, due to an unobserved precursor variable W, researchers would estimate $$a_1 b_2$$ which is equal to

\begin{aligned} a_1 b_2 = \left[ \alpha _1 + \alpha _2 \gamma _1 \frac{ \sigma _W^2 }{ \gamma _1^2 \sigma _W^2 + \sigma _1^2 } \right] \left[ \beta _2 + \alpha _2 \beta _3 \left( \frac{ \sigma _1^2 \sigma _W^2 }{ \alpha _2^2 \sigma _1^2 \sigma _W^2 + ( \gamma _1^2 \sigma _W^2 + \sigma _1^2 ) \sigma _2^2 } \right) \right] \, . \end{aligned}

This quantity is not equal to $$\alpha _1 \beta _2$$ in general.

### Appendix 2: Bias in the estimation of direct effect

From Eqs. (6) and (7),

\begin{aligned} b_1&= \frac{ Cov( X, Y ) - b_2 Cov( X, M ) }{ V(X) } \\&= \frac{ Cov( X, Y ) - \left( \frac{ Cov( Y, M ) - b_1 Cov( X, M ) }{ V(M) } \right) Cov( X, M ) }{ V(X) } \\&= \frac{ V(M) Cov(X, Y) - Cov(Y, M) Cov(X, M) + b_1 [ Cov(X,M) ]^2 }{ V(X) V(M) } \, , \end{aligned}

which can be expressed as

\begin{aligned} b_1 = \frac{ V(M) Cov(X, Y) - Cov(Y, M) Cov(X, M) }{ V(X) V(M) - [ Cov(X,M) ]^2 } \, . \end{aligned}

Recall Eqs. (8)–(12) for each term of $$b_1$$. After some algebraic work, the numerator of $$b_1$$ can be simplified as

\begin{aligned}&V(M) Cov(X, Y) - Cov(Y, M) Cov(X, M) = \beta _1 [ \alpha _2^2 \sigma _1^2 \sigma _W^2 + (\gamma _1^2 \sigma _W^2 + \sigma _1^2) \sigma _2^2 ] \\&+ \beta _3 ( \gamma _1 \sigma _2^2 \sigma _W^2 - \alpha _1 \alpha _2 \sigma _1^2 \sigma _W^2 ) \, , \end{aligned}

and the denominator of $$b_1$$ can be simplified as

\begin{aligned} V(X) V(M) - [ Cov(X,M) ]^2 = \alpha _2^2 \sigma _1^2 \sigma _W^2 + (\gamma _1^2 \sigma _W^2 + \sigma _1^2) \sigma _2^2 \, . \end{aligned}

To this end, we can express $$b_1$$ as

\begin{aligned} b_1 = \frac{ \beta _1 ( \gamma _1^2 \sigma _2^2 \sigma _W^2 + \alpha _2^2 \sigma _1^2 \sigma _W^2 + \sigma _1^2 \sigma _2^2 ) + \beta _3 ( \gamma _1 \sigma _2^2 \sigma _W^2 - \alpha _1 \alpha _2 \sigma _1^2 \sigma _W^2 ) }{ \alpha _2^2 \sigma _1^2 \sigma _W^2 + (\gamma _1^2 \sigma _W^2 + \sigma _1^2) \sigma _2^2 } \end{aligned}

which simplifies as

\begin{aligned} b_1 = \beta _1 + \beta _3 \left( \frac{ \gamma _1 \sigma _2^2 \sigma _W^2 - \alpha _1 \alpha _2 \sigma _1^2 \sigma _W^2 }{ \alpha _2^2 \sigma _1^2 \sigma _W^2 + (\gamma _1^2 \sigma _W^2 + \sigma _1^2) \sigma _2^2 } \right) \, , \end{aligned}

and it is not equal to $$\beta _1$$ in general.


Kim, S.B., Lee, J. Regression-based mediation analysis: a formula for the bias due to an unobserved precursor variable. J. Korean Stat. Soc. 50, 1058–1076 (2021). https://doi.org/10.1007/s42952-021-00105-9


### Keywords

• Mediation analysis
• Regression
• Bias
• Precursor variable