Regression-based mediation analysis: a formula for the bias due to an unobserved precursor variable

Researchers want to know whether a change in an explanatory variable X affects a response variable Y (i.e., whether X causes Y). In practice, there can be two causal paths from X to Y: the path through a mediating variable M (the indirect effect) and the path not through M (the direct effect). Parameter estimation and hypothesis testing can be performed with a regression-based mediation model. It is already known that randomization of X is not enough for unbiased estimation, and the bias due to an unobserved variable has been discussed in the literature but is often overlooked. In this article, we first review the challenge under a simple mediation model, and then we provide a formula for the exact bias due to an unobserved precursor variable W, a variable that potentially causes changes in X, M, and/or Y. We present simulation studies to demonstrate the impact of an unobserved precursor variable on hypothesis testing for the indirect effect and the direct effect. The simulation results show that the inflation of the Type I error rate is serious, particularly in a large-sample study. To demonstrate the exact bias formula numerically, we revisit a popular data set published in a journal of statistics education and quantify why the conclusion of the data analysis can differ before and after accounting for the precursor variable. The results underscore the importance of a precursor variable in mediation analysis.


Introduction
In educational and behavioral research and related areas, researchers often want to answer the scientific question of whether a change in X affects a change in Y (i.e., whether there is a causal relationship from X to Y). In this case, X is referred to as the explanatory variable (or independent variable), and Y is referred to as the response variable (dependent variable or outcome variable). In this article, X is a Bernoulli random variable (one or zero), a discrete random variable, or a continuous random variable, and Y is a continuous random variable. The causal relationship is often denoted by X → Y, and the direction of the arrow matters. If a researcher controls the change in X, the study is called an experimental study; otherwise, it is called an observational study. Caveats and challenges of causal inference through an observational study have been widely discussed in various disciplines (Glass et al. 2013; Kang 2014; Rohrer 2018; Adams et al. 2019).
In mediation analysis (Baron and Kenny 1986; Hayes 2013), suppose there are two causal paths from X to Y. The first path is X → M → Y, where M is referred to as the mediating variable (mediator or intermediate variable). The second path is X → Y, not through M. This mediation model is graphically illustrated in Fig. 1. It is the simplest mediation model presented by Hayes (2013), a work that has been cited more than 22,000 times as of this writing. The first path is often referred to as the indirect effect, and the second path is often referred to as the direct effect. Hayes (2013) also presented more complex mediation models than the one shown in Fig. 1.
Since Hayes (2013) developed the PROCESS macro for the statistical software SPSS, many researchers in the social sciences have applied it to establish complex causal paths, and mediation models remain popular (Caniëls 2019; Garcia et al. 2018; Seli et al. 2017; Zhang et al. 2019; Emery et al. 2019; Zhu et al. 2019; Villaluz and Hechanova 2018; Manuti and Giancaspro 2019). Some of these studies were observational, and some were experimental. For instance, Zhang et al. (2019) asked subjects to report their values of two explanatory variables (workplace ostracism and leader-member exchange), whereas Seli et al. (2017) randomly assigned subjects to either an experimental group (manipulating motivation) or a control group (no motivation). The term "mediation" implies a direction of the relationship, yet many researchers have used the mediation model (or a more complex mediation model) with data collected in an observational study (i.e., no randomization of X). In such observational studies, the direction of a causal relationship has been justified by common sense or an accepted theory.
The purpose of this article is not to discourage the use of a mediation model. The purpose is to remind researchers of the impact of an unobserved "precursor variable" (denoted by W in this article) on the probability of concluding the presence of an indirect effect and/or a direct effect. We use the term "precursor variable" to indicate that W precedes X, M, and Y in a causal relationship, as shown in Fig. 2. The necessity of randomization (i.e., controlling X) has been accepted in scientific communities since it was first advocated by Ronald A. Fisher (Fisher 1925; Hall 2007). However, it has been shown that randomization does not completely remove bias in certain mediation analyses. VanderWeele (2010) and Imai et al. (2010a, 2010b) presented bias formulas under regression models, and Hong et al. (2018) discussed various methods of sensitivity analysis (Hong et al. 2015, 2018; VanderWeele 2015).
Fig. 1 The simplest mediation model presented by Hayes (2013)
In this article, we review the basic regression-based mediation models (Figs. 1, 2; Hayes 2013), present formulas in terms of the regression parameters that quantify the bias in the estimation of the indirect effect and the direct effect when W is omitted, and use simulations to demonstrate the consequence (an inflated Type I error rate). We assume that readers of this article have background knowledge of multiple linear regression models and basic theorems in mathematical statistics, including $\mathrm{Cov}(X, X) = V(X)$ and the linearity of covariance, $\mathrm{Cov}(aX + bY, cZ) = ac\,\mathrm{Cov}(X, Z) + bc\,\mathrm{Cov}(Y, Z)$, where an uppercase letter denotes a random variable and a lowercase letter denotes a constant real number. In Sect. 2, we consider a simple case in which the causal relationship from X to Y does not involve a mediator M, and a formula demonstrates that randomization of X is enough to remove bias in this simple case. In Sect. 3, we consider a case in which the causal relationship involves a mediator M, and another formula demonstrates that randomization of X is not enough to remove bias in this more complicated case. In Sect. 4, we present simulation results which demonstrate seriously inflated Type I error rates even in an experimental study. In Sect. 5, a numerical example is provided based on data collected in the United States and previously analyzed by Guber (1999).

Causal relationship without a mediator
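Let X denote an explanatory variable of interest (observed), Y denote a response variable of interest (observed), and W denote an omitted (unobserved) precursor variable which may affect X and/or Y (see Fig. 3). Suppose a researcher assumes the simple linear model
$$Y = b_0 + b_1 X + \epsilon,$$
and suppose the true relationship among the three random variables, W, X, and Y, is given by the two linear models
$$X = \alpha_0 + \alpha_1 W + \epsilon_1^*, \qquad Y = \beta_0 + \beta_1 X + \beta_2 W + \epsilon_2^*,$$
with errors that are uncorrelated with W and with each other. If we let $\sigma_W^2 = V(W)$ and $\sigma_i^2 = V(\epsilon_i^*)$ for $i = 1, 2$, then the variances V(X) and V(Y) can be expressed in terms of $\sigma_W^2$, $\sigma_1^2$, $\sigma_2^2$, and the regression parameters.
Here our goal is to express $b_1$ (the quantity estimated by the researcher) in terms of $\beta_1$, $\beta_2$, $\alpha_1$, $\sigma_W$, and $\sigma_1$. Under the simple linear model assumed by the researcher,
$$\mathrm{Cov}(X, Y) = \mathrm{Cov}(X, b_1 X) = b_1 V(X),$$
so the estimand $b_1$ can be expressed as
$$b_1 = \frac{\mathrm{Cov}(X, Y)}{V(X)}.$$
Under the true relationship,
$$\mathrm{Cov}(X, Y) = \mathrm{Cov}(X, \beta_1 X + \beta_2 W) = \beta_1 V(X) + \beta_2 \mathrm{Cov}(X, W) = \beta_1 V(X) + \alpha_1 \beta_2 V(W).$$
Therefore, the researcher eventually estimates a complex quantity
$$b_1 = \beta_1 + \frac{\alpha_1 \beta_2 \sigma_W^2}{\alpha_1^2 \sigma_W^2 + \sigma_1^2},$$
where $V(W) = \sigma_W^2$ and $V(X) = \alpha_1^2 \sigma_W^2 + \sigma_1^2$. The researcher can achieve $b_1 = \beta_1$ by randomization of X (i.e., $\alpha_1 = 0$). In an observational study (i.e., $\alpha_1 \neq 0$), the researcher still estimates $b_1 = \beta_1$ when $\beta_2 = 0$ (i.e., W does not affect Y), but this is out of the researcher's control.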
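The bias expression for the no-mediator case can be checked numerically. The following is a minimal sketch and not the authors' code: it assumes normally distributed W and errors, zero intercepts, and illustrative parameter values chosen here; the function names `slope_estimand` and `simulated_slope` are hypothetical.

```python
import random

def slope_estimand(beta1, beta2, alpha1, sigma_w, sigma_1):
    """Closed-form b1 targeted by regressing Y on X alone when W is omitted."""
    var_w = sigma_w ** 2
    var_x = alpha1 ** 2 * var_w + sigma_1 ** 2  # V(X)
    return beta1 + alpha1 * beta2 * var_w / var_x

def simulated_slope(beta1, beta2, alpha1, sigma_w, sigma_1, sigma_2, n, seed=1):
    """OLS slope of Y on X from data generated under the two true models."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        w = rng.gauss(0.0, sigma_w)                 # unobserved precursor
        x = alpha1 * w + rng.gauss(0.0, sigma_1)    # X depends on W
        y = beta1 * x + beta2 * w + rng.gauss(0.0, sigma_2)
        xs.append(x)
        ys.append(y)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Observational setting (alpha1 != 0): the estimand differs from beta1 = 0.5.
target = slope_estimand(0.5, 1.0, 1.0, 2.0, 1.0)
estimate = simulated_slope(0.5, 1.0, 1.0, 2.0, 1.0, 1.0, n=100_000)
# Randomized setting (alpha1 = 0): the estimand equals beta1.
randomized = slope_estimand(0.5, 1.0, 0.0, 2.0, 1.0)
print(target, round(estimate, 3), randomized)
```

With these illustrative values the closed form gives 0.5 + 4/5 = 1.3 in the observational setting and 0.5 under randomization, and the simulated slope should land close to the former.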

Causal relationship with a mediator
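Consider a more complex case in which M is a mediating variable (observed) in the causal path from X to Y. Suppose a researcher assumes the mediation model (Hayes 2013) with the following two linear models:
$$M = a_0 + a_1 X + \epsilon_1, \qquad Y = b_0 + b_1 X + b_2 M + \epsilon_2. \tag{1}$$
Suppose the true relationship among the four random variables, W, X, M, and Y, is given by the three linear models
$$X = \alpha_0 + \alpha_1 W + \epsilon_1^*, \qquad M = \beta_0 + \beta_1 X + \beta_2 W + \epsilon_2^*, \qquad Y = \gamma_0 + \gamma_1 X + \gamma_2 M + \gamma_3 W + \epsilon_3^*. \tag{2}$$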

The impact of an unobserved precursor on indirect effect
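For the indirect effect $a_1 b_2$, we first note that $a_1$ and $b_2$ quantify the relationships among X, M, and Y as follows:
$$\mathrm{Cov}(X, M) = a_1 V(X), \qquad \mathrm{Cov}(X, Y) = b_1 V(X) + b_2 \mathrm{Cov}(X, M), \qquad \mathrm{Cov}(M, Y) = b_1 \mathrm{Cov}(X, M) + b_2 V(M).$$
Note that $a_1$ and $b_2$ can be expressed as
$$a_1 = \frac{\mathrm{Cov}(X, M)}{V(X)}, \qquad b_2 = \frac{V(X)\,\mathrm{Cov}(M, Y) - \mathrm{Cov}(X, M)\,\mathrm{Cov}(X, Y)}{V(X)\,V(M) - \mathrm{Cov}(X, M)^2}.$$
Under the true relationship in Eq. (2), with $\sigma_W^2 = V(W)$ and $\sigma_i^2 = V(\epsilon_i^*)$ for $i = 1, 2, 3$, the researcher's estimand $a_1 b_2$ for the indirect effect becomes a very complex quantity
$$a_1 b_2 = \left( \beta_1 + \frac{\alpha_1 \beta_2 \sigma_W^2}{\alpha_1^2 \sigma_W^2 + \sigma_1^2} \right) \left( \gamma_2 + \frac{\beta_2 \gamma_3 \sigma_W^2 \sigma_1^2}{\beta_2^2 \sigma_W^2 \sigma_1^2 + \sigma_2^2 (\alpha_1^2 \sigma_W^2 + \sigma_1^2)} \right). \tag{3}$$
Appendix 1 provides a detailed explanation of the derivation of Eq. (3). There are two cases when $a_1 b_2 = \beta_1 \gamma_2$. The first case is when the precursor W does not affect the mediator M (i.e., $\beta_2 = 0$). The second case is when W affects neither the explanatory variable X nor the response variable Y (i.e., $\alpha_1 = \gamma_3 = 0$). Randomization of X (i.e., $\alpha_1 = 0$) alone is not sufficient to avoid the bias, because it leaves $\beta_2$ and $\gamma_3$ unconstrained. In particular, when $\beta_1 \gamma_2 = 0$ is true with $\beta_1 \neq 0$ and $\gamma_2 = 0$, researchers will estimate $a_1 b_2 \neq 0$ whenever $\beta_2 \gamma_3 \neq 0$, which leads to an inflated Type I error rate.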
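The closed form for the indirect-effect estimand can be verified without simulation by computing the population covariances implied by the three true models and plugging them into the covariance expressions for $a_1$ and $b_2$. The sketch below uses exact rational arithmetic; it assumes mutually uncorrelated errors and precursor, and the function names and parameter values are illustrative only.

```python
from fractions import Fraction as F

def ols_targets(alpha1, beta1, beta2, gamma1, gamma2, gamma3, var_w, var_1, var_2):
    """(a1, b2) implied by the population covariances of the true models."""
    var_x = alpha1 ** 2 * var_w + var_1               # V(X)
    cov_xw = alpha1 * var_w                           # Cov(X, W)
    cov_xm = beta1 * var_x + beta2 * cov_xw           # Cov(X, M)
    cov_mw = beta1 * cov_xw + beta2 * var_w           # Cov(M, W)
    var_m = beta1 * cov_xm + beta2 * cov_mw + var_2   # V(M)
    cov_xy = gamma1 * var_x + gamma2 * cov_xm + gamma3 * cov_xw
    cov_my = gamma1 * cov_xm + gamma2 * var_m + gamma3 * cov_mw
    a1 = cov_xm / var_x
    b2 = (var_x * cov_my - cov_xm * cov_xy) / (var_x * var_m - cov_xm ** 2)
    return a1, b2

def indirect_closed_form(alpha1, beta1, beta2, gamma2, gamma3, var_w, var_1, var_2):
    """a1 * b2 written in terms of the true parameters (the bias formula)."""
    var_x = alpha1 ** 2 * var_w + var_1
    a1 = beta1 + alpha1 * beta2 * var_w / var_x
    b2 = gamma2 + beta2 * gamma3 * var_w * var_1 / (
        beta2 ** 2 * var_w * var_1 + var_2 * var_x)
    return a1 * b2

# Observational example with arbitrary illustrative values.
a1, b2 = ols_targets(F(1, 2), F(1), F(-1), F(1, 3), F(2), F(1, 2), F(25), F(1), F(25))
closed = indirect_closed_form(F(1, 2), F(1), F(-1), F(2), F(1, 2), F(25), F(1), F(25))

# Randomized X (alpha1 = 0) with a true indirect effect of zero (gamma2 = 0).
null_a1, null_b2 = ols_targets(F(0), F(1), F(-1), F(0), F(0), F(1, 2), F(25), F(1), F(25))
print(a1 * b2 == closed, null_a1 * null_b2)
```

In the second call the true indirect effect is zero, yet the estimand works out to -1/4, which illustrates why a confidence interval centered on it excludes zero more and more often as n grows.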

The impact on direct effect
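For the direct effect $b_1$, it can be similarly shown that
$$b_1 = \frac{\mathrm{Cov}(X, Y) - b_2\,\mathrm{Cov}(X, M)}{V(X)}.$$
In the true causal relationship, which starts from the precursor W, it can be shown that
$$b_1 = \gamma_1 + \frac{\gamma_3 \sigma_W^2 (\alpha_1 \sigma_2^2 - \beta_1 \beta_2 \sigma_1^2)}{\beta_2^2 \sigma_W^2 \sigma_1^2 + \sigma_2^2 (\alpha_1^2 \sigma_W^2 + \sigma_1^2)}. \tag{4}$$
Appendix 2 provides a detailed explanation of the derivation of Eq. (4).
There are four cases when $b_1 = \gamma_1$. The first case is when W does not affect Y (i.e., $\gamma_3 = 0$). The second case is when $\alpha_1 = \beta_1 = 0$, the third case is when $\alpha_1 = \beta_2 = 0$, and the fourth case is when $\alpha_1 \sigma_2^2 = \beta_1 \beta_2 \sigma_1^2$, which is nearly uninterpretable. Again, the randomization of X (i.e., $\alpha_1 = 0$) by itself does not guarantee $b_1 = \gamma_1$, because none of the four cases follows from $\alpha_1 = 0$ alone. In particular, when $\gamma_1 = 0$ is true, researchers will in general estimate $b_1 \neq 0$, which leads to an inflated Type I error rate. Furthermore, if $\alpha_1 = 0$ and all of $\gamma_1$, $\beta_1$, $\beta_2$, and $\gamma_3$ have the same sign, then depending on their magnitudes (and the magnitudes of $\sigma_W$ and $\sigma_2$), $b_1$ and $\gamma_1$ may have opposite signs, which leads to an awkward conclusion.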
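The direct-effect estimand and the special cases discussed above can be explored with a few lines of exact arithmetic. This is a sketch under the same assumptions as the true models (uncorrelated errors and precursor); `direct_estimand` is a hypothetical helper name and all parameter values are illustrative.

```python
from fractions import Fraction as F

def direct_estimand(alpha1, beta1, beta2, gamma1, gamma3, var_w, var_1, var_2):
    """b1 in terms of the true parameters when the precursor W is omitted."""
    var_x = alpha1 ** 2 * var_w + var_1
    numerator = gamma3 * var_w * (alpha1 * var_2 - beta1 * beta2 * var_1)
    denominator = beta2 ** 2 * var_w * var_1 + var_2 * var_x
    return gamma1 + numerator / denominator

# Sign flip under randomization: gamma1, beta1, beta2, gamma3 are all positive,
# yet the estimand of the direct effect is negative.
flipped = direct_estimand(F(0), F(1), F(1), F(1, 5), F(1), F(25), F(1), F(25))

# Under randomization the estimand does not depend on V(X) = var_1.
same_bias = direct_estimand(F(0), F(1), F(1), F(1, 5), F(1), F(25), F(4), F(25))

# Zero-bias cases: gamma3 = 0; alpha1 = beta1 = 0; alpha1*var_2 = beta1*beta2*var_1.
case_1 = direct_estimand(F(1, 2), F(1), F(1), F(1, 5), F(0), F(25), F(1), F(25))
case_2 = direct_estimand(F(0), F(0), F(1), F(1, 5), F(1), F(25), F(1), F(25))
case_4 = direct_estimand(F(1, 25), F(1), F(1), F(1, 5), F(1), F(25), F(1), F(25))
print(flipped, same_bias, case_1, case_2, case_4)
```

Here the true direct effect is 1/5 while the estimand in the first call is 1/5 - 1/2 = -3/10, an instance of the opposite-sign situation described in the text.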

Simulation designs
To demonstrate the danger of mediation analysis due to an omitted precursor W even with manipulation of X (i.e., $\alpha_1 = 0$), a simulation study was conducted. For all simulation scenarios, we fixed $\gamma_1 = 0$, $\gamma_2 = 0$, $\sigma_W = 5$, $\sigma_1 = 1$, $\sigma_2 = 5$, and $\sigma_3 = 0.5$ and varied $\beta_1 = 0, 0.5, 1$, $\beta_2 = -1, -0.5, 0$, and $\gamma_3 = -0.5, 0, 0.5$ to create twenty-seven scenarios such that $\gamma_1 = 0$ (no direct effect) and $\beta_1 \gamma_2 = 0$ (no indirect effect). For each scenario, we considered sample sizes $n = 50, 100, 500, 1000$, and we estimated (1) the probability of concluding the presence of an indirect effect (i.e., $a_1 b_2 \neq 0$) and (2) the probability of concluding the presence of a direct effect (i.e., $b_1 \neq 0$) based on bias-adjusted 95% confidence intervals (CIs) with 2000 bootstrap samples. The nonparametric bootstrap method is known to be robust for mediation analysis, so it is recommended in practice (Hayes 2009; Preacher and Hayes 2008; Hair et al. 2014). Each scenario was replicated 1000 times to estimate the probability of concluding $a_1 b_2 \neq 0$ and the probability of concluding $b_1 \neq 0$. In other words, out of 1000 replications per scenario, we calculated the proportion of times that a 95% CI for $a_1 b_2$ excluded zero and the proportion of times that a 95% CI for $b_1$ excluded zero.
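One replication of such a design might be sketched as follows. This is not the authors' simulation code. It makes several simplifying assumptions that should be read as such: X and all errors are drawn from normal distributions, intercepts are zero, plain percentile intervals stand in for the bias-adjusted intervals used in the study, and only 500 bootstrap samples are drawn. All function names are hypothetical.

```python
import random

def fit_mediation(x, m, y):
    """OLS estimates (a1, b1, b2) for M ~ X and Y ~ X + M via sample moments."""
    n = len(x)
    mx, mm, my = sum(x) / n, sum(m) / n, sum(y) / n
    sxx = sum((u - mx) ** 2 for u in x)
    smm = sum((v - mm) ** 2 for v in m)
    sxm = sum((u - mx) * (v - mm) for u, v in zip(x, m))
    sxy = sum((u - mx) * (t - my) for u, t in zip(x, y))
    smy = sum((v - mm) * (t - my) for v, t in zip(m, y))
    a1 = sxm / sxx
    b2 = (sxx * smy - sxm * sxy) / (sxx * smm - sxm ** 2)
    b1 = (sxy - b2 * sxm) / sxx
    return a1, b1, b2

def simulate(n, beta1, beta2, gamma3, rng,
             sigma_w=5.0, sigma_1=1.0, sigma_2=5.0, sigma_3=0.5):
    """One data set with alpha1 = gamma1 = gamma2 = 0; W is generated, then dropped."""
    x, m, y = [], [], []
    for _ in range(n):
        w = rng.gauss(0.0, sigma_w)
        xi = rng.gauss(0.0, sigma_1)  # randomized X, independent of W
        x.append(xi)
        m.append(beta1 * xi + beta2 * w + rng.gauss(0.0, sigma_2))
        y.append(gamma3 * w + rng.gauss(0.0, sigma_3))
    return x, m, y

def percentile_ci(values, level=0.95):
    """Simple percentile interval from a list of bootstrap estimates."""
    s = sorted(values)
    k = len(s)
    return s[int((1 - level) / 2 * k)], s[min(k - 1, int((1 + level) / 2 * k))]

def bootstrap_cis(x, m, y, n_boot, rng):
    """Percentile bootstrap CIs for the indirect (a1*b2) and direct (b1) effects."""
    n = len(x)
    indirect, direct = [], []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        a1, b1, b2 = fit_mediation([x[i] for i in idx],
                                   [m[i] for i in idx],
                                   [y[i] for i in idx])
        indirect.append(a1 * b2)
        direct.append(b1)
    return percentile_ci(indirect), percentile_ci(direct)

rng = random.Random(2021)
x, m, y = simulate(n=200, beta1=1.0, beta2=-1.0, gamma3=0.5, rng=rng)
ci_indirect, ci_direct = bootstrap_cis(x, m, y, n_boot=500, rng=rng)
print(ci_indirect, ci_direct)
```

Repeating the last three lines many times and recording how often each interval excludes zero would give a rough analogue of the rejection proportions described above.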

Simulation results
The simulation results are summarized in Table 1. In seven scenarios (10, 12, 15, 19, 21, 22, and 24), the probability of concluding $b_1 \neq 0$ and the probability of concluding $a_1 b_2 \neq 0$ were substantially greater than 0.05 when $n = 50$, and the probabilities increased as n increased. The bias-adjusted method accurately estimated $b_1$ and $a_1 b_2$ as given by Eqs. (4) and (3) (see Table 2), so the chance of rejecting $b_1 = 0$ and $a_1 b_2 = 0$ grew as n increased (i.e., the CIs became shorter around the values of $b_1$ and $a_1 b_2$ implied by the formulas). For the other twenty scenarios, where $b_1 = 0$ and $a_1 b_2 = 0$ according to Eqs. (4) and (3), the probabilities of concluding $b_1 \neq 0$ and $a_1 b_2 \neq 0$ were close to 0.05. In some of the twenty scenarios, the probabilities slightly exceeded 0.05 when $n = 50$, and these results imply that the bootstrap method may require a sample size larger than $n = 50$ in some cases in order to properly capture the uncertainty in the interval estimation.
Table 1 note: In the bold scenarios (10, 12, 15, 19, 21, 22, and 24), the probability of concluding the presence of direct effect and indirect effect increases as the sample size increases.

Example
Guber (1999) discussed the impact of omitting a (confounding) variable in the association between public school expenditures and academic performance (measured by the average SAT score). In the data, the unit of observation is a state (not a student), and the data consist of all 50 states in the United States. In this section, to demonstrate the role of a potential precursor variable, we turn our focus to the association between the state average annual salary of teachers in public schools (denoted X; in thousands of US dollars) and the state average SAT score (denoted Y). If we consider the simple linear regression $Y = c_0 + c_1 X + \epsilon$, the estimated slope is $\hat{c}_1 = -5.5396$. This result suggests that a higher average annual salary is associated with a lower average SAT performance ($p = 0.001$).
An important (potential mediating) variable may be the percentage (%) of students taking the SAT in each state (denoted by M). Suppose we model $M = a_0 + a_1 X + \epsilon_1$ and $Y = b_0 + b_1 X + b_2 M + \epsilon_2$ as shown in Eq. (1). The estimated regression parameters are $\hat{a}_1 = 2.7783$, $\hat{b}_1 = 2.1804$, and $\hat{b}_2 = -2.7787$. Conditioning on the potential mediating variable (% SAT takers), it appears that a higher average annual salary is associated with a higher average SAT performance ($p = 0.039$). The opposite signs of $\hat{c}_1 = -5.54$ and $\hat{b}_1 = 2.18$ are due to the fact that $\hat{c}_1 = \hat{b}_1 + \hat{a}_1 \hat{b}_2$, where $\hat{a}_1 > 0$ and $\hat{b}_2 < 0$. A larger percentage of SAT takers tends to lower the state average SAT score (Guber 1999).
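The decomposition of the unadjusted slope into a direct part and an indirect part can be checked with the estimates reported above. The short sketch below only reuses those four reported numbers; the variable names are ours.

```python
a1_hat = 2.7783    # slope of M (% SAT takers) on X (teacher salary)
b1_hat = 2.1804    # slope of Y (SAT score) on X, adjusting for M
b2_hat = -2.7787   # slope of Y on M, adjusting for X
c1_hat = -5.5396   # slope of Y on X alone

indirect = a1_hat * b2_hat    # estimated indirect association
total = b1_hat + indirect     # should reproduce c1_hat up to rounding
print(round(indirect, 4), round(total, 4))
```

The sum agrees with the reported unadjusted slope up to the rounding of the published coefficients: the negative indirect part outweighs the positive adjusted slope.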
Given the adjusted estimate $\hat{b}_1 = 2.18$ with the small p value ($p = 0.039$), can we conclude that the state average annual salary and the state average SAT performance are positively associated? Let us consider a potential precursor variable, the state expenditure per student in public schools (denoted W; in thousands of US dollars), as W may affect X, M, and Y. Using the three regression models presented in Eq. (2), we can estimate the regression parameters as shown at the bottom of Fig. 5 with $\hat{\sigma}_W^2 = 1.8201$, $\hat{\sigma}_1^2 = 8.4214$, and $\hat{\sigma}_2^2 = 425.7959$. Note that $\hat{\gamma}_1 = -0.31$ with $p = 0.853$ suggests no strong evidence for the positive association. The opposite signs of $\hat{b}_1 = 2.18$ and $\hat{\gamma}_1 = -0.31$ can be explained by Eq. (4) with the estimated regression parameters,
$$\hat{b}_1 = \hat{\gamma}_1 + \frac{\hat{\gamma}_3 \hat{\sigma}_W^2 (\hat{\alpha}_1 \hat{\sigma}_2^2 - \hat{\beta}_1 \hat{\beta}_2 \hat{\sigma}_1^2)}{\hat{\beta}_2^2 \hat{\sigma}_W^2 \hat{\sigma}_1^2 + \hat{\sigma}_2^2 (\hat{\alpha}_1^2 \hat{\sigma}_W^2 + \hat{\sigma}_1^2)}.$$
The relatively large positive estimates $\hat{\gamma}_3 = 13.33$, $\hat{\alpha}_1 = 3.79$, and $\hat{\sigma}_2^2 = 425.8$ can turn the negative (or nearly zero) $\hat{\gamma}_1 = -0.31$ into the positive $\hat{b}_1 = 2.18$ (with the small p value) when the precursor variable (state expenditure), which might affect the whole mechanism among the state average annual salary of teachers, the % of students taking the SAT, and the state SAT performance, is omitted. The estimated indirect association $\hat{\beta}_1 \hat{\gamma}_2 = -5.32$ further suggests that a higher state average salary of teachers does not help improve the average SAT performance.

Discussion
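The primary focus of the simulation study was an inflated Type I error rate (i.e., a higher chance of falsely claiming a direct and/or indirect effect) in a mediation analysis due to an unobserved precursor variable. Though the example in Sect. 5 was an observational study, the message was similar. Even for an experimental study, the simulation results alert researchers to the inflated Type I error rate (worse when n is larger), and the expressions in Eqs. (3) and (4) explain why the bias exists even after randomization of X (i.e., forcing $\alpha_1 = 0$) in the presence of a mediator (Fig. 4). The uncomfortable fact is that an experimental design, which can change $V(X) = \alpha_1^2 \sigma_W^2 + \sigma_1^2 = \sigma_1^2$, cannot fix the bias in particular null cases. For instance, if $\beta_1 \neq 0$ and $\gamma_2 = 0$ (i.e., zero indirect effect), according to Eq. (3) with $\alpha_1 = 0$,
$$a_1 b_2 = \frac{\beta_1 \beta_2 \gamma_3 \sigma_W^2}{\beta_2^2 \sigma_W^2 + \sigma_2^2},$$
which is independent of $V(X) = \sigma_1^2$. For another instance, if $\gamma_1 = 0$ (i.e., zero direct effect), according to Eq. (4) with $\alpha_1 = 0$,
$$b_1 = -\frac{\beta_1 \beta_2 \gamma_3 \sigma_W^2}{\beta_2^2 \sigma_W^2 + \sigma_2^2},$$
which is again independent of $V(X) = \sigma_1^2$. In this article, we focused on one precursor variable. In practice, there can be two or more precursor variables. The take-home messages for reducing bias in the estimation of $\beta_1 \gamma_2$ (indirect effect) and $\gamma_1$ (direct effect) are clear. First, randomize X (i.e., $\alpha_1 = 0$) if possible. Second, during data collection, researchers are advised to record variables which are potentially related to M and Y and to adjust for them in the regression analysis. A large sample size can tolerate a mild degree of over-fitting due to many W's adjusted for in the model. The adjustment will help researchers estimate $\beta_1 \gamma_2$ and $\gamma_1$ with a small bias and perform hypothesis testing with a reduced inflation of the Type I error rate.

Appendix 1: Bias in the estimation of indirect effect
Under the assumed model $M = a_0 + a_1 X + \epsilon_1$, the covariance between X and M is given by
$$\mathrm{Cov}(X, M) = \mathrm{Cov}(X, a_1 X) = a_1 V(X),$$
so
$$a_1 = \frac{\mathrm{Cov}(X, M)}{V(X)}. \tag{5}$$
Similarly, the second assumed model can be written as $Y - b_2 M = b_0 + b_1 X + \epsilon_2$, so $\mathrm{Cov}(X, Y - b_2 M) = b_1 V(X)$ and
$$b_1 = \frac{\mathrm{Cov}(X, Y) - b_2\,\mathrm{Cov}(X, M)}{V(X)}. \tag{6}$$
Because $\epsilon_2$ is also uncorrelated with M, we have $\mathrm{Cov}(M, Y) = b_1 \mathrm{Cov}(X, M) + b_2 V(M)$. Therefore, using Eq. (6), this relationship can be written as
$$\mathrm{Cov}(M, Y) = \frac{\mathrm{Cov}(X, Y) - b_2\,\mathrm{Cov}(X, M)}{V(X)}\,\mathrm{Cov}(X, M) + b_2 V(M).$$
By solving for $b_2$,
$$b_2 = \frac{V(X)\,\mathrm{Cov}(M, Y) - \mathrm{Cov}(X, M)\,\mathrm{Cov}(X, Y)}{V(X)\,V(M) - \mathrm{Cov}(X, M)^2}. \tag{7}$$
Now consider the three true models:
$$X = \alpha_0 + \alpha_1 W + \epsilon_1^*, \qquad M = \beta_0 + \beta_1 X + \beta_2 W + \epsilon_2^*, \qquad Y = \gamma_0 + \gamma_1 X + \gamma_2 M + \gamma_3 W + \epsilon_3^*.$$
From the true relationships among W, X, M, and Y, we have
$$V(X) = \alpha_1^2 \sigma_W^2 + \sigma_1^2, \qquad \mathrm{Cov}(X, W) = \alpha_1 \sigma_W^2,$$
and
$$\mathrm{Cov}(X, M) = \mathrm{Cov}(X, \beta_1 X + \beta_2 W) = \beta_1 V(X) + \beta_2\,\mathrm{Cov}(X, W).$$
Therefore, Eq. (5) can be expressed in terms of the true model parameters as
$$a_1 = \beta_1 + \frac{\alpha_1 \beta_2 \sigma_W^2}{\alpha_1^2 \sigma_W^2 + \sigma_1^2}.$$
To express $b_2$ in Eq. (7) in terms of the true model parameters, we also need
$$\mathrm{Cov}(M, W) = \alpha_1 \beta_1 \sigma_W^2 + \beta_2 \sigma_W^2, \qquad V(M) = \beta_1\,\mathrm{Cov}(X, M) + \beta_2\,\mathrm{Cov}(M, W) + \sigma_2^2,$$
$$\mathrm{Cov}(X, Y) = \gamma_1 V(X) + \gamma_2\,\mathrm{Cov}(X, M) + \gamma_3\,\mathrm{Cov}(X, W), \qquad \mathrm{Cov}(M, Y) = \gamma_1\,\mathrm{Cov}(X, M) + \gamma_2 V(M) + \gamma_3\,\mathrm{Cov}(M, W).$$
After some algebraic work, it can be shown that the denominator of $b_2$ in Eq. (7) can be simplified as
$$V(X)\,V(M) - \mathrm{Cov}(X, M)^2 = \beta_2^2 \sigma_W^2 \sigma_1^2 + \sigma_2^2 (\alpha_1^2 \sigma_W^2 + \sigma_1^2),$$
and the numerator of $b_2$ in Eq. (7) can be expressed as
$$V(X)\,\mathrm{Cov}(M, Y) - \mathrm{Cov}(X, M)\,\mathrm{Cov}(X, Y) = \gamma_2 \left[ \beta_2^2 \sigma_W^2 \sigma_1^2 + \sigma_2^2 (\alpha_1^2 \sigma_W^2 + \sigma_1^2) \right] + \beta_2 \gamma_3 \sigma_W^2 \sigma_1^2.$$
To this end, we can express $b_2$ in Eq. (7) as the ratio of these two quantities, which simplifies to
$$b_2 = \gamma_2 + \frac{\beta_2 \gamma_3 \sigma_W^2 \sigma_1^2}{\beta_2^2 \sigma_W^2 \sigma_1^2 + \sigma_2^2 (\alpha_1^2 \sigma_W^2 + \sigma_1^2)}.$$
Therefore, due to an unobserved precursor variable W, researchers would estimate $a_1 b_2$, which is equal to
$$a_1 b_2 = \left( \beta_1 + \frac{\alpha_1 \beta_2 \sigma_W^2}{\alpha_1^2 \sigma_W^2 + \sigma_1^2} \right) \left( \gamma_2 + \frac{\beta_2 \gamma_3 \sigma_W^2 \sigma_1^2}{\beta_2^2 \sigma_W^2 \sigma_1^2 + \sigma_2^2 (\alpha_1^2 \sigma_W^2 + \sigma_1^2)} \right),$$
which is not equal to $\beta_1 \gamma_2$ in general.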

Appendix 2: Bias in the estimation of direct effect
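From Eqs. (6) and (7), the researcher's estimand for the direct effect is
$$b_1 = \frac{\mathrm{Cov}(X, Y) - b_2\,\mathrm{Cov}(X, M)}{V(X)}.$$
Under the true models,
$$\mathrm{Cov}(X, Y) = \mathrm{Cov}(X, \gamma_1 X + \gamma_2 M + \gamma_3 W) = \gamma_1 V(X) + \gamma_2\,\mathrm{Cov}(X, M) + \gamma_3\,\mathrm{Cov}(X, W),$$
so
$$b_1 = \gamma_1 + \frac{(\gamma_2 - b_2)\,\mathrm{Cov}(X, M) + \gamma_3\,\mathrm{Cov}(X, W)}{V(X)}.$$
Substituting $\mathrm{Cov}(X, W) = \alpha_1 \sigma_W^2$, $\mathrm{Cov}(X, M) = \beta_1 V(X) + \alpha_1 \beta_2 \sigma_W^2$, $V(X) = \alpha_1^2 \sigma_W^2 + \sigma_1^2$, and
$$b_2 - \gamma_2 = \frac{\beta_2 \gamma_3 \sigma_W^2 \sigma_1^2}{\beta_2^2 \sigma_W^2 \sigma_1^2 + \sigma_2^2 V(X)},$$
and simplifying, we obtain
$$b_1 = \gamma_1 + \frac{\gamma_3 \sigma_W^2 (\alpha_1 \sigma_2^2 - \beta_1 \beta_2 \sigma_1^2)}{\beta_2^2 \sigma_W^2 \sigma_1^2 + \sigma_2^2 (\alpha_1^2 \sigma_W^2 + \sigma_1^2)},$$
which is not equal to $\gamma_1$ in general.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.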