This article introduces methods of direction dependence and presents a unified statistical framework for discerning the causal direction of effects in linear models using observational data. Existing regression-type methods allow researchers to quantify the magnitude of hypothesized effects but are of limited use for establishing the direction of effects between variables—that is, whether x → y or y → x correctly describes the causal flow between two variables x and y. The statistical framework proposed in this article allows researchers to draw conclusions about the direction of effects. In the present work, we focus on observational (nonexperimental) data settings because this type of data is the primary application domain for the presented principles of direction dependence. The issue of effect directionality in the context of experimental studies (e.g., when decomposing total effects into direct and indirect effect components) will be taken up in the Discussion section.

Establishing cause–effect relations between variables is a central aim of many empirical studies in the social sciences. The direction of influence between variables is a key element of any causal theory that purports to explain the data-generating mechanism (Bollen, 1989). Questions concerning direction of effect naturally arise in observational studies. For example, it may not be entirely clear whether tobacco consumption causes depression and anxiety or whether people with symptoms of depression and anxiety are more likely to engage in health damaging behavior (Munafò & Araya, 2010; Taylor et al., 2014); whether violent video games expose players to aggressive behavior or whether aggressive people are simply more attracted to violent video games (Gentile, Lynch, Linder, & Walsh, 2004); or whether lead exposure contributes to the development of ADHD or whether children with ADHD symptoms are unable to stay focused enough to avoid lead-tainted objects (Nigg et al., 2008).

Although useful for assessing the magnitude and statistical significance of an assumed causal effect, standard regression-based methods are of limited use when addressing directionality issues per se. When an association between x and y exists, at least three possible explanations can be entertained: (1) x causes y (x → y), (2) y causes x (y → x), and (3) neither relation exists, due to a spurious association of both variables with a third variable (sometimes termed a “confounder”; see Fig. 1).Footnote 1 The Pearson product-moment correlation and ordinary least squares (OLS) estimates do not adjudicate regarding the model that best represents the data-generating mechanism (von Eye & DeShon, 2012). Researchers who use regression models must therefore make their decision as to the direction of effect on the basis of a priori theory and substantive arguments. However, statistical tools are often desirable to empirically demonstrate the explanatory superiority of one theory over plausible alternatives. The present contribution introduces such a tool—direction dependence analysis (DDA; Wiedermann & von Eye, 2015a). Whereas standard regression models use only estimates of first- and second-order moments (i.e., means, variances, and covariances) to assess the magnitude and statistical significance of regression weights, DDA uses estimates of higher-order moments (i.e., skewness and kurtosis) to assess the relative plausibility of directional alternatives.

Fig. 1

Conceptual diagrams of three possible models to explain the association between two variables x and y. Squares are used for observed variables and circles are used for latent variables (with error terms denoted with e). Model a = directional effect of x on y; Model b = directional effect of y on x; Model c = confounded x–y association due to an unmeasured third variable u. The two dashed arrows in Model c indicate that either x → y or y → x can be biased by confounding

Methods of causal discovery have experienced rapid development within the last decades and various causal search algorithms have been proposed (see Spirtes & Zhang, 2016, for a recent overview). These search algorithms are designed to learn plausible causal structures from multivariate data. Examples of such algorithms include the PC algorithm (Spirtes, Glymour, & Scheines, 2000), greedy equivalence search (Chickering, 2002), cyclic causal discovery (Richardson & Spirtes, 1999), fast causal inference (Zhang, 2008), and linear non-Gaussian acyclic models (cf. Shimizu, 2016; Shimizu, Hoyer, Hyvärinen, & Kerminen, 2006) that are either designed to discover (Markov) equivalence classes of directed acyclic graphs (DAGs; i.e., a small subset of candidate models that have the same support by the data in terms of model fit; cf. Verma & Pearl, 1991) or uncover DAG structures beyond equivalence classes. All these algorithms constitute important exploratory tools for causal learning and are, thus, ideally suited to generate new substantive hypotheses concerning the causal nature of constructs.

DDA, in contrast, is concerned with a research scenario that is confirmatory in nature—that is, situations in which a substantive theory about the causal relation exists and the researcher wishes to know whether the causal direction assumed by the model is plausible relative to the alternative scenarios a reasonable skeptic might propose. The primary goal is to probe this causal theory against alternatives while adjusting for potential background variables known to be explanatory in nature. Thus, instead of extracting plausible DAG structures (or classes of equivalent DAGs) for a given dataset, one is interested in testing a specific model (e.g., lead exposure → ADHD) against a plausible alternative (ADHD → lead exposure). The present article is designed to introduce principles of DDA to quantitative researchers. Although previous work (e.g., Dodge & Rousson, 2000, 2001; Dodge & Yadegari, 2010; Muddapur, 2003; Sungur, 2005; Wiedermann & von Eye, 2015a, 2015b, 2015c) has focused on direction dependence methods to choose between the two models x → y and y → x (Figs. 1a and b), direction dependence in the presence of confounders has received considerably less attention. To fill this void, we present extensions of DDA to scenarios in which confounders are present and incorporate these new insights into the existing direction dependence principle. As a result, we propose a unified framework that allows one to identify each explanatory model given in Fig. 1.

The article is structured as follows: First, we define statistical models suitable for DDA and introduce model assumptions. We then introduce elements of DDA and summarize new results that describe the behavior of DDA tests when unmeasured confounders are present. Next, three SPSS macros are introduced that make DDA accessible to applied researchers and a data example is given to illustrate their application. The article closes with a discussion of conceptual and empirical requirements of DDA, potential data-analytic pitfalls, and potential extensions of the DDA methodology. In addition, sample size recommendations based on previous simulation studies are provided.

The direction dependence principle

Direction dependence can be defined as the asymmetry between cause and effect. The model x → y implies that changing x (the cause) changes y (the effect), but changing y does not lead to change in x (see also Pearl, 2009; Peters, Janzing, & Schölkopf, 2017). Conversely, when changing y changes x but, at the same time, changing x does not change y, the model y → x describes the causal relation. The limitations of conventional association-based approaches for uncovering the asymmetry of cause and effect can be explained by the fact that these methods only consider variation up to second-order moments and, thus, rely on correlation in its symmetric form—that is, cor(x, y) = cor(y, x). The key element of DDA is to consider variable information beyond second-order moments (specifically, skewness and kurtosis), because asymmetry properties of the Pearson correlation and the related linear model appear under nonnormality. These asymmetry properties are of importance when x and y are not exchangeable in their roles as explanatory and response variables without leading to systematic model violations. DDA, thus, requires and makes use of nonnormality of variables to gain deeper insight into the causal mechanism. DDA consists of three core components: (1) distributional properties of observed variables, (2) distributional properties of error terms of competing models, and (3) independence properties of error terms and predictors in the competing models. On the basis of new results regarding direction dependence in the presence of confounders, we show that unique patterns of DDA component outcomes exist for each of the three models shown in Fig. 1. These outcome patterns enable researchers to select between competing models. In the following paragraphs, we formally define the statistical models considered. We then introduce the DDA components separately for confounder-free “true” models and for models in which confounders are present.
In addition, statistical inference compatible with direction dependence is discussed. To simplify the presentation, we assume that x → y corresponds to the “true” model and y → x represents the directionally mis-specified model.

Model definitions

We start the introduction to DDA by defining the statistical models considered. Although statistical models can be used either for the purposes of explanation or predictionFootnote 2 (Geisser, 1993), DDA is designed for the task of validating explanatory models—that is, for testing the causal hypothesis assumed under a given theory. Assume that a construct \( \mathcal{X} \) (e.g., lead exposure) causes construct \( \mathcal{Y} \) (e.g., ADHD symptomatology) through mechanism \( \mathcal{F} \)—that is, \( \mathcal{Y}=\mathcal{F}\left(\mathcal{X}\right) \). Furthermore, let x and y be operationalizations of \( \mathcal{X} \) and \( \mathcal{Y} \) (e.g., blood lead concentration and number of DSM-IV hyperactive–impulsive symptoms), and define f as the statistical model (e.g., the linear model) that approximates \( \mathcal{F} \)—that is, y = f(x). The direction dependence framework provides a set of statistical tools for evaluating the directionality assumption of y = f(x) implied by the causal theory \( \mathcal{X}\to \mathcal{Y} \).

DDA assumes that \( \mathcal{X} \) is a nonnormally distributed construct whose cause lies outside the causal mechanism \( \mathcal{X}\to \mathcal{Y} \) and that the two observed variables, x and y, are continuous. Furthermore, we assume that the data-generating mechanism is recursive in nature (i.e., the causal flow is unidirectional, without feedback loops) and can be approximated by the linear model—that is, given that \( \mathcal{X}\to \mathcal{Y} \) constitutes the “true” mechanism, y = f(x) can be written as (cf. Fig. 1a; for simplicity, but without loss of generality, we assume that the intercept is fixed at zero)

$$ y={b}_{yx}x+{e}_{yx} $$
(1)

The slope byx denotes the change in the fitted value of y for a one-unit increase in x and represents the causal effect of x → y. Estimates of the causal effect are usually obtained using OLS or, in structural equation models (SEMs), maximum likelihood estimation. Nonnormality of the “true” predictor, quantified as nonzero skewness \( {\gamma}_x=E\left[{\left(x-E\left[x\right]\right)}^3\right]/{\sigma}_x^3 \) and/or nonzero excess kurtosis \( {\kappa}_x=E\left[{\left(x-E\left[x\right]\right)}^4\right]/{\sigma}_x^4-3 \) (with E being the expected value operator), is assumed to reflect an inherent distributional characteristic of \( \mathcal{X} \) (as opposed to nonnormality due to boundaries of the operationalized x). The error term eyx is assumed to be normally distributed (with zero mean and variance \( {\sigma}_{e_{yx}}^2 \)), serially independent, and independent of x.
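The sample analogues of these higher-moment quantities are straightforward to compute. The following minimal Python sketch (our own illustration; the DDA software discussed later is distributed as SPSS macros) estimates skewness and excess kurtosis according to the definitions above, using a chi-square predictor whose population values (γ = 1, κ = 1.5 for eight degrees of freedom) match the simulation settings used later in the article:

```python
import numpy as np

def skewness(v):
    # gamma = E[(v - E[v])^3] / sigma^3
    d = v - v.mean()
    return (d ** 3).mean() / v.std() ** 3

def excess_kurtosis(v):
    # kappa = E[(v - E[v])^4] / sigma^4 - 3
    d = v - v.mean()
    return (d ** 4).mean() / v.std() ** 4 - 3.0

rng = np.random.default_rng(1)
x = rng.chisquare(df=8, size=200_000)  # population values: gamma = 1, kappa = 1.5
print(round(skewness(x), 2), round(excess_kurtosis(x), 2))
```

With a sample this large, the estimates fall close to the population values of 1 and 1.5.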

When \( \mathcal{Y}\to \mathcal{X} \) instead of \( \mathcal{X}\to \mathcal{Y} \) describes the causal mechanism (Fig. 1b), the corresponding linear model is

$$ x={b}_{xy}y+{e}_{xy} $$
(2)

where bxy captures the causal effect of y on x. Here, y represents a nonnormal external influence and exy denotes a normally distributed error term with zero mean and variance \( {\sigma}_{e_{xy}}^2 \), which is serially independent and independent of y.

Finally, as a third possible explanation, we consider an unknown confounding construct \( \mathcal{U} \) (and its continuously operationalized variable u) that has a causal effect on both \( \mathcal{X} \) and \( \mathcal{Y} \). A conceptual diagram of a confounded x–y relation is given in Fig. 1c. The dashed arrows in Fig. 1c indicate that either the model x → y in Eq. 1 or the model y → x in Eq. 2 can be biased by confounding. For the purpose of introducing the confounder model, we focus on the model x → y. In this case, the model in Eq. 1 changes to

$$ {\displaystyle \begin{array}{l}x={b}_{xu}u+{e}_{xu}\\ {}y={b}_{yx}x+{b}_{yu}u+{e}_{yu}\end{array}} $$
(3)

where bxu and byu denote the effects of the confounder on x and y, respectively. The considered setup also includes the case of “pure” confounding as a special case (i.e., the observed association between x and y is entirely attributable to the existence of u) when byx = 0. In Model 3, u is assumed to be a nonnormally distributed external influence, and exu and eyu are normally distributed error terms (exhibiting zero means and variances \( {\sigma}_{e_{xu}}^2 \) and \( {\sigma}_{e_{yu}}^2 \)) that are independent of u and of each other.

It is common to include covariates (e.g., background or baseline measures) in statistical models to increase the precision of parameter estimates and statistical power. In other words, covariates are independent variables that are included in a target model to control for their influences on the putative response. Conversely, failing to include important covariates can lead to confounded parameter estimates when the covariates are (in addition to their relation to the response) correlated with other predictors. However, several authors have cautioned against the careless use of covariates, because conditioning on covariates can also increase the bias of causal estimates (Pearl, 2009; Spirtes, Richardson, Meek, Scheines, & Glymour, 1998). Similar considerations hold for statistical models in the context of DDA. To be eligible for DDA, covariates must be known to be on the explanatory side of the statistical model. In addition, one must ensure that a recursive causal ordering of the covariates themselves is theoretically possible and that all covariates can be expressed as linear combinations of mutually independent external influences. We can formally express these prerequisites for a given set of covariates zj (j = 1, ... , J) as \( {z}_j={\sum}_{k(i)\le k(j)}{a}_{ji}{\eta}_i \), with k(i) ≤ k(j) describing the causal order of the covariates (i.e., zi precedes zj), aji denoting the corresponding total effect, and ηi denoting the external influence associated with zi. When no other covariate precedes zj, one obtains zj = ηj with ajj = 1. For example, suppose that two covariates (z1 and z2) are known to influence y, that z1 precedes z2, and that no other covariate precedes z1 (i.e., z1 = η1). In this case, one obtains z2 = a21η1 + η2 with z1 = η1, which implies that z2 can be expressed as a linear combination of (weighted) external influences. Consider again the example of ADHD and blood lead exposure.
Two factors that are known to affect ADHD symptomatology are prenatal maternal emotional stress (Harris & Seckl, 2011) and the cultural context of the child (Miller, Nigg, & Miller, 2009; see also Nigg, 2012). Arguments of temporality or the logical order of effects can be used to evaluate the eligibility of covariates for DDA. Both prenatal maternal stress and cultural context are located earlier in time than the child’s blood lead level and ADHD symptomatology under study, which justifies their use as background variables. Furthermore, in principle, we are also able to establish a causal order of the covariates themselves—that is, cultural context may be conceived as a background variable directly or indirectly contributing to maternal stress level. In other words, both variables are unlikely to render the target model cyclic, which makes them eligible as covariates in DDA. In general, covariates can be either continuous or categorical. For categorical covariates, however, we need to assume that these variables constitute external influences themselves—that is, we exclude cases in which categorical variables serve as outcomes of other independent variables in the model (detailed explanations are given below). Although this assumption is stricter than in the continuous case, it still allows for multiple-group scenarios in which the magnitude of the causal effect of the predictor on the outcome can vary across groups. When categorical covariates are present, a two-stage approach to model estimation is preferable. That is, in a first step, the effect of the categorical covariates is partialled out of the putative predictor (e.g., x), the putative outcome (y), and all continuous covariates; the regression residuals extracted from these auxiliary models are subsequently used as “purified” measures (an example is given below). According to the Frisch–Waugh–Lovell theorem (cf. Frisch & Waugh, 1933; Lovell, 1963; sometimes called the regression anatomy formula: Angrist & Pischke, 2009), regressing the “purified” outcome on the “purified” independent variables in the second step yields the same model parameters as the full multiple regression model including the categorical covariates.
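The two-stage logic can be illustrated numerically. In the Python sketch below (variable names and effect sizes are our own illustrative choices), x and y are residualized on a covariate z, and the second-stage slope reproduces the coefficient of x from the full multiple regression, as the Frisch–Waugh–Lovell theorem guarantees:

```python
import numpy as np

def ols(predictors, response):
    """OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(response))] + list(predictors))
    return np.linalg.lstsq(X, response, rcond=None)[0]

def residualize(v, z):
    """Partial the covariate z out of v."""
    b0, b1 = ols([z], v)
    return v - b0 - b1 * z

rng = np.random.default_rng(2)
n = 5_000
z = rng.normal(size=n)                       # covariate (background variable)
x = 0.5 * z + rng.chisquare(8, n)            # putative predictor, related to z
y = 0.7 * x + 0.3 * z + rng.normal(size=n)   # putative outcome

# Full model: y regressed on x and z; keep the coefficient of x
b_full = ols([x, z], y)[1]

# Two-stage: partial z out of x and y, then regress residual on residual
b_two_stage = ols([residualize(x, z)], residualize(y, z))[1]
print(np.isclose(b_full, b_two_stage))  # the two estimates coincide
```

The equality is exact up to floating-point error, so the printed value is True.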

DDA Component I: Distributional properties of observed variables

Absence of confounders

Asymmetry properties in terms of observed-variable distributions emerge from the additive nature of the linear model—that is, a response is defined as the sum of a (nonnormally distributed) explanatory part and a (normally distributed) error component. Intuitively, distributional differences between predictor and response emerge because the distribution of the response is the convolution of a nonnormal and a normal distribution. In other words, adding a normal error term to a nonnormal predictor will necessarily cause the response to be more normally distributed than the predictor. Dodge and Rousson (2000, 2001), as well as Dodge and Yadegari (2010), presented algebraic proofs of this relation and showed that the Pearson correlation ρxy has asymmetric properties when higher moments of x and y are considered. Specifically, the cube of ρxy can be expressed as the ratio of the skewness of the response to that of the predictor,Footnote 3

$$ {\rho}_{xy}^3=\frac{\gamma_y}{\gamma_x} $$
(4)

(as long as γx ≠ 0) and the fourth power of ρxy can be written as the ratio of excess kurtosis of response and predictor,

$$ {\rho}_{xy}^4=\frac{\kappa_y}{\kappa_x} $$
(5)

(as long as κx ≠ 0). Because ρxy is bounded on the interval [–1, 1], the absolute values of the skewness and excess kurtosis of the response y will always be smaller than the absolute skewness and excess-kurtosis values of the predictor x. In other words, when Model 1 approximates the data-generating mechanism, y will be closer to the normal distribution than x. This asymmetry property opens the door to evaluating the directional plausibility of a linear model by comparing the skewness and excess kurtosis of a tentative response and a tentative predictor. Note that Eqs. 4 and 5, as proposed by Dodge and Rousson (2000, 2001), hold for the bivariate case. However, a two-step regression approach can be used to adjust for the covariates defined above. First, two regression models are estimated in which x and y serve as responses and the covariates zj are used as independent variables—that is, \( y={\sum}_{j=1}^J{b}_{yz_j}{z}_j+{e}_{yz_j} \) and \( x={\sum}_{j=1}^J{b}_{xz_j}{z}_j+{e}_{xz_j} \). Next, the estimated regression residuals of the two models, \( {e}_{yz_j} \) and \( {e}_{xz_j} \), are used as auxiliary variables reflecting the (unexplained) portion of variation after adjusting for the covariates zj. Regressing \( {e}_{yz_j} \) on \( {e}_{xz_j} \) gives the same regression coefficient as obtained in the multiple linear model {x, zj} → y, and the OLS model \( {e}_{yz_j}\to {e}_{xz_j} \) gives the identical estimate as the multiple linear model {y, zj} → x. Direction dependence decisions can then be based on these auxiliary measures. For example, for one covariate z, Model 1 extends to

$$ y={b}_{yx}x+{b}_{yz}z+{e}_{y(xz)} $$
(6)

“Purified” measures of x and y are obtained through \( {e}_{xz}=x-{b}_{xz}^{\prime }z \) and \( {e}_{yz}=y-{b}_{yz}^{\prime }z \), where \( {b}_{xz}^{\prime } \) and \( {b}_{yz}^{\prime } \) denote the OLS estimates when regressing x and y on z. Then Eq. 6 can be rewritten as

$$ {e}_{yz}={a}_{yx}{e}_{xz}+{\theta}_{yx} $$
(7)

with \( {a}_{yx}={b}_{yx}=\left({\rho}_{xy}-{\rho}_{yz}{\rho}_{xz}\right)/\left(1-{\rho}_{xz}^2\right) \) being the partial regression coefficient and θyx denoting the error term that is identical to ey(xz). For the model in Eq. 7, one obtains (a proof is given in online Appendix A)

$$ {\rho}_{xy\left|z\right.}^3=\frac{\gamma_{e_{yz}}}{\gamma_{e_{xz}}} $$
(8)
$$ {\rho}_{xy\left|z\right.}^4=\frac{\kappa_{e_{yz}}}{\kappa_{e_{xz}}} $$
(9)

(as long as \( {\upgamma}_{e_{xz}}\ne 0 \) and \( {\kappa}_{e_{xz}}\ne 0 \)), where \( {\rho}_{xy\mid z}={a}_{yx}\frac{\sigma_{e_{xz}}}{\sigma_{e_{yz}}}=\frac{\rho_{xy}-{\rho}_{yz}{\rho}_{xz}}{\sqrt{1-{\rho}_{xz}^2}\sqrt{1-{\rho}_{yz}^2}} \) is the partial correlation of x and y adjusting for z, and \( {\gamma}_{e_{yz}} \), \( {\gamma}_{e_{xz}} \), \( {\kappa}_{e_{yz}} \), and \( {\kappa}_{e_{xz}} \) are the skewness and excess-kurtosis values of eyz and exz. Because −1 ≤ ρxy∣z ≤ 1, the higher moments of eyz and exz possess the same properties as the higher moments of x and y in the bivariate case. Under the model x → y, we obtain \( \mid {\gamma}_{e_{yz}}\mid <\mid {\gamma}_{e_{xz}}\mid \) and \( \mid {\kappa}_{e_{yz}}\mid <\mid {\kappa}_{e_{xz}}\mid \), whereas \( \mid {\gamma}_{e_{yz}}\mid >\mid {\gamma}_{e_{xz}}\mid \) and \( \mid {\kappa}_{e_{yz}}\mid >\mid {\kappa}_{e_{xz}}\mid \) hold under the model y → x.
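Equations 4 and 5 can be checked by simulation. In this Python sketch (the sample size, distributions, and effect sizes are our own choices), the cube of the Pearson correlation is compared with the skewness ratio, and the fourth power with the excess-kurtosis ratio, under a true model x → y with a normal error term:

```python
import numpy as np

def skew(v):
    d = v - v.mean()
    return (d ** 3).mean() / v.std() ** 3

def exkurt(v):
    d = v - v.mean()
    return (d ** 4).mean() / v.std() ** 4 - 3.0

rng = np.random.default_rng(3)
n = 1_000_000
x = rng.chisquare(8, n)       # gamma_x = 1, kappa_x = 1.5
y = x + rng.normal(0, 2, n)   # "true" model x -> y with normal error

r = np.corrcoef(x, y)[0, 1]
print(round(r ** 3, 3), round(skew(y) / skew(x), 3))      # Eq. 4: the two agree
print(round(r ** 4, 3), round(exkurt(y) / exkurt(x), 3))  # Eq. 5: the two agree
```

Both printed pairs match up to sampling error, and the response y is visibly closer to normality than the predictor x.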

Presence of confounders

Because any continuous nonnormal confounder can affect both the distribution of x and the distribution of y, directional decisions based on higher moments of x and y are influenced by (1) the magnitude of nonnormality of u, (2) the connection strength of u and x, and (3) the connection strength of u and y. Formally, this can be shown by rewriting higher moments of x and y as a function of higher moments of u—that is, \( {\gamma}_x={\rho}_{xu}^3{\gamma}_u \), \( {\kappa}_x={\rho}_{xu}^4{\kappa}_u \), \( {\gamma}_y={\rho}_{yu}^3{\gamma}_u \), and \( {\kappa}_y={\rho}_{yu}^4{\kappa}_u \) (which follows from applying Dodge and Rousson’s, 2000, 2001, results to Model 3). Thus, the statistical power to determine the direction of effect depends on the magnitude of confounding effects and the degree of nonnormality of the confounder. If either the distribution of the confounder is close to normality or the influence of the confounder is weak (i.e., ρxu and ρyu are close to zero), no decisions can be made due to lack of sufficient nonnormality of x and y. The influence of the confounder on direction dependence decisions is given through

$$ \frac{\gamma_y}{\gamma_x}={\left(\frac{\rho_{yu}}{\rho_{xu}}\right)}^3 $$
(10)

and

$$ \frac{\kappa_y}{\kappa_x}={\left(\frac{\rho_{yu}}{\rho_{xu}}\right)}^4 $$
(11)

Thus, directional conclusions depend on the relative strength of the confounding effects. No biases in terms of model selection are expected when ∣ρyu∣ < ∣ρxu∣, because ∣γy∣ < ∣γx∣ and ∣κy∣ < ∣κx∣ still hold, which suggests the model x → y. In contrast, biases are likely to occur when ∣ρyu∣ > ∣ρxu∣, because ∣γy∣ > ∣γx∣ and ∣κy∣ > ∣κx∣ increase the risk of erroneously selecting the mis-specified model y → x.
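These relations can likewise be verified numerically. The sketch below (parameter values are our own illustration) simulates “pure” confounding (byx = 0) and compares the observed skewness ratio with the correlation ratio of Eq. 10:

```python
import numpy as np

def skew(v):
    d = v - v.mean()
    return (d ** 3).mean() / v.std() ** 3

rng = np.random.default_rng(4)
n = 1_000_000
u = rng.chisquare(8, n)            # nonnormal confounder, gamma_u = 1
x = 1.0 * u + rng.normal(0, 2, n)  # b_xu = 1
y = 0.5 * u + rng.normal(0, 2, n)  # b_yu = 0.5; "pure" confounding (b_yx = 0)

r_xu = np.corrcoef(x, u)[0, 1]
r_yu = np.corrcoef(y, u)[0, 1]
# Eq. 10: the skewness ratio equals the cubed ratio of confounder correlations
print(round(skew(y) / skew(x), 3), round((r_yu / r_xu) ** 3, 3))
```

Because ∣ρyu∣ < ∣ρxu∣ in this setup, the observed-variable pattern here still (correctly) points toward x → y even though the association is entirely spurious, illustrating why Component I alone cannot detect confounding.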

Statistical inference

von Eye and DeShon (2012) proposed using normality tests, such as D’Agostino’s (1971) skewness test and/or Anscombe and Glynn’s (1983) kurtosis test, to evaluate hypotheses compatible with observed-variable-based direction dependence. Directional decisions are based on separately evaluating the nonnormality of the predictor and the response. In addition, Pornprasertmanit and Little (2012) suggested nonparametric bootstrap CIs for higher-order moment differences (Δ(γ) = ∣γx∣ − ∣γy∣ and Δ(κ) = ∣κx∣ − ∣κy∣).
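The percentile-bootstrap idea for Δ(γ) can be sketched as follows (a minimal Python version of the resampling logic; the number of resamples and the data-generating values are our own choices, not prescriptions from the cited work):

```python
import numpy as np

def skew(v):
    d = v - v.mean()
    return (d ** 3).mean() / v.std() ** 3

def delta_gamma(x, y):
    # Delta(gamma) = |gamma_x| - |gamma_y|
    return abs(skew(x)) - abs(skew(y))

rng = np.random.default_rng(5)
n = 500
x = rng.chisquare(8, n)
y = x + rng.normal(0, 2, n)  # "true" model x -> y

boot = []
for _ in range(2000):
    idx = rng.integers(0, n, n)  # resample (x, y) pairs with replacement
    boot.append(delta_gamma(x[idx], y[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"95% CI for Delta(gamma): [{lo:.3f}, {hi:.3f}]")
```

A CI lying entirely above zero indicates ∣γx∣ > ∣γy∣ and thus supports x → y; a CI covering zero leaves the direction undecided by this component.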

DDA Component II: Distributional properties of error terms

Absence of confounders

The second DDA component focuses on the distributional shape of the error terms, eyx and exy. In essence, distributional differences of the two error terms are likely to occur when the nonnormal “true” predictor is erroneously used as the outcome because predictor nonnormality will, to some extent, be preserved in the error term of the mis-specified model. Wiedermann, Hagmann, Kossmeier, and von Eye (2013), Wiedermann, Hagmann, and von Eye (2015), and Wiedermann (2015) showed that higher moments of the error term obtained from the mis-specified model (exy) can be expressed as functions of the third and fourth moments of the true predictor (x)—that is,

$$ {\gamma}_{e_{xy}}={\left(1-{\rho}_{xy}^2\right)}^{\raisebox{1ex}{$3$}\!\left/ \!\raisebox{-1ex}{$2$}\right.}{\gamma}_x $$
(12)

and

$$ {\kappa}_{e_{xy}}={\left(1-{\rho}_{xy}^2\right)}^2{\kappa}_x $$
(13)

Thus, the skewness and excess kurtosis of exy systematically increase with the magnitude of nonnormality of the “true” predictor. Furthermore, because normality of the error term is assumed in the “true” model (i.e., \( {\gamma}_{e_{yx}}={\kappa}_{e_{yx}}=0 \)), differences in the higher moments of eyx and exy provide, again, information about the directional plausibility of a linear model. This DDA component can straightforwardly be extended to multiple linear regression models when adjusting for possible covariates (cf. Wiedermann & von Eye, 2015b). Under the model x → y, one obtains \( \mid {\gamma}_{e_{xy}}\mid >\mid {\gamma}_{e_{yx}}\mid \) and \( \mid {\kappa}_{e_{xy}}\mid >\mid {\kappa}_{e_{yx}}\mid \); under the model y → x, one obtains \( \mid {\gamma}_{e_{xy}}\mid <\mid {\gamma}_{e_{yx}}\mid \) and/or \( \mid {\kappa}_{e_{xy}}\mid <\mid {\kappa}_{e_{yx}}\mid \).
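Equation 12 and the resulting residual asymmetry can be checked with a short simulation (Python; the parameter values are our own). The residual of the mis-specified model inherits skewness from the nonnormal predictor, in the amount Eq. 12 predicts, whereas the residual of the true model remains (approximately) normal:

```python
import numpy as np

def skew(v):
    d = v - v.mean()
    return (d ** 3).mean() / v.std() ** 3

def ols_resid(pred, resp):
    # Residuals of a simple bivariate OLS regression of resp on pred
    b = np.cov(pred, resp)[0, 1] / pred.var(ddof=1)
    return resp - resp.mean() - b * (pred - pred.mean())

rng = np.random.default_rng(6)
n = 1_000_000
x = rng.chisquare(8, n)       # gamma_x = 1
y = x + rng.normal(0, 4, n)   # "true" model x -> y, normal error

e_yx = ols_resid(x, y)   # residual of the correctly specified model
e_xy = ols_resid(y, x)   # residual of the directionally mis-specified model

r = np.corrcoef(x, y)[0, 1]
print(round(skew(e_yx), 3))                     # ~ 0: normal "true" error
print(round(skew(e_xy), 3),                     # clearly nonzero ...
      round((1 - r ** 2) ** 1.5 * skew(x), 3))  # ... and matches Eq. 12
```

The same check for Eq. 13 can be run by replacing the skewness function with an excess-kurtosis estimator and the exponent 3/2 with 2.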

Presence of confounders

When an unmeasured confounder is present, the two competing models can be written as

$$ y={b}_{yx}^{\prime }x+{e}_{yx}^{\prime } $$
(14)
$$ x={b}_{xy}^{\prime }y+{e}_{xy}^{\prime } $$
(15)

where \( {b}_{yx}^{\prime } \) and \( {b}_{xy}^{\prime } \) are biased estimates of byx and bxy. Although the causal estimate in Eq. 14 is biased, the model still correctly represents the data-generating process in terms of directionality. In this case, higher moments of \( {e}_{yx}^{\prime } \) and \( {e}_{xy}^{\prime } \) depend on the magnitude of nonnormality of u and the magnitudes of bxu and byu. Specifically, the higher moments can be written as functions of semipartial correlations and higher moments of u. That is, for \( {e}_{yx}^{\prime } \) one obtains

$$ {\displaystyle \begin{array}{l}{\gamma}_{e_{yx}^{\prime }}={\rho}_{y\left(u|x\right)}^3{\gamma}_u\\ {}{\kappa}_{e_{yx}^{\prime }}={\rho}_{y\left(u|x\right)}^4{\kappa}_u\end{array}} $$
(16)

and for \( {e}_{xy}^{\prime } \) one obtains

$$ {\displaystyle \begin{array}{l}{\gamma}_{e_{xy}^{\prime }}={\rho}_{x\left(u|y\right)}^3{\gamma}_u\\ {}{\kappa}_{e_{xy}^{\prime }}={\rho}_{x\left(u|y\right)}^4{\kappa}_u,\end{array}} $$
(17)

with \( {\rho}_{y\left(u|x\right)}=\left({\rho}_{yu}-{\rho}_{xy}{\rho}_{xu}\right)/\sqrt{1-{\rho}_{xy}^2} \) being the semipartial correlation coefficient for y and u given x and \( {\rho}_{x\left(u|y\right)}=\left({\rho}_{xu}-{\rho}_{xy}{\rho}_{yu}\right)/\sqrt{1-{\rho}_{xy}^2} \) describing the semipartial correlation between x and u given y (see online Appendix A for a proof). The distributions of both error terms will be close to normality, and thus no distinct decision will be possible, when u is close to normality and/or the semipartial correlations are close to zero. If the confounder is sufficiently nonnormal, the distributional properties of the error terms, and thus directional decisions, depend on the magnitudes of the semipartial correlations. Unbiased directional decisions are possible when ∣ρy(u|x)∣ < ∣ρx(u|y)∣, because \( \mid {\gamma}_{e_{xy}^{\prime }}\mid >\mid {\gamma}_{e_{yx}^{\prime }}\mid \) and \( \mid {\kappa}_{e_{xy}^{\prime }}\mid >\mid {\kappa}_{e_{yx}^{\prime }}\mid \), which implies x → y. In contrast, if ∣ρy(u|x)∣ > ∣ρx(u|y)∣, then erroneously selecting y → x is likely, because \( \mid {\gamma}_{e_{xy}^{\prime }}\mid <\mid {\gamma}_{e_{yx}^{\prime }}\mid \) and \( \mid {\kappa}_{e_{xy}^{\prime }}\mid <\mid {\kappa}_{e_{yx}^{\prime }}\mid \).

Statistical inference

Again, nonnormality tests can be used to separately evaluate distributional properties of model residuals (cf. Wiedermann et al., 2015). An asymptotic significance test and bootstrap CIs for the skewness difference of residuals (\( \Delta \left({\gamma}_e\right)=\mid {\gamma}_{e_{xy}}\mid -\mid {\gamma}_{e_{yx}}\mid \)) have been proposed by Wiedermann et al. (2015) and Wiedermann and von Eye (2015b). The asymptotic test requires normality of the “true” error term. Only error symmetry is required for the bootstrap approach. Analogous procedures for the difference in excess-kurtosis values were discussed by Wiedermann (2015).

DDA Component III: Independence properties of predictor and error term

Absence of confounders

The independence assumption in the linear model implies that the magnitude of the error made when fitting the response is not related in any form to the predictor(s). In OLS regression, it is well known that the estimated residuals will be linearly uncorrelated with the predictor(s), which holds even when the model is directionally mis-specified. However, when the “true” predictor x is nonnormal, the predictor and error term of the mis-specified model, y and exy, will be stochastically nonindependent. To illustrate this, we start with a simulated data example. Two variables (x and y) were generated according to the linear model x → y (with zero intercept, unit slope, and a standard normal error term eyx). The “true” predictor x was drawn from either a standard normal (γx = κx = 0), a standard uniform (i.e., γx = 0, κx = –1.2), or a chi-square distribution with eight degrees of freedom (γx = 1, κx = 1.5). Figure 2 shows scatterplots of the observed predictors and estimated residuals for the “true” model and the mis-specified model y → x. In the normal case, the two models cannot be distinguished from each other. That is, both models show circular data patterns, as expected given linear uncorrelatedness. This no longer holds for nonnormal predictors. Here, the two competing models are mutually distinguishable. Although linear uncorrelatedness also holds in all the nonnormal data scenarios, clear dependence structures occur in the mis-specified model. Note that these dependence structures are not the result of special properties of the uniform and chi-square distributions. In fact, the opposite is the case: The normal distribution constitutes the special case in which competing models cannot be uniquely distinguished, because uncorrelatedness implies stochastic independence in the normal domain (cf. Hoyer, Shimizu, Kerminen, & Palviainen, 2008).

Fig. 2

Bivariate patterns of the predictors and residuals of competing linear models (x → y and y → x) for both normal and nonnormal “true” predictors
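A minimal version of this simulation can be run as follows (Python; the squared-residual correlation used here is our own crude probe for nonlinear dependence, not a formal independence test). Linear predictor–residual correlations vanish by construction in both models, but the nonlinear probe is nonzero only in the mis-specified model:

```python
import numpy as np

def ols_resid(pred, resp):
    # Residuals of a simple bivariate OLS regression of resp on pred
    b = np.cov(pred, resp)[0, 1] / pred.var(ddof=1)
    return resp - resp.mean() - b * (pred - pred.mean())

rng = np.random.default_rng(7)
n = 200_000
x = rng.chisquare(8, n)       # nonnormal "true" predictor
y = x + rng.normal(0, 4, n)   # "true" model x -> y

e_yx = ols_resid(x, y)   # residual, correctly specified model
e_xy = ols_resid(y, x)   # residual, mis-specified model

# Linear correlations are zero (up to floating point) in BOTH models
lin_true = np.corrcoef(x, e_yx)[0, 1]
lin_mis = np.corrcoef(y, e_xy)[0, 1]

# A crude nonlinear probe (predictor vs. squared residual) flags dependence
# only in the mis-specified model
nl_true = np.corrcoef(x, e_yx ** 2)[0, 1]
nl_mis = np.corrcoef(y, e_xy ** 2)[0, 1]
print(round(lin_true, 3), round(lin_mis, 3), round(nl_true, 3), round(nl_mis, 3))
```

The first three printed values are near zero, whereas the last is clearly nonzero, mirroring the dependence structures visible in Fig. 2.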

Formally, nonindependence in the mis-specified model becomes intuitively obvious if we solve for the error term of the mis-specified model in Eq. 2 and insert the “true” Model 1, which results in (see also Entner, Hoyer, & Spirtes, 2012; Shimizu, 2016)

$$ {e}_{xy}=x-{b}_{xy}y=\left(1-{\rho}_{xy}^2\right)x-{b}_{xy}{e}_{yx} $$
(18)

Thus, both the “true” predictor x and the “true” error term eyx contribute to y in Eq. 1 and to exy in Eq. 18. Although this illustration serves as an intuitive explanation, a rigorous proof of nonindependence follows from the Darmois–Skitovich theorem (Darmois, 1953; Skitovich, 1953). The theorem states that if two linear functions (υ1 and υ2) of the same independent random variables wj (j = 1 ,..., J), υ1 = ∑jαjwj and υ2 = ∑jβjwj, with αj and βj being constants, are independent, then all wj for which αjβj ≠ 0 must be normally distributed. The reverse corollary implies that if a common wj exists that is nonnormal, then υ1 and υ2 must be nonindependent (cf. Shimizu et al., 2011; Wiedermann & von Eye, 2015a). Thus, exy in Eq. 18 and y in Eq. 1 are nonindependent because they share the common nonnormal variable x and \( \left(1-{\rho}_{xy}^2\right){b}_{yx}\ne 0 \) (excluding ∣ρxy∣  = 1 as practically irrelevant). Because the Darmois–Skitovich theorem applies to J variables, covariates can straightforwardly be included in Models 1 and 2, provided that the covariates fulfill the requirements described above. However, the theorem concerns continuous random variables wj; when categorical covariates exist, a two-step regression approach should therefore be applied, with DDA subsequently performed on the residualized x and y variables. Because independence is assumed in the correctly specified model, direction dependence statements are possible by separately evaluating independence in the competing models (cf. Shimizu et al., 2011; Wiedermann & von Eye, 2015a). In essence, if the null hypothesis H0 : x ⊥ eyx is retained and, at the same time, H0 : y ⊥ exy is rejected, then it is more likely that the observed effect is transmitted from x to y. Conversely, if H0 : x ⊥ eyx is rejected and H0 : y ⊥ exy is retained, then the model y → x should be preferred.
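The decomposition in Eq. 18 can also be checked numerically. The sketch below (Python/NumPy; the exponential predictor and the slope of 0.5 are arbitrary illustrative choices) verifies that, with centered variables and sample OLS slopes, the residual of the mis-specified model decomposes exactly into the two components of Eq. 18:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
# Skewed "true" predictor and a moderate effect (both choices are illustrative)
x = rng.exponential(size=n)
y = 0.5 * x + rng.normal(size=n)

xc, yc = x - x.mean(), y - y.mean()      # centering removes the intercepts

b_yx = (xc @ yc) / (xc @ xc)             # slope of the "true" model x -> y
b_xy = (xc @ yc) / (yc @ yc)             # slope of the mis-specified model y -> x
e_yx = yc - b_yx * xc                    # residual of the "true" model
e_xy = xc - b_xy * yc                    # residual of the mis-specified model
r2 = np.corrcoef(x, y)[0, 1] ** 2        # sample rho_xy^2 = b_yx * b_xy

# Eq. 18 (sample version): e_xy = (1 - rho^2) * x - b_xy * e_yx
rhs = (1 - r2) * xc - b_xy * e_yx
print(np.max(np.abs(e_xy - rhs)))        # numerically zero
```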

Presence of confounders

When confounding affects the relation between x and y, the predictor(s) and errors of both models contain information about the confounder. This can be shown by solving Eqs. 14 and 15 for the corresponding error terms, which gives

$$ {e}_{yx}^{\prime }=\left[{b}_{yu}+\left({b}_{yx}-{b}_{yx}^{\prime}\right){b}_{xu}\right]u+\left({b}_{yx}-{b}_{yx}^{\prime}\right){e}_{xu}+{e}_{yu} $$
(19)
$$ {e}_{xy}^{\prime }=\left[{b}_{xu}-{b}_{xy}^{\prime}\left({b}_{yu}+{b}_{yx}{b}_{xu}\right)\right]u+\left(1-{b}_{xy}^{\prime }{b}_{yx}\right){e}_{xu}+{b}_{xy}^{\prime }{e}_{yu} $$
(20)

Thus, through reconsidering the “true” model given in Eq. 3 and, again, making use of the Darmois–Skitovich theorem, one concludes that the independence assumption is likely to be violated in both candidate models whenever a nonnormal confounder is present and \( \left[{b}_{yu}+\left({b}_{yx}-{b}_{yx}^{\prime}\right){b}_{xu}\right]{b}_{xu} \) and \( \left[{b}_{xu}-{b}_{xy}^{\prime}\left({b}_{yu}+{b}_{yx}{b}_{xu}\right)\right]{b}_{yu} \) deviate from zero.

Statistical inference

Significance tests for evaluating the nonindependence of (linearly uncorrelated) variables have been discussed extensively in the signal-processing literature (Hyvärinen, Karhunen, & Oja, 2001). The first class of tests considered here uses the basic definition of stochastic independence, E[g1(υ1)g2(υ2)] − E[g1(υ1)]E[g2(υ2)] = 0 for any absolutely integrable functions g1 and g2. Thus, independence tests can be constructed as correlation tests of the form cor[g1(x), g2(eyx)] and cor[g1(y), g2(exy)], where at least one function is nonlinear. These tests are easy to use because they essentially rely on the Pearson correlation test applied to nonlinearly transformed variables.

Two nonlinear functions may be of particular value in the present context: the square function, g(υ) = υ2, and the hyperbolic tangent function, g(υ) = tanh(υ). The square function is a powerful candidate because the covariances of predictor and error in the mis-specified model contain information about the higher moments of the “true” predictor (Wiedermann & von Eye, 2015a)—that is,

$$ \operatorname{cov}\left(y,{e}_{xy}^2\right)={\rho}_{xy}{\left(1-{\rho}_{xy}^2\right)}^2{\gamma}_x $$
(21)
$$ \operatorname{cov}\left({y}^2,{e}_{xy}\right)={\rho}_{xy}^2\left(1-{\rho}_{xy}^2\right){\gamma}_x $$
(22)
$$ \operatorname{cov}\left({y}^2,{e}_{xy}^2\right)={\rho}_{xy}^2{\left(1-{\rho}_{xy}^2\right)}^2{\kappa}_x $$
(23)

In other words, the power to detect nonindependence in y → x increases with the nonnormality of x. Although proofs for Eqs. 21 and 22 can be found in Wiedermann and von Eye (2015a, 2016), a proof of Eq. 23 is given in online Appendix A. Note that the covariances in Eqs. 21 and 23 involve squared residuals, which reveals a direct link to significance tests originally designed for detecting patterns of heteroscedasticity (cf. Wiedermann, Artner, & von Eye, 2017). Because heteroscedasticity occurs, among other cases, when the variance of the error can be expressed as a function g of the independent variables—that is, \( Var\left({e}_{xy}|y\right)={\sigma}_{e_{xy}}^2g(y) \) (see, e.g., Kaufman, 2013)—it follows that homoscedasticity tests that relate squared residuals to functions of the model predictors (such as the Breusch–Pagan test) are likely to indicate patterns of nonconstant error variance in directionally mis-specified models.
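A minimal version of the Breusch–Pagan logic—regress the squared residuals on the model predictor and refer n·R² of this auxiliary regression to a chi-square distribution—can be sketched as follows (Python with NumPy/SciPy; the simulated data mirror the chi-square scenario above and are purely illustrative, not the authors' implementation):

```python
import numpy as np
from scipy import stats

def ols_resid(pred, resp):
    pc = pred - pred.mean()
    b = (pc @ (resp - resp.mean())) / (pc @ pc)
    return resp - resp.mean() - b * pc

def breusch_pagan(pred, resid):
    """LM form of the Breusch-Pagan test with a single predictor:
    regress squared residuals on the predictor; n * R^2 ~ chi-square(1)."""
    zc = resid ** 2 - np.mean(resid ** 2)
    pc = pred - pred.mean()
    b = (pc @ zc) / (pc @ pc)
    r2_aux = (b ** 2) * (pc @ pc) / (zc @ zc)   # R^2 of the auxiliary regression
    lm = len(pred) * r2_aux
    return lm, stats.chi2.sf(lm, df=1)

rng = np.random.default_rng(3)
n = 2_000
x = (rng.chisquare(8, n) - 8) / 4      # skewed "true" predictor (gamma_x = 1)
y = x + rng.normal(size=n)

lm_t, p_t = breusch_pagan(x, ols_resid(x, y))   # target model x -> y
lm_a, p_a = breusch_pagan(y, ols_resid(y, x))   # mis-specified model y -> x
print(p_t, p_a)   # dependence surfaces as heteroscedasticity in y -> x only
```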

Covariances based on the hyperbolic tangent function have been proposed by Hyvärinen (2010) and Hyvärinen and Smith (2013). The value tanh(υ) is the derivative of the log-density of an inverse hyperbolic cosine distribution that provides an approximation of the likelihood ratio of directionally competing models in the bivariate case. The inverse hyperbolic cosine distribution constitutes a reasonably close approximation for several leptokurtic observed variables (cf. Mumford & Ramsey, 2014). Although tanh-based correlation tests are ideally suited for symmetric nonnormal variables, the statistical power of this approach can be expected to be low for skewed variables.
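Both families of nonlinear correlation tests reduce to Pearson correlation tests on transformed variables. The following sketch (Python with NumPy/SciPy; simulated data with a skewed predictor, not the empirical example and not the authors' macros) applies the square and tanh functions to the correctly and the incorrectly specified model:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 2_000
x = (rng.chisquare(8, n) - 8) / 4      # skewed "true" predictor
y = x + rng.normal(size=n)

def ols_resid(pred, resp):
    pc = pred - pred.mean()
    b = (pc @ (resp - resp.mean())) / (pc @ pc)
    return resp - resp.mean() - b * pc

e_yx, e_xy = ols_resid(x, y), ols_resid(y, x)

# cor[g(.), .] tests: Pearson correlation of a transformed predictor with
# the residual, for the square and the hyperbolic tangent functions
for g, name in [(np.square, "square"), (np.tanh, "tanh")]:
    r_t, p_t = stats.pearsonr(g(x), e_yx)   # target model x -> y
    r_a, p_a = stats.pearsonr(g(y), e_xy)   # mis-specified model y -> x
    print(f"{name}: p(target) = {p_t:.3f}, p(alternative) = {p_a:.3g}")
```

Consistent with the text, the square-function test should reject clearly in the mis-specified model for this skewed predictor, whereas the power of the tanh-based test for skewed data is expected to be lower.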

Because the choice of g1 and g2 is essentially arbitrary, and testing all possible function pairs is impossible, nonlinear correlation approaches do not constitute rigorous independence tests; beyond the usual small-sample limitations, an unfortunate choice of functions introduces additional Type II errors. Recently, a promising alternative was suggested: the Hilbert–Schmidt independence criterion (HSIC; Gretton et al., 2008). The HSIC evaluates the independence of functions of random variables and provably detects any dependence between two random variables in the large-sample limit. Sen and Sen (2014) introduced the HSIC in the context of testing the independence of predictors and error terms of linear regression models and proposed a bootstrap approach to approximating the distribution of the test statistic.
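A biased empirical HSIC estimator with Gaussian kernels can be written compactly. The sketch below (Python/NumPy; the median-heuristic bandwidth and the simulated skewed predictor are our own illustrative choices, and no bootstrap p value is computed) simply compares the statistic across the two candidate models:

```python
import numpy as np

def gauss_gram(v):
    """Gaussian kernel Gram matrix with the median-heuristic bandwidth."""
    d2 = (v[:, None] - v[None, :]) ** 2
    return np.exp(-d2 / np.median(d2[d2 > 0]))

def hsic(a, b):
    """Biased empirical HSIC: zero (in the limit) iff a and b are independent."""
    n = len(a)
    H = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    return np.trace(H @ gauss_gram(a) @ H @ gauss_gram(b)) / n ** 2

rng = np.random.default_rng(4)
n = 500
x = rng.exponential(size=n) - 1        # strongly skewed "true" predictor
y = x + rng.normal(size=n)

def ols_resid(pred, resp):
    pc = pred - pred.mean()
    b = (pc @ (resp - resp.mean())) / (pc @ pc)
    return resp - resp.mean() - b * pc

h_target = hsic(x, ols_resid(x, y))    # target model x -> y
h_alt = hsic(y, ols_resid(y, x))       # mis-specified model y -> x
print(h_target, h_alt)                 # h_alt is markedly larger
```

In practice, the statistic would be referred to a bootstrap or permutation distribution, as in Sen and Sen (2014).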

Model selection

Reconsidering the possible outcomes for the three DDA components, it becomes evident that each model in Fig. 1 can be uniquely identified through a specific pattern of DDA components. Table 1 summarizes these patterns, together with the corresponding significance tests for each DDA component. In general, DDA model selection requires the specification of a target and an alternative model. Whether x → y or y → x serves as the target model is arbitrary in terms of model comparison; however, following the logic of confirmatory model testing, we suggest that the target model reflect the substantive causal theory of interest and the alternative model reflect the contradicting theory. The target model, say x → y, finds support when (1) the distribution of the response y is closer to normality than the distribution of x, (2) the residual distribution of x → y is closer to normality than that of y → x, and (3) the independence assumption of residuals and predictor(s) holds for x → y and is, at the same time, violated for y → x. Here, DDA’s independence procedures constitute the key element for testing the presence of unobserved confounders: independence must hold for x → y while being violated for y → x in order to conclude that the effect is transmitted from x to y. Whenever the independence assumption is either violated in both models or satisfied in both, one has to conclude that unmeasured confounders are present (the latter possibility emerges from the fact that confounders can decrease the skewness/excess kurtosis of x and y to a degree that renders nonindependence no longer detectable). When the independence tests allow a distinct decision, higher-moment tests for the observed variables and residuals are then used to test the directionality assumption inherent in the target model.
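The independence-based part of this selection logic can be condensed into a small helper function (an illustrative sketch of the decision pattern described above, not a reproduction of Table 1; the function name and the α threshold are our own):

```python
def dda_independence_decision(p_target, p_alt, alpha=0.05):
    """Decision pattern for DDA's independence component.
    p_target / p_alt: p values of independence tests (H0: independence)
    for the target and the directionally competing model."""
    target_ok = p_target >= alpha        # independence retained for target
    alt_ok = p_alt >= alpha              # independence retained for alternative
    if target_ok and not alt_ok:
        return "target model supported"
    if alt_ok and not target_ok:
        return "alternative model supported"
    # both violated, or both retained: no distinct directional decision
    return "no distinct decision (possible unmeasured confounding)"

print(dda_independence_decision(0.40, 0.001))   # -> target model supported
print(dda_independence_decision(0.30, 0.60))    # -> no distinct decision
```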

Table 1 Properties, significance tests, and patterns of DDA components for the three candidate models

A worked empirical example with SPSS

To make DDA accessible to applied researchers, we provide three SPSS macros: the first analyzes the distributional properties of the observed variables, the second evaluates the distributional properties of the residuals, and the third implements tests of the independence assumption for the competing candidate models (the DDA macros and introductory material can be downloaded from http://www.ddaproject.com/). All macros use positional arguments to control the input parameters. Table 2 summarizes the input parameters and options for the three macros, together with generic examples of their use. Computational details and introductions to the implemented DDA inference methods are given in online Appendix B.

Table 2 Summary of arguments and their position in the three SPSS macros

To present a fully worked empirical example (consisting of preevaluating distributional requirements, building a valid target model, and subsequently using DDA) and demonstrate the use of the macros for DDA, we use data from a cross-sectional study on the triple-code model (Dehaene & Cohen, 1998). The triple-code model is used to explain the development of numerical cognition in children and proposes that numbers are represented in three different codes that serve different purposes in number processing. The analog magnitude code (AMC) represents numbers on a mental number line, includes knowledge of the proximity and size of numerical quantities, and is used in approximate estimations and magnitude comparisons. The auditory verbal code (AVC) represents numbers in syntactically organized word sequences that are important for verbal input/output, counting, and retrieving memorized arithmetic facts. The visual Arabic code (VAC) represents numerical quantities in Arabic format necessary for multidigit operations and parity judgments. Using the triple code model, von Aster and Shalev (2007) suggested a hierarchical developmental model of numerical cognition in which AMC is viewed as an inherited core system necessary to further develop the AVC and, as well, the VAC. In other words, the model posits a directional link between AMC and AVC (i.e., AMC → AVC) and AMC and VAC (i.e., AMC → VAC). In the present demonstration, we focus on the directionality of AMC and AVC.

Koller and Alexandrowicz (2010) collected AMC and AVC ability measures for 341 second- to fourth-grade elementary school children (185 girls and 156 boys, aged 6–11 years) using the Neuropsychological Test Battery for Number Processing and Calculation in Children (ZAREKI-R; von Aster, Weinhold Zulauf, & Horn, 2006). AMC sum scores are based on 31 dichotomous items (focusing on perceptual quantity estimation, placing numbers on an analog number line, counting backward, enumeration, magnitude comparison of spoken numbers, and contextual magnitude judgment), and AVC sum scores are based on 52 dichotomous items (mental calculations [addition, subtraction, and multiplication], repeating numbers forward and backward, and story problems). The sum scores were standardized prior to the analysis in order to improve interpretability. Because fourth-graders were most likely to solve all items of the AMC scale, we focused on the second- and third-grade children (n = 216; 123 girls and 93 boys) in order to avoid biased DDA results due to ceiling effects.

DDA was used to evaluate the two competing regression models (AMC → AVC vs. AVC → AMC) under adjustment for the covariates age, time needed for test completion in minutes (an indirect measure of the perceived test difficulty), and preexisting difficulties with numerical quantities (0 = no, 1 = yes). All covariates preceded test performance in time, allowing us to exclude cyclic relations. Table 3 shows pairwise correlations and descriptive measures for all considered variables. Before applying DDA, two preevaluation stages are crucial to obtaining meaningful results: (1) evaluating the distributional requirements of DDA and (2) carefully building a valid target model. Both stages are discussed in detail below.

Table 3 Bivariate Pearson correlations and descriptive measures of observed variables (means and standard deviations of AMC and AVC are based on sum scores)

Distributional requirements for DDA

DDA requires that the distributions of the observed variables deviate from normality. Thus, before estimating the target model (AMC → AVC), we evaluated the assumption of nonnormality of the variables. The AMC and AVC measures were negatively skewed, with excess-kurtosis values greater than zero (Table 3). The Shapiro–Wilk test rejected the null hypothesis of normality for both ability measures (ps < .001). Visual inspection was used to rule out the presence of outliers, and frequencies of the minimum/maximum scores were computed in order to assess potential floor/ceiling effects. For the AVC scale, no participant reached the minimum or maximum score. For AMC, no participants received the minimum, and 14 out of 216 (6.5%) reached the maximum score, which is clearly below the commonly used cutoff of 15%–20% to define a ceiling effect (e.g., Lim et al., 2015; Terwee et al., 2007). Overall, the variables can be considered in line with the distributional requirements of DDA.
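These distributional prechecks are easy to reproduce outside SPSS. The sketch below uses SciPy, which implements the D’Agostino skewness test as `skewtest` and the Anscombe–Glynn kurtosis test as `kurtosistest`; the simulated, negatively skewed sum scores are hypothetical stand-ins, not the ZAREKI-R data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
# Hypothetical, negatively skewed sum scores on a 0-31 scale (stand-in data)
scores = rng.beta(5, 1.5, size=216) * 31

print("skewness:", stats.skew(scores))                  # negative
print("excess kurtosis:", stats.kurtosis(scores))       # Fisher definition
print("D'Agostino skewness test:", stats.skewtest(scores))
print("Anscombe-Glynn kurtosis test:", stats.kurtosistest(scores))
print("Shapiro-Wilk test:", stats.shapiro(scores))

# Floor/ceiling screening: share of cases at the scale minimum/maximum
print("at minimum:", np.mean(scores <= 0), "at maximum:", np.mean(scores >= 31))
```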

Estimating and validating the target model

We started by partialling out the effect of the binary indicator “difficulties with numbers” using separate OLS regressions and extracted the residuals as “purified” AMC and AVC measures—that is, AMC_r = AMC – (0.032 – 0.850 × difficulties) and AVC_r = AVC – (0.035 – 0.795 × difficulties). Figure 3 shows the univariate distributions and the bivariate scatterplot (with the LOWESS smoothed line superimposed) for AMC_r and AVC_r. In a similar fashion, we partialled out the effect of the binary indicator on the remaining continuous covariates—that is, purified age = age – (8.222 + 0.111 × difficulties) and purified time = time – (28.411 + 2.535 × difficulties).

Fig. 3

Univariate distributions (main diagonal) and scatterplot (upper right panel) of “purified” analog magnitude code (AMC_residuals) and auditory verbal code (AVC_residuals) scores, including the bivariate LOWESS smoothed line for 216 second- and third-grade elementary school children

Next, we estimated the target model (AMC_r → AVC_r) under adjustment for the purified continuous covariates and evaluated the validity of the model using regression diagnostics. Table 4 summarizes the results for both the target model (upper panel) and the alternative model (lower panel). The linearity assumption of the target model was confirmed through inspection of the LOWESS plots and the inclusion of higher-order terms. Adding quadratic terms for the predictors did not significantly improve the model fit (e.g., including the squared predictor values increased the model R2 from .507 to .510, which was nonsignificant at the 5% level). Variance inflation factors for the predictors varied from 1.16 to 1.27, suggesting the absence of multicollinearity issues. Furthermore, we estimated leverage values and Cook’s distances for the model to check for highly influential observations. One observation, with the maximum Cook’s distance of .195 (95% of the observations had a Cook’s distance of .023 or smaller) and a leverage value of 0.075 (exceeding three times the average leverage value), was excluded from the subsequent DDA.

Table 4 Results of the two competing models (B = unstandardized coefficients, Std. Error = standard error, Beta = standardized coefficients)

Evaluating the direction of effect

To test whether the target model was indeed better suited to approximate the data-generating mechanism, we first applied the macro after setting the working directory C:\myproject (used to save and read temporary files during computations) and reading and activating the dataset data_example.sav. The macro call (see Table 2 for the argument structure) was then used to obtain the observed-variable-based direction dependence tests, with 1,000 bootstrap samples (used for the construction of confidence interval [CI] limits), a confidence level of 95%, and two-sided significance tests of normality.

The corresponding output is given in Box 1. The upper panel summarizes the results of the D’Agostino skewness and Anscombe–Glynn kurtosis tests for the putative response (columns 1–3) and predictor (columns 4–6). Skewness and excess-kurtosis values were close to zero for AVC_r, so we can retain the null hypothesis of normality. In contrast, AMC_r deviated significantly from normality with respect to skewness. The results for the excess-kurtosis estimates point in the same direction. The lower panel of Box 1 reports the 95% nonparametric bootstrap CIs for the differences in skewness Δ(γ) =  ∣γAMC_r∣−∣γAVC_r∣ and excess kurtosis Δ(κ) = ∣κAMC_r∣−∣κAVC_r∣. Although AMC_r is significantly more skewed than AVC_r, the difference in excess kurtosis was nonsignificant. Overall, the third-moment estimates provide evidence in line with the direction dependence requirements of the target model AMC_r → AVC_r.

Box 1. Results of the observed-variable-based direction dependence tests

Next, we evaluated the properties of the residuals obtained from the two competing models, using the residual-based direction dependence tests (again with two-sided tests, 1,000 bootstrap samples for constructing CI limits, and a 95% confidence level).

The upper panel of Box 2 summarizes separate skewness and excess-kurtosis tests of the regression residuals. Columns 1–3 refer to the target model, and columns 4–6 give the results for the alternative model. Although the higher-moment estimates were larger (in absolute values) for the alternative model, we cannot reject the null hypothesis of normality at the 5% level. Similar results were obtained for the higher-moment difference measures \( \Delta \left({\gamma}_e\right)=\mid {\gamma}_e^{\left( AVC\_r\to AMC\_r\right)}\mid -\mid {\gamma}_e^{\left( AMC\_r\to AVC\_r\right)}\mid \) and \( \Delta \left({\kappa}_e\right)=\mid {\kappa}_e^{\left( AVC\_r\to AMC\_r\right)}\mid -\mid {\kappa}_e^{\left( AMC\_r\to AVC\_r\right)}\mid \) (see the lower panel of Box 2). Both the asymptotic higher-moment difference tests (columns 1–3) and the 95% nonparametric bootstrap CIs (last two columns) suggested that the two models are not distinguishable in terms of their residual distributions. Thus, no clear-cut decision is possible for this component.

Box 2. Results of the residual-based direction dependence tests

In the final step, we analyzed the independence properties of the two candidate models, which is the most important element for interpreting OLS estimates as causal. The corresponding macro call computes Breusch–Pagan (BP) homoscedasticity tests and nonlinear correlation tests using the square function. The results are summarized in Box 3. The upper section of the output gives the results of the BP and robust BP tests for the target and the alternative model. Overall, the results were clearly in favor of the target model—that is, the homoscedasticity assumption holds for the target model and, at the same time, is violated for the alternative model. In addition, we used scatterplots of the standardized predicted values and standardized residuals of both models as a visual aid (see Fig. 4). No conspicuous patterns were observed for the target model, whereas the plot for the alternative model suggested an inverse U-shaped pattern. The lower section of Box 3 summarizes the results of the nonlinear correlation tests. In general, given a selected function g, Pearson correlation coefficients, t values, and p values are computed for cor[g(pred), e], cor[pred, g(e)], and cor[g(pred), g(e)]. In the present example, the nonlinear correlation tests based on the square function again clearly favored the target model; that is, all tests were nonsignificant for the target model and, at the same time, all tests rejected the null hypothesis for the alternative model.

Fig. 4

Residual plots of the target and the alternative models, with bivariate LOWESS smoothed lines

Box 3. Results of homoscedasticity tests and nonlinear correlation tests based on the square function


Finally, we computed HSIC tests for the two competing models using 500 bootstrap samples; Box 4 gives the corresponding output. Note that the HSIC will be zero if and only if the predictor and the error term are stochastically independent. Again, a nonsignificant result was observed for the target model, whereas the HSIC reached significance for the alternative model. In sum, all independence measures indicated that the target model is more likely to hold for the present dataset.

Box 4. Results of the HSIC tests for the two competing models


Considering the overall results of DDA for the numerical-cognition example, we conclude that, taking into account the covariates, AVC is indeed more likely to reflect the response, and AMC is more likely to be on the explanatory side. In other words, on the basis of the present sample, the DDA results empirically support von Aster and Shalev’s (2007) hierarchical developmental model of numerical cognition.

Discussion

DDA allows researchers to test hypotheses about the directional relation between pairs of variables while adjusting for covariates that possibly contribute to the causal process. This empirical falsification approach is based on translating a substantive causal theory into a linear target model that is then compared with the corresponding alternative model; the two models differ in the direction hypothesized for the causal process. DDA component patterns can then be used to retain the target model, retain the directionally competing model, or conclude that no distinct decision is possible due to the presence of unmeasured confounders. Here, it is important to reiterate that directional conclusions derived from DDA component patterns rest on the operationalization of the latent constructs \( \mathcal{X} \) and \( \mathcal{Y} \) and on the linear model as an approximation of an unknown “true” functional relation \( \mathcal{F} \). The trustworthiness of DDA thus ultimately depends on both the quality of operationalization and the validity of the linear model as a description of the causal mechanism. Although both requirements essentially apply to any linear modeling approach, they deserve particular attention in the context of DDA.

Because higher moments of variables constitute the key elements for selecting among directionally competing models, DDA assumes that nonnormality of variables reflects inherent distributional characteristics of the constructs under study. Although the phenomenon of nonnormal variable distributions and its occurrence in practice have been studied extensively in the psychometric literature (Blanca, Arnau, López-Montiel, Bono, & Bendayan, 2013; Cain, Zhang, & Yuan, 2017; D. L. Cook, 1959; Lord, 1955; Micceri, 1989), not every form of nonnormality makes variables eligible for DDA. In classical test theory, for example, the impact of the discrimination and difficulty of a measurement instrument on the relation between latent traits and true scores is well understood. To ensure that the observed score distributions adequately reflect the distributional properties of a latent trait, the test characteristic curve should go straight through the range of the trait distribution, which is usually achieved by using items with a broad range of difficulties (Lord & Novick, 1968, p. 392). In addition, item response theory (IRT) models such as the Rasch model (Rasch, 1960/1980, for dichotomous data) and the partial credit model (Masters, 1982, for polytomous data) are valuable alternatives. These models (1) come with empirical measures to evaluate the adequacy of describing a given dataset, (2) provide accordingly “weighted” parameter estimates (i.e., taking into account item difficulties), and (3) if the measurement model holds, exhibit the feature of specific objectivity (i.e., items can be compared irrespective of the distribution of person parameters, and subjects can be compared using any proper set of items), which allows the most adequate estimation of the underlying trait distributions.
For example, the data on numerical cognition used here for illustrative purposes were shown to be in line with the Rasch model (see Koller & Alexandrowicz, 2010), which implies that raw scores are sufficient statistics for the latent person abilities. In contrast, applying DDA in cases in which nonnormality of variables is a by-product of poor item selection or scaling (Ho & Yu, 2015), or the result of ceiling/floor effects, will lead to biased results (note that, in the empirical example, the fourth-grade children, who were most likely to solve all scale-specific items, were excluded to reduce the risk of bias due to ceiling effects). Overall, selecting high-quality measurement instruments at the study planning stage, or carefully evaluating the psychometric properties of secondary data, are central steps toward meaningful DDA outcomes.

Explanatory modeling, in general, requires that the selected statistical model f can easily be linked to the corresponding theoretical model \( \mathcal{F} \) (Shmueli, 2010). Because the “true” data-generating mechanism \( \mathcal{Y}=\mathcal{F}\left(\mathcal{X}\right) \) is unknown in any statistical modeling approach (Cudeck & Henly, 2003), empirical examinations of whether y = f(x) is close enough to \( \mathcal{Y}=\mathcal{F}\left(\mathcal{X}\right) \) are impossible. The appropriateness of f must be established indirectly, by critically evaluating model validity using regression diagnostics (cf. Belsley, Kuh, & Welsch, 1980; R. D. Cook & Weisberg, 1982). Several model checks are indispensable before applying DDA. First, one needs to ensure that the assumption of linearity is justified (in the illustrative example, we used visual diagnostics and evaluated changes in R2 when adding higher polynomials of all continuous variables). Second, evaluating potential issues of multicollinearity (e.g., by inspecting pairwise predictor correlations and VIFs) is necessary to avoid biased inference due to inflated standard errors. Third, the absence of outliers and highly influential data points must be confirmed (e.g., by examining Cook’s distances, leverage statistics, or deleted studentized residuals). Ideally, the process of building a valid target model and the subsequent evaluation of its directional properties constitute two separate steps. This implies that unintended DDA outcomes should not be used as a basis for deleting “misbehaving” data points.

The case of nonnormal “true” errors

The DDA framework presented here assumes that the “true” error follows a normal distribution. Although, in best-practice applications, normality of residuals should routinely be evaluated to guarantee valid statistical inference (Box & Watson, 1962; Hampel, 1973; Pearson, 1931; White & MacDonald, 1980), normality is not required for OLS coefficients to be the best linear unbiased estimates. Normal “true” errors are particularly important for residual-distribution-based DDA tests when measures of both skewness and excess kurtosis are considered simultaneously, because normality of the correctly specified error then serves as a benchmark for model comparison. However, when one focuses only on the skewness of the competing error terms, model selection can be performed as long as \( {\gamma}_{e_{yx}} \) = 0—that is, no explicit assumption is made concerning \( {\kappa}_{e_{yx}} \). Model selection should then be based on nonparametric bootstrap CIs of skewness differences instead of the asymptotic skewness difference test (cf. Wiedermann & von Eye, 2015c). Conversely, when focusing solely on the excess kurtosis of the error terms, no explicit assumption is made concerning the symmetry of the “true” error distribution; as long as \( {\kappa}_{e_{yx}}=0 \) holds for the “true” model, \( {\gamma}_{e_{yx}} \) is allowed to vary within the range \( -\sqrt{2} \) to \( \sqrt{2} \), according to the skewness–kurtosis inequality κ ≥ γ2 − 2 (cf. Teuscher & Guiard, 1995).

Although DDA based on the skewness and excess kurtosis of the observed variables also requires normality of the “true” errors, focusing on either skewness or excess kurtosis relaxes the distributional assumptions about the “true” error in the same fashion (for a detailed discussion of distinguishing directionally competing models under error nonnormality, see also Wiedermann & Hagmann, 2015). In addition, alternative DDA measures based on higher-order correlations \( {\rho}_{ij}\left(x,y\right)={\operatorname{cov}}_{ij}\left(x,y\right)/\left({\sigma}_x^i{\sigma}_y^j\right) \), with covij(x, y) = E[(x − E[x])i (y − E[y])j], are available that make no assumptions about the “true” error distribution. Dodge and Rousson (2001) showed that ρxy = ρ12(x, y)/ρ21(x, y) holds whenever the “true” predictor is asymmetrically distributed, without imposing distributional assumptions on the error. Thus, one obtains \( {\rho}_{12}^2\left(x,y\right)<{\rho}_{21}^2\left(x,y\right) \) under x → y and \( {\rho}_{12}^2\left(x,y\right)>{\rho}_{21}^2\left(x,y\right) \) under y → x, independent of the error distribution. A nonparametric bootstrap approach can again be used for statistical inference. Similarly, kurtosis-based DDA measures can be obtained by focusing on \( {\rho}_{13}^2\left(x,y\right) \) and \( {\rho}_{31}^2\left(x,y\right) \) (cf. Wiedermann, 2017). Implementing additional DDA measures for potentially nonnormal “true” errors in the macros is planned for the future.
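The Dodge–Rousson measure can be computed directly from sample moments. The sketch below (Python/NumPy; the exponential predictor and Laplace error are our own illustrative choices, the latter chosen deliberately nonnormal to show that no assumption on the error distribution is needed) computes ρ12 and ρ21 and checks the direction inequality:

```python
import numpy as np

def rho_ij(x, y, i, j):
    """Higher-order correlation rho_ij = cov_ij(x, y) / (sigma_x^i * sigma_y^j)."""
    xs = (x - x.mean()) / x.std()
    ys = (y - y.mean()) / y.std()
    return np.mean(xs ** i * ys ** j)

rng = np.random.default_rng(7)
n = 10_000
x = rng.exponential(size=n)           # skewed "true" predictor (gamma_x = 2)
y = 0.8 * x + rng.laplace(size=n)     # deliberately nonnormal error

r12, r21 = rho_ij(x, y, 1, 2), rho_ij(x, y, 2, 1)
print(r12 / r21, np.corrcoef(x, y)[0, 1])   # Dodge-Rousson: ratio ~ rho_xy
print(r12 ** 2 < r21 ** 2)                  # True under x -> y
```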

Methods to assess the independence of predictor(s) and error can be applied without further modification even when the “true” error is nonnormal. The reason is that the Darmois–Skitovich theorem, as applied in the present context, imposes no distributional assumptions on the “true” error: nonindependence of predictor(s) and error will hold whenever at least one common variable is nonnormal. Thus, the independence assumption of competing models can be evaluated when (1) only the “true” predictor, (2) only the “true” error, or (3) both deviate from normality, as long as the product of the corresponding coefficients (see Eq. 18) is unequal to zero. However, results of competing BP tests to assess patterns of heteroscedasticity in the two candidate models must be interpreted with caution when the residuals of both models deviate from normality. In this case, the Type I error rates of the test will be distorted, and directional decisions should be based on Koenker’s robust BP test.

Power and sample size considerations: What we know so far

To provide guidelines for the number of observations necessary to achieve sufficient power, we summarize previous simulation studies on the DDA components, focusing on three factors that affect empirical power rates: the magnitude of nonnormality, the magnitude of the causal effects, and the sample size. Dodge and Rousson (2016) evaluated the power of nonparametric bootstrap CIs of Δ(γ) = |γx| – |γy| and Δ(κ) = |κx| – |κy| and concluded that skewness-based model selection outperforms the kurtosis-based approach in terms of the statistical power to detect the “true” model. Here, for small effects (R2 = .25) and skewness values of 2, sample sizes as small as n = 50 may be sufficient to achieve statistical power close to 80%. In contrast, for kurtosis-based selection, sample sizes of n = 500 and excess-kurtosis values larger than 4 are needed to achieve similar power.
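A nonparametric bootstrap CI for Δ(γ), of the kind evaluated in these power studies, can be sketched as follows (Python with NumPy/SciPy; the sample size, distributions, and number of bootstrap samples are illustrative choices, not the settings of the cited simulations):

```python
import numpy as np
from scipy import stats

def skew_diff_ci(x, y, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for Delta(gamma) = |gamma_x| - |gamma_y|."""
    rng = np.random.default_rng(seed)
    n = len(x)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)          # resample cases, keeping pairs intact
        diffs[i] = abs(stats.skew(x[idx])) - abs(stats.skew(y[idx]))
    return tuple(np.quantile(diffs, [(1 - level) / 2, (1 + level) / 2]))

rng = np.random.default_rng(8)
n = 500
x = rng.chisquare(2, n)                      # strongly skewed "true" predictor
y = x + rng.normal(scale=2.0, size=n)        # R^2 = .5 by construction
lo, hi = skew_diff_ci(x, y)
print(lo, hi)    # a CI entirely above zero supports x -> y
```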

Wiedermann and von Eye (2015b) evaluated power properties of residual distribution-based methods considering separate D’Agostino skewness tests, the asymptotic skewness difference test, and nonparametric bootstrap CIs for \( \Delta \left({\gamma}_e\right)=\mid {\gamma}_{e_{xy}}\mid -\mid {\gamma}_{e_{yx}}\mid \), and concluded that acceptable power levels can already be observed for n = 75 when causal effects are small (ρxy = .25) and the true predictor is sufficiently skewed (i.e., γx ≥ 2). Because model selection based on separate normality tests proves more powerful than tests based on Δ(γe), n = 50 may already be sufficient for separate D’Agostino tests. In general, at least n = 125 is required for less skewed variables (e.g., γx = 1) and lower correlations (e.g., ρxy = .25). Model selection based on excess-kurtosis differences of residual distributions was evaluated by Wiedermann (2015). Again, separate Anscombe–Glynn tests outperformed procedures based on the difference of excess-kurtosis estimates. Here, for n = 200 and ρxy = .4, excess-kurtosis values larger than 4 are necessary for power rates close to 80%.
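Model selection via separate D’Agostino skewness tests on the two residual distributions can be sketched as follows (SciPy’s `skewtest` implements D’Agostino’s test; the simulated data-generating choices are illustrative assumptions):

```python
import numpy as np
from scipy import stats

def ols_resid(pred, resp):
    """Residuals of a simple OLS regression of resp on pred."""
    X = np.column_stack([np.ones(len(pred)), pred])
    return resp - X @ np.linalg.lstsq(X, resp, rcond=None)[0]

rng = np.random.default_rng(11)
n = 300
x = rng.exponential(size=n)          # skewed "true" predictor (gamma_x = 2)
y = 0.5 * x + rng.normal(size=n)     # data generated under x -> y

# D'Agostino skewness tests on both residual distributions:
p_xy = stats.skewtest(ols_resid(x, y)).pvalue   # target model: ~normal residuals
p_yx = stats.skewtest(ols_resid(y, x)).pvalue   # mis-specified: skewed residuals
# The model whose residuals pass the normality check is retained.
```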

Wiedermann, Artner, and von Eye (2017) compared the performance of nine homoscedasticity tests to evaluate the independence assumption in competing models and showed that the BP-test was the most powerful procedure to select the correct model. For slightly skewed predictors (γx = 0.75), large effects and large sample sizes n ≥ 400 may be required to achieve sufficient power. For γx ≥ 1.5 and medium effect sizes, at least n = 200 may be required. Quite similar results were obtained for model selection based on nonlinear correlation tests of the form \( cor\left(x,{e}_{yx}^2\right) \) (Wiedermann & von Eye, 2016). However, γx ≥ 1.5 and large effects are necessary to obtain power rates beyond 80% when n ≥ 200. Systematic simulation experiments that (1) compare the statistical power of several other independence tests and (2) evaluate all DDA components simultaneously constitute important future endeavors.
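The nonlinear correlation test can be sketched as a plain Pearson correlation between each tentative predictor and the squared residuals of its model; the data-generating choices below (skewed cause, large effect) are illustrative assumptions:

```python
import numpy as np
from scipy import stats

def ols_resid(pred, resp):
    """Residuals of a simple OLS regression of resp on pred."""
    X = np.column_stack([np.ones(len(pred)), pred])
    return resp - X @ np.linalg.lstsq(X, resp, rcond=None)[0]

rng = np.random.default_rng(5)
n = 2000
x = rng.exponential(size=n)          # gamma_x = 2
y = 0.8 * x + rng.normal(size=n)     # large effect under x -> y

# cor(predictor, squared residuals) for each candidate model:
r_xy, p_xy = stats.pearsonr(x, ols_resid(x, y) ** 2)   # target model
r_yx, p_yx = stats.pearsonr(y, ols_resid(y, x) ** 2)   # mis-specified model
# Only the mis-specified model is expected to yield a significant correlation.
```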

Further application scenarios and extensions

It is important to note that the proposed method is not restricted to the standard multiple regression setup presented here. DDA is also applicable in other scenarios in which directionality issues have been deemed untestable. For example, when a statistical relation between two variables, x and y, has been established, researchers may further entertain hypotheses about the role of a third measured variable. Whether this third variable (m) should be conceptualized as a mediator (an intervening variable that transmits the effect from x to y) or as an observed confounder cannot be answered with standard statistical methods (MacKinnon, Krull, & Lockwood, 2000). From a DDA perspective, distinguishing between these models reduces to separately evaluating the directionality of x and m (i.e., whether x → m or m → x should be preferred) and of m and y (i.e., whether m → y or y → m holds for the data), provided that the nonnormality requirements are fulfilled (for extensions of residual- and independence-based DDA to mediation models, see Wiedermann & von Eye, 2015c, 2016). Furthermore, the application of DDA need not be restricted to observational studies. Directionality issues also arise in experimental studies, in particular those designed to test hypotheses that go beyond total effects in randomized trials. Here, mediation models may again provide sound explanations of how experimental interventions causally affect the target outcome (Bullock, Green, & Ha, 2010; Heckman & Smith, 1995; Imai, Keele, & Tingley, 2010). However, even when the predictor is under experimental control, it is well known that neither the direction (Wiedermann & von Eye, 2015c) nor the magnitude of the causal effect of the mediator on the outcome can be identified uniquely without imposing strong assumptions on the data, assumptions similar to those required in observational studies (Imai, Tingley, & Yamamoto, 2013; Keele, 2015). Again, DDA may help researchers gain further insight by evaluating competing mediator–outcome paths while adjusting for an experimentally controlled predictor.
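The pairwise decomposition for the mediation case can be sketched with the observed-variable skewness rule (under a → b with normal error, |γb| = |ρ|³|γa| < |γa|); the simulated chain and the use of the simple marginal-skewness comparison, without covariate adjustment, are illustrative assumptions:

```python
import numpy as np
from scipy import stats

def prefers_first_as_cause(a, b):
    """Skewness-based direction rule: under a -> b with normal error,
    |skew(b)| = |rho|^3 * |skew(a)|, so the cause is the more skewed variable."""
    return abs(stats.skew(a)) > abs(stats.skew(b))

rng = np.random.default_rng(2)
n = 20_000
x = rng.exponential(size=n)          # skewed exogenous cause
m = 0.6 * x + rng.normal(size=n)     # x -> m
y = 0.6 * m + rng.normal(size=n)     # m -> y, i.e., a mediation chain x -> m -> y

xm = prefers_first_as_cause(x, m)    # evaluates x -> m versus m -> x
my = prefers_first_as_cause(m, y)    # evaluates m -> y versus y -> m
# Both checks supporting the stated directions is consistent with mediation
# (x -> m -> y) rather than m acting as a confounder of x and y.
```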

Extensions of the direction dependence methodology proposed in this article can go in a number of directions. First, developing DDA for moderation models would enable researchers to test the direction of effect while accounting for a third variable that modifies the relation between predictor and response (the fact that the nature of a moderator effect may depend on the direction of the postulated model has been shown by Judd & Kenny, 2010). Similarly, future work is needed to study principles of direction dependence in polynomial regression models (i.e., models that include higher-order terms; cf. Aiken & West, 1991) and in more general linearizable regression models (i.e., nonlinear regression functions that can be linearized through proper variable transformations). Another possible extension concerns the complexity of the research design. Although the presented framework is designed for single-level data, developing DDA for multilevel regression models (Raudenbush & Bryk, 2002) would allow researchers to account for hierarchical (nested) data structures. Further, throughout the article we assumed that the “true” predictor is measured without measurement error. Although first attempts to extend DDA components to measurement error models are given in von Eye and Wiedermann (2014) and Wiedermann, Merkle, and von Eye (2018), extending direction dependence to latent variable models may overcome potential biases in directional decisions resulting from imprecise measurement of constructs. Finally, the present study focused on cases in which both the tentative predictor and the tentative response are continuous variables (covariates can be either continuous or categorical). The reason for this is that both candidate models (x → y and y → x) must be specified as standard linear regression models (similarly, the proposed SPSS macros are designed to evaluate two competing standard linear models). Although previous studies (cf.
Inazumi et al., 2011; Peters, Janzing, & Schölkopf, 2011; von Eye & Wiedermann, 2016, 2017; Wiedermann & von Eye, 2018) discussed principles of direction dependence when both variables are categorical in nature, extending DDA to the generalized linear modeling framework (McCullagh & Nelder, 1989) would be most promising for evaluating causal relations among categorical, count, and continuous variables.

Author note

We thank the two anonymous reviewers, Wes Bonifay, Francis Huang, Edgar C. Merkle, Anna P. Nutt, and Phillip K. Wood for their constructive comments on an earlier version of the article. We are also indebted to Ingrid Koller for providing the data used for illustrative purposes.