First, we introduce structural equation models (SEMs) before recapping linear anchor regression. In Sect. 2.3, we switch perspectives from modelling the conditional expectation to transformation models which capture the entire conditional distribution. The notation used in this work is described in Appendix A.
Structural equation models
Let \(Y\) be a response taking values in \(\mathbb {R}\), let \({\varvec{X}}\) be a random vector of covariates taking values in \(\mathbb {R}^p\), let \({\varvec{H}}\) denote hidden confounders with sample space \(\mathbb {R}^d\), and let \({\varvec{A}}\) denote exogenous variables (called anchors, due to their exogeneity; a source node in the graph in Fig. 1) taking values in \(\mathbb {R}^q\). The SEM governing linear anchor regression is given by
$$\begin{aligned} \begin{pmatrix} Y\\ {\varvec{X}}\\ {\varvec{H}}\end{pmatrix} \leftarrow {\mathbf {B}}\begin{pmatrix} Y\\ {\varvec{X}}\\ {\varvec{H}}\end{pmatrix} + {\mathbf {M}}{\varvec{A}}+ {\varvec{\varepsilon }}, \end{aligned}$$
(1)
where the \((1+p+d) \times (1+p+d)\) matrix \({\mathbf {B}}\) encodes the structure of the SEM in terms of a directed acyclic graph (DAG), the effect of \({\varvec{A}}\) enters linearly via the \((1 + p + d) \times q\) matrix \({\mathbf {M}}\), and \({\varvec{\varepsilon }}\) denotes the error term with mutually independent components. The “\(\leftarrow \)” symbol is algebraically a distributional equality sign. It emphasizes the structural character of the SEM: e.g., \(Y\) is a function only of the parents of the node \(Y\) in the structural DAG and of the first entry in the additive component \(({\mathbf {M}}{\varvec{A}}+ {\varvec{\varepsilon }})\).
The anchors \({\varvec{A}}\) may be continuous or discrete. In the special case of discrete anchors, each level can be viewed as an “environment”.
We define perturbations as intervening on \({\varvec{A}}\), e.g., by \({{\,\mathrm{do}\,}}({\varvec{A}}= {\varvec{a}})\), which replaces \({\varvec{A}}\) by \({\varvec{a}}\) in the SEM while leaving the underlying mechanism, i.e., the coefficients in the SEM, unchanged. In this work we restrict ourselves to \({{\,\mathrm{do}\,}}\)- (Pearl 2009) and \({{\,\mathrm{push}\,}}\)-interventions (Markowetz et al. 2005) on \({\varvec{A}}\), which in turn lead to shifts in the distribution of \({\varvec{X}}\). In essence, \({{\,\mathrm{do}\,}}\)-interventions replace a node in a graph with a deterministic value, whereas \({{\,\mathrm{push}\,}}\)-interventions are stochastic and only “push” the distribution of the intervened random variable towards, e.g., having a different mean. Since \({\varvec{A}}\) is exogenous and a source node in the graph, the specific type of intervention does not play a major role. Christiansen et al. (2021) show that under the above conditions OOD generalization is possible in linear models.
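As an illustration, the following Python sketch simulates a toy instance of SEM (1) with one covariate, one hidden confounder and a scalar anchor, once observationally and once under \({{\,\mathrm{do}\,}}({\varvec{A}}= 3)\); all coefficients and names are our own illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

def simulate(a=None):
    """Simulate a toy instance of SEM (1): A -> X, H -> (X, Y), X -> Y.
    Passing `a` corresponds to the intervention do(A = a); the structural
    coefficients (the mechanism) stay unchanged."""
    A = np.full(n, float(a)) if a is not None else rng.normal(size=n)
    H = rng.normal(size=n)                  # hidden confounder
    X = 2.0 * A + H + rng.normal(size=n)    # anchor effect M_X = 2
    Y = 0.5 * X + H + rng.normal(size=n)    # causal effect of X on Y
    return X, Y

X_obs, _ = simulate()           # observational data
X_do, _ = simulate(a=3.0)       # perturbed data under do(A = 3)
print(X_obs.mean(), X_do.mean())   # the intervention shifts the distribution of X
```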
Linear anchor regression
Linear \(L_2\) anchor regression with its corresponding causal regularization estimates the linear regression parameter \({\varvec{\beta }}\) as
$$\begin{aligned} \hat{{\varvec{\beta }}} = \mathop {\mathrm {arg\,min}}\limits _{{\varvec{\beta }}} \bigg \{ \left\Vert ({{\,\mathrm{Id}\,}}- \varvec{\Pi }_{{\mathbf {A}}})({\varvec{y}}-{\mathbf {X}}{\varvec{\beta }})\right\Vert _2^2/n + \gamma \left\Vert \varvec{\Pi }_{{\mathbf {A}}}({\varvec{y}}- {\mathbf {X}}{\varvec{\beta }})\right\Vert _2^2/n \bigg \}, \end{aligned}$$
where \(0 \le \gamma \le \infty \) is a regularization parameter and \(\varvec{\Pi }_{{\mathbf {A}}}= {\mathbf {A}}({\mathbf {A}}^\top {\mathbf {A}})^{-1}{\mathbf {A}}^\top \) denotes the orthogonal projection onto the column space of the anchors (Rothenhäusler et al. 2021). For \(\gamma = 1\) one obtains ordinary least squares, \(\gamma \rightarrow \infty \) corresponds to two-stage least squares as in instrumental variables regression, and \(\gamma = 0\) corresponds to partialling out the anchor variables \({\varvec{A}}\) (which is equivalent to ordinary least squares when regressing \(Y\) on \({\varvec{X}}\) and \({\varvec{A}}\)). Causal regularization encourages, for large values of \(\gamma \), uncorrelatedness of the anchors \({\varvec{A}}\) and the residuals. As a procedure, causal regularization does not depend at all on the SEM in Eq. (1). However, as described below, the method inherits a distributional robustness property, whose formulation depends on the SEM in Eq. (1).
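Computationally, the minimizer can be obtained by transforming the data with \({\mathbf {W}}_\gamma = ({{\,\mathrm{Id}\,}}- \varvec{\Pi }_{{\mathbf {A}}}) + \sqrt{\gamma }\, \varvec{\Pi }_{{\mathbf {A}}}\) and running ordinary least squares on the transformed data, since the two terms of the loss live on orthogonal subspaces. A minimal Python sketch (the function name and the explicit \(n \times n\) projection matrix are ours, chosen for readability rather than efficiency):

```python
import numpy as np

def anchor_regression(X, y, A, gamma):
    """Linear anchor regression: transform the data with
    W = (Id - Pi_A) + sqrt(gamma) * Pi_A, then run ordinary least squares."""
    Pi_A = A @ np.linalg.solve(A.T @ A, A.T)          # projection onto col(A)
    W = np.eye(len(y)) - Pi_A + np.sqrt(gamma) * Pi_A
    beta_hat, *_ = np.linalg.lstsq(W @ X, W @ y, rcond=None)
    return beta_hat
```

For \(\gamma = 1\), \({\mathbf {W}}_\gamma \) is the identity matrix and ordinary least squares is recovered.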
Rothenhäusler et al. (2021) establish the duality between the \(L_2\) loss in linear anchor regression and optimizing a worst-case risk over specific shift perturbations. The authors consider shift perturbations \({\varvec{\nu }}\), which are confined to lie in the set
$$\begin{aligned} C_\gamma := \bigg \{{\varvec{\nu }}: {\varvec{\nu }}= {\mathbf {M}}{\varvec{\delta }}, \; {\varvec{\delta }} \text { independent of } {\varvec{\varepsilon }}, \; \mathbb {E}\left[ {\varvec{\delta }}{\varvec{\delta }}^\top \right] \preceq \gamma \mathbb {E}\left[ {\varvec{A}}{\varvec{A}}^\top \right] \bigg \}, \end{aligned}$$
and which generate the perturbed response \(Y^{\varvec{\nu }}\), and covariates \({\varvec{X}}^{\varvec{\nu }}\) via
$$\begin{aligned} \begin{pmatrix} Y^{\varvec{\nu }}\\ {\varvec{X}}^{\varvec{\nu }}\\ {\varvec{H}}^{\varvec{\nu }}\end{pmatrix} \leftarrow {\mathbf {B}}\begin{pmatrix} Y^{\varvec{\nu }}\\ {\varvec{X}}^{\varvec{\nu }}\\ {\varvec{H}}^{\varvec{\nu }}\end{pmatrix} + {\varvec{\nu }}+ {\varvec{\varepsilon }}. \end{aligned}$$
The set \(C_\gamma \) contains all vectors which lie in the span of the columns of \({\mathbf {M}}\) and thus point in the same direction as the exogenous contribution \({\mathbf {M}}{\varvec{A}}\) of the anchor variables. The average size and direction of the perturbations \({\varvec{\delta }}\) are limited by \(\gamma \) and the centered anchors’ variance-covariance matrix. Now, the explicit duality between the worst-case risk over all shift perturbations of limited size and the linear \(L_2\) anchor loss is given by
$$\begin{aligned} \sup _{{\varvec{\nu }}\in C_\gamma }&\mathbb {E}\left[ (Y^{\varvec{\nu }}- ({\varvec{X}}^{\varvec{\nu }})^\top {\varvec{\beta }})^2\right] \nonumber \\&\quad =\mathbb {E}\left[ (({{\,\mathrm{Id}\,}}- P_{\varvec{A}})(Y-{\varvec{X}}^\top {\varvec{\beta }}))^2\right] \nonumber \\&\qquad +\gamma \mathbb {E}\left[ (P_{\varvec{A}}(Y- {\varvec{X}}^\top {\varvec{\beta }}))^2\right] , \end{aligned}$$
(2)
where \(P_{\varvec{A}}= \mathbb {E}[\cdot \vert {\varvec{A}}]\) is the population analogue of \(\varvec{\Pi }_{{\mathbf {A}}}\). We note that the right-hand side is the population analogue of the objective function in anchor regression. Hence, causal regularization in anchor regression provides guarantees for optimizing worst-case risk across a class of shift perturbations. The details are provided in Rothenhäusler et al. (2021).
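For dummy-encoded discrete anchors, \(\varvec{\Pi }_{{\mathbf {A}}}\) simply replaces each residual by its environment-wise mean, so the penalty term measures between-environment differences of the average residuals. A small numerical check (data and names are made up):

```python
import numpy as np

env = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])    # three environments (anchor levels)
A = np.eye(3)[env]                              # dummy-encoded anchors
r = np.array([1.0, 2.0, 3.0, -1.0, 0.0, 1.0, 4.0, 4.0, 4.0])   # some residuals

Pi_A = A @ np.linalg.solve(A.T @ A, A.T)
within_env_means = np.array([r[env == e].mean() for e in env])
assert np.allclose(Pi_A @ r, within_env_means)  # Pi_A averages within environments
```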
Transformation models
We now switch perspective from models for the conditional mean to modelling conditional distributions. Specifically, we consider transformation models (TMs; Hothorn et al. 2014). TMs decompose the conditional distribution of \(Y\vert {\varvec{x}}\) into a pre-defined distribution function \(F_Z\), with log-concave density \(f_Z\), and a (semi-)parametric transformation function \(h(y\vert {\varvec{x}})\), which is monotone non-decreasing in \(y\):
$$\begin{aligned} F_{Y\vert {\varvec{x}}}(y\vert {\varvec{x}}) = F_Z(h(y\vert {\varvec{x}})). \end{aligned}$$
This way, the problem of estimating a conditional distribution simplifies to estimating (the parameters of) the transformation function \(h= F_Z^{-1} \circ F_{Y\vert {\varvec{x}}}\) (since \(F_Z\), called the inverse link, is pre-specified and parameter-free). Depending on the complexity of \(h\), very flexible conditional distributions can be modelled. Hothorn et al. (2018) give theoretical guarantees for the existence and uniqueness of the transformation function \(h\) for absolutely continuous, count and ordered discrete random variables. For the sake of generality, \(h\) is parametrized in terms of a basis expansion in the argument \(y\), which can be as simple as a linear function in \(y\) or as complex as a spline to model a smooth function in \(y\).
In this work, we assume the transformation function for a continuous response can be additively decomposed into a linear predictor in \({\varvec{x}}\) and a smooth function in \(y\), which is modelled as a Bernstein polynomial of order \(P\) with parameters \({\varvec{\theta }}\in \mathbb {R}^{P+1}\) (Hothorn et al. 2018), such that \(h(y \vert {\varvec{x}}) = {\varvec{b}}_{\text {Bs},P}(y)^\top {\varvec{\theta }}+ {\varvec{x}}^\top {\varvec{\beta }}\). Monotonicity of \({\varvec{b}}_{\text {Bs},P}(y)^\top {\varvec{\theta }}\), and thereby of \(h(y\vert {\varvec{x}})\), can then be enforced via the \(P\) linear constraints \(\theta _1 \le \theta _2 \le \dots \le \theta _{P+1}\). In case of an ordinal response taking values in \(\{y_1, y_2, \dots , y_K\}\), the transformation function is a monotone increasing step function, \(h(y_k \vert {\varvec{x}}) = \theta _k + {\varvec{x}}^\top {\varvec{\beta }}\), for \(k = 1, \dots , K - 1\), with the additional constraint \(\theta _K = + \infty \). We summarize a transformation model based on its inverse link function \(F_Z\), basis \({\varvec{b}}\), which may include covariates, and parameters \({\varvec{\vartheta }}\), such that \(F_{Y\vert {\varvec{x}}}(y\vert {\varvec{x}}) = F_Z\left( {\varvec{b}}(y,{\varvec{x}})^\top {\varvec{\vartheta }}\right) \). For instance, for a transformation model with continuous response and explanatory variables \({\varvec{x}}\) we use \({\varvec{b}}(y,{\varvec{x}}) = ({\varvec{b}}_{\text {Bs},P}(y)^\top , {\varvec{x}}^\top )^\top \) and \({\varvec{\vartheta }}= ({\varvec{\theta }}^\top , {\varvec{\beta }}^\top )^\top \), yielding \(h(y\vert {\varvec{x}}) = {\varvec{b}}_{\text {Bs},P}(y)^\top {\varvec{\theta }}+ {\varvec{x}}^\top {\varvec{\beta }}\). For a TM with ordinal response we substitute the Bernstein basis with a dummy encoding of the response, which we denote by \({\tilde{{\varvec{y}}}}\) (e.g., Kook et al. 2022). Note that the unconditional case is covered by the above formulation as well, by omitting all explanatory variables from the TM’s basis.
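To fix ideas, the following sketch evaluates such a shift TM with a Bernstein baseline transformation; the order \(P = 6\), the support \([0, 10]\), the logistic choice for \(F_Z\) and all parameter values are arbitrary illustrations, not estimates from the paper.

```python
import numpy as np
from math import comb
from scipy.stats import logistic

def bernstein_basis(y, P, low, high):
    """Bernstein basis b_{Bs,P}(y) of order P on the interval [low, high]."""
    t = (np.asarray(y, dtype=float) - low) / (high - low)
    return np.stack([comb(P, k) * t**k * (1 - t)**(P - k)
                     for k in range(P + 1)], axis=-1)

theta = np.array([-2.0, -1.0, -0.5, 1.0, 1.5, 3.0, 4.0])  # non-decreasing => monotone h
beta = np.array([0.3, -0.2])

def F_Y_given_x(y, x):
    """F_{Y|x}(y|x) = F_Z(b_{Bs,P}(y)' theta + x' beta), here with F_Z logistic."""
    h = bernstein_basis(y, len(theta) - 1, 0.0, 10.0) @ theta + x @ beta
    return logistic.cdf(h)

print(F_Y_given_x(np.array([1.0, 5.0, 9.0]), np.array([1.0, 0.5])))
```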
Figure 2 illustrates the intuition behind transformation models. The transformation function (upper right panel) transforms the complex, bimodal distribution of \(Y\) (lower panel) to \(F_Z= F_{{{\,\mathrm{MEV}\,}}}\), the standard minimum extreme value distribution (upper left panel). An analogous figure for ordinal outcomes is published in Kook et al. (2022, Fig. 1).
Definition 1
(Transformation model, Definition 4 in Hothorn et al. (2018)) The triple \((F_Z, {\varvec{b}}, {\varvec{\vartheta }})\) is called a transformation model.
Example 1
(Linear regression) The normal linear regression model (Lm) is commonly formulated as \(Y= \beta _0 + {\varvec{x}}^\top {\tilde{{\varvec{\beta }}}} + \varepsilon \), \(\varepsilon \sim \mathcal {N}\left( 0, \sigma ^2\right) \), or \(Y\vert {\varvec{x}}\sim {\mathcal {N}}\left( \beta _0 + {\varvec{x}}^\top {\tilde{{\varvec{\beta }}}}, \sigma ^2\right) .\) For a distributional treatment we write the above expression as
$$\begin{aligned} F_{Y\vert {\varvec{x}}}(y\vert {\varvec{x}}) = \varPhi \left( \frac{y- \beta _0 - {\varvec{x}}^\top {\tilde{{\varvec{\beta }}}}}{\sigma }\right) \end{aligned}$$
(3)
$$\begin{aligned} = {\varPhi }(\vartheta _1 + \vartheta _2 y- {\varvec{x}}^\top {\varvec{\beta }}), \end{aligned}$$
(4)
which can be understood as a transformation model by letting \(\vartheta _1 = - \beta _0 / \sigma \), \(\vartheta _2 = 1/\sigma \) and \({\varvec{\beta }}= {\tilde{{\varvec{\beta }}}} / \sigma \). Formally, it corresponds to the model
$$\begin{aligned} (F_Z, {\varvec{b}}, {\varvec{\vartheta }}) = \left( \varPhi , \left( 1, y, {\varvec{x}}^\top \right) ^\top , \left( \vartheta _1, \vartheta _2, -{\varvec{\beta }}^\top \right) ^\top \right) . \end{aligned}$$
Note that the baseline transformation, \(h(y\vert {\varvec{X}}= 0)\), is constrained to be linear with constant slope \(\vartheta _2\). Due to the linearity of \(h\) and the choice \(F_Z=\varPhi \), the modelled distribution of \(Y\vert {\varvec{x}}\) will always be normal with constant variance. By parametrizing \(h\) in a smooth way, we arrive at much more flexible conditional distributions for \(Y\vert {\varvec{x}}\).
The parameters of a TM can be jointly estimated using maximum likelihood. The likelihood can be written in terms of the inverse link function \(F_Z\), which makes its evaluation computationally convenient. For a single datum \((y, {\varvec{x}})\) with potentially censored response \(y\in (\underline{y}, {\overline{y}}]\), the log-likelihood contribution is given by (Lindsey et al. 1996)
$$\begin{aligned} \ell ({\varvec{\vartheta }}; y, {\varvec{x}}) = {\left\{ \begin{array}{ll} \log f_Z\left( {\varvec{b}}(y,{\varvec{x}})^\top {\varvec{\vartheta }}\right) + \log \left( {\varvec{b}}'(y,{\varvec{x}})^\top {\varvec{\vartheta }}\right) , & \text {exact,} \\ \log F_Z\left( {\varvec{b}}\left( \overline{y},{\varvec{x}}\right) ^\top {\varvec{\vartheta }}\right) , & \text {left,}\\ \log \left( 1 - F_Z\left( {\varvec{b}}(\underline{y},{\varvec{x}})^\top {\varvec{\vartheta }}\right) \right) , & \text {right,} \\ \log \left( F_Z\left( {\varvec{b}}\left( \overline{y},{\varvec{x}}\right) ^\top {\varvec{\vartheta }}\right) - F_Z\left( {\varvec{b}}(\underline{y},{\varvec{x}})^\top {\varvec{\vartheta }}\right) \right) , & \text {interval.} \end{array}\right. } \end{aligned}$$
The likelihood is always understood as conditional on \({\varvec{X}}\) when viewing the covariates as random. Allowing for censored observations is of practical importance, because in many applications the response of interest is not continuous or is measured with inaccuracies, which can be taken into account via uninformative censoring.
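The four cases translate directly into code; the helper below is our own illustrative transcription (the argument names and the default \(F_Z= \varPhi \) are assumptions), not the authors' implementation:

```python
import numpy as np
from scipy.stats import norm

def loglik_contribution(vartheta, b, b_prime, x, y=None, lower=None, upper=None,
                        F_Z=norm.cdf, f_Z=norm.pdf):
    """Log-likelihood contribution of one, possibly censored, datum.
    `b(y, x)` evaluates the basis and `b_prime(y, x)` its derivative in y."""
    if y is not None:                                   # exact observation
        return (np.log(f_Z(b(y, x) @ vartheta))
                + np.log(b_prime(y, x) @ vartheta))
    if lower is None:                                   # left-censored: y <= upper
        return np.log(F_Z(b(upper, x) @ vartheta))
    if upper is None:                                   # right-censored: y > lower
        return np.log(1.0 - F_Z(b(lower, x) @ vartheta))
    return np.log(F_Z(b(upper, x) @ vartheta)           # interval-censored
                  - F_Z(b(lower, x) @ vartheta))
```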
Example 2
(Lm, cont’d) For an exact datum \((y, {\varvec{x}})\) the log-likelihood in the normal linear regression model is given by
$$\begin{aligned} \ell (\vartheta _1, \vartheta _2, {\varvec{\beta }}; y, {\varvec{x}}) = \log \phi \big (\vartheta _1 + \vartheta _2 y- {\varvec{x}}^\top {\varvec{\beta }}\big ) + \log (\vartheta _2), \end{aligned}$$
using the density approximation to the likelihood (Lindsey et al. 1996). Here, \(\phi \) denotes the standard normal density, and \({\varvec{b}}'(y,{\varvec{x}})^\top {\varvec{\vartheta }}= \frac{\partial {\varvec{b}}(y,{\varvec{x}})^\top {\varvec{\vartheta }}}{\partial y} = \vartheta _2\).
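The equivalence of the Lm and TM parametrizations, including the Jacobian term \(\log (\vartheta _2)\), can be checked numerically (the parameter values below are arbitrary):

```python
import numpy as np
from scipy.stats import norm

beta0, beta_tilde, sigma = 1.0, np.array([0.5, -1.0]), 2.0   # arbitrary values
x, y = np.array([0.3, 0.7]), 0.8

v1, v2, beta = -beta0 / sigma, 1.0 / sigma, beta_tilde / sigma  # TM reparametrization

ll_tm = norm.logpdf(v1 + v2 * y - x @ beta) + np.log(v2)
ll_lm = norm.logpdf(y, loc=beta0 + x @ beta_tilde, scale=sigma)
assert np.isclose(ll_tm, ll_lm)   # both parametrizations give the same log-likelihood
```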
Now that we have established TMs and the log-likelihood function for estimating their parameters, we need a more general notion of residuals to formulate a causal regularizer for a distributional anchor loss. Most importantly, these residuals have to fulfill the same requirements as the least-squares residuals in the linear \(L_2\) anchor loss: they must have zero expectation and a positive definite covariance matrix (Theorem 3 in Rothenhäusler et al. 2021). In the survival analysis literature, score residuals have received considerable attention and fulfill the above requirements at least asymptotically (Lagakos 1981; Barlow and Prentice 1988; Therneau et al. 1990; Farrington 2000). We now define score residuals for the general class of transformation models.
Definition 2
(Score residuals) Let \((F_Z, {\varvec{b}}, {\hat{{\varvec{\vartheta }}}})\) be a TM with maximum-likelihood estimate \({\hat{{\varvec{\vartheta }}}}\). On the scale of the transformation function, add an additional intercept parameter \(-\alpha \) to arrive at the TM
$$\begin{aligned} \left( F_Z, \left( {\varvec{b}}^\top , 1\right) ^\top , \left( {\hat{{\varvec{\vartheta }}}}^\top , -\alpha \right) ^\top \right) \end{aligned}$$
with distribution function
$$\begin{aligned} F_{Y\vert {\varvec{x}}}(y\vert {\varvec{x}}) = F_Z\left( {\varvec{b}}(y,{\varvec{x}})^\top {\hat{{\varvec{\vartheta }}}} - \alpha \right) . \end{aligned}$$
Because \({\hat{{\varvec{\vartheta }}}}\) maximizes the likelihood, \(\alpha \) is constrained to 0. The score residual for a single datum \(y \in (\underline{y}, {\bar{y}}]\) is now defined as
$$\begin{aligned} r := \frac{\partial }{\partial \alpha } \ell ({\varvec{\vartheta }}, \alpha ; y, {\varvec{x}}) \bigg |_{{\hat{{\varvec{\vartheta }}}}, \alpha \equiv 0}, \end{aligned}$$
(5)
which can be understood as the score contribution of a single observation to test \(\alpha = 0\) for a covariate which is not included in the model. When viewed as a random variable, the vector of score residuals has mean zero asymptotically and its components are asymptotically uncorrelated (Farrington 2000).
The score residuals can be derived in closed form for a transformation model and observations under any form of uninformative censoring:
$$\begin{aligned} r = {\left\{ \begin{array}{ll} - f_Z'\left( {\varvec{b}}(y,{\varvec{x}})^\top {\hat{{\varvec{\vartheta }}}}\right) \big /f_Z\left( {\varvec{b}}(y,{\varvec{x}})^\top {\hat{{\varvec{\vartheta }}}}\right) , & \text {exact,} \\ - f_Z\left( {\varvec{b}}\left( \overline{y},{\varvec{x}}\right) ^\top {\hat{{\varvec{\vartheta }}}}\right) \big / F_Z\left( {\varvec{b}}\left( \overline{y},{\varvec{x}}\right) ^\top {\hat{{\varvec{\vartheta }}}}\right) , & \text {left,}\\ f_Z\left( {\varvec{b}}(\underline{y},{\varvec{x}})^\top {\hat{{\varvec{\vartheta }}}}\right) \big / \left( 1 - F_Z\left( {\varvec{b}}(\underline{y},{\varvec{x}})^\top {\hat{{\varvec{\vartheta }}}}\right) \right) , & \text {right,} \\ \left( f_Z\left( {\varvec{b}}(\underline{y},{\varvec{x}})^\top {\hat{{\varvec{\vartheta }}}}\right) - f_Z\left( {\varvec{b}}(\overline{y},{\varvec{x}})^\top {\hat{{\varvec{\vartheta }}}}\right) \right) \big / \left( F_Z\left( {\varvec{b}}(\overline{y},{\varvec{x}})^\top {\hat{{\varvec{\vartheta }}}}\right) - F_Z\left( {\varvec{b}}(\underline{y},{\varvec{x}})^\top {\hat{{\varvec{\vartheta }}}}\right) \right) , & \text {interval.} \end{array}\right. } \end{aligned}$$
(6)
Example 3
(Lm, cont’d) By including the additional intercept parameter in the normal linear model in Eq. (3), the score residuals are given by
$$\begin{aligned} \frac{\partial }{\partial \alpha } \ell (\vartheta _1, \vartheta _2, {\varvec{\beta }}, \alpha ; y, {\varvec{x}}) \bigg |_{{\hat{\vartheta }}_1, {\hat{\vartheta }}_2, {\hat{{\varvec{\beta }}}}, \alpha \equiv 0}&= \frac{\partial }{\partial \alpha } \left\{ \log \phi \left( \vartheta _1 + \vartheta _2 y- {\varvec{x}}^\top {\varvec{\beta }}-\alpha \right) + \log (\vartheta _2) \right\} \bigg |_{{\hat{\vartheta }}_1, {\hat{\vartheta }}_2, {\hat{{\varvec{\beta }}}}, \alpha \equiv 0} \\&= {\hat{\vartheta }}_1 + {\hat{\vartheta }}_2 y- {\varvec{x}}^\top {\hat{{\varvec{\beta }}}} = \frac{y - {\hat{\beta }}_0 - {\varvec{x}}^\top \hat{{\tilde{{\varvec{\beta }}}}}}{{\hat{\sigma }}}. \end{aligned}$$
In this simple case, the score residuals are equivalent to scaled least-squares residuals, which underlines the more general nature of score residuals. In Sect. 3.1 and Appendix C, we give further examples and intuition on score residuals in non-linear and non-Gaussian settings.
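A sketch of the exact-observation case of Eq. (6); the numerical derivative is our own simplification so that any smooth density \(f_Z\) can be plugged in, and the final check confirms the reduction to the scaled least-squares residual when \(F_Z= \varPhi \):

```python
import numpy as np
from scipy.stats import norm

def score_residual_exact(z, f_Z=norm.pdf, eps=1e-6):
    """Score residual -f_Z'(z)/f_Z(z) for an exact observation, where
    z = b(y, x)' vartheta_hat; the derivative is approximated numerically."""
    d_f = (f_Z(z + eps) - f_Z(z - eps)) / (2.0 * eps)
    return -d_f / f_Z(z)

# For F_Z = Phi one has -phi'(z)/phi(z) = z, so the score residual reduces
# to the scaled least-squares residual of Example 3.
z = np.linspace(-3.0, 3.0, 7)
assert np.allclose(score_residual_exact(z), z, atol=1e-5)
```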
We are now ready to cast transformation models into the framework of SEMs. Here, it is natural to view the response \(Y\) as a deterministic function of the transformed random variable \(Z\sim F_Z\), given by the inverse transformation function \(h^{-1}\) in the following definition.
Definition 3
(Structural equation transformation model) Let the conditional distribution of \(Y\vert {\varvec{X}}, {\varvec{H}}, {\varvec{A}}\) be given by the transformation model \(F_{Y\vert {\varvec{X}}, {\varvec{H}}, {\varvec{A}}} = F_Z\circ h\). The structural equation for the response is a deterministic function of \({\varvec{X}}\), \({\varvec{H}}\), \({\varvec{A}}\) and the exogenous \(Z\), which, by definition, is distributed according to \(F_Z\) and independent of \(({\varvec{X}}, {\varvec{H}}, {\varvec{A}})\). Relationships other than the transformation function are assumed to be linear. Taken together, the following SEM defines a (partially) linear structural equation transformation model
$$\begin{aligned} Y&\leftarrow g(Z, {\varvec{X}}, {\varvec{H}}, {\varvec{A}}) := h^{-1}(Z\vert {\varvec{X}}, {\varvec{H}}, {\varvec{A}}) \\ {\varvec{X}}&\leftarrow {\mathbf {B}}_{{\varvec{X}}{\varvec{X}}} {\varvec{X}}+ {\mathbf {B}}_{{\varvec{X}}{\varvec{H}}} {\varvec{H}}+ {\mathbf {M}}_{\varvec{X}}{\varvec{A}}+ {\varvec{\varepsilon }}_{\varvec{X}}\\ {\varvec{H}}&\leftarrow {\mathbf {B}}_{{\varvec{H}}{\varvec{H}}} {\varvec{H}}+ {\mathbf {M}}_{\varvec{H}}{\varvec{A}}+ {\varvec{\varepsilon }}_{\varvec{H}}\\ {\varvec{A}}&\leftarrow {\varvec{\varepsilon }}_{\varvec{A}}\\ Z&\sim F_Z, \end{aligned}$$
where \({\varvec{\varepsilon }}_{\varvec{X}}, {\varvec{\varepsilon }}_{\varvec{H}}, {\varvec{A}}, Z\) are jointly independent.
As always, the structural equations are defined to hold as statements in distribution. By Corollary 1 in Hothorn et al. (2018), the transformation function \(h\) and its inverse exist, are unique, and are monotone non-decreasing in \(Y\) and \(Z\), respectively. In contrast to the linear SEM in Eq. (1), the SEM in Definition 3 involves a transformed response and a potentially non-linear inverse transformation \(g\).
However, from the perspective of transformation models it is more natural to parametrize the transformation function \(h\) instead of its inverse, because parameters in linear TMs are readily interpretable on this scale. For the empirical evaluation of the proposed estimator, we set up the transformation function as
$$\begin{aligned} h(Y\vert {\varvec{X}}, {\varvec{H}}, {\varvec{A}}) = {\varvec{b}}(y)^\top {\varvec{\theta }}- {\varvec{\beta }}^\top {\varvec{X}}- {\mathbf {B}}_{Y{\varvec{H}}}{\varvec{H}}- {\mathbf {M}}_{Y}{\varvec{A}}. \end{aligned}$$
(7)
A graphical representation of the SEM in Definition 3 is shown in Fig. 3. The basis expansion \({\varvec{b}}(y)^\top {\varvec{\theta }}\) in Eq. (7) can be viewed as an intercept function, which fixes the overall shape of the transformation function. The remaining additive components of the transformation function, in turn, solely shift the transformation up- or downwards with the covariates. This may seem restrictive at first; however, all covariates influence not only the conditional mean but all higher conditional moments of \(F_{Y\vert {\varvec{X}}, {\varvec{H}}, {\varvec{A}}}\). We do not display the possibility that some components of \({\varvec{X}}\) directly influence each other, and likewise for \({\varvec{H}}\). In fact, in the simulations in Sect. 4, the coefficients \({\mathbf {B}}_{{\varvec{X}}{\varvec{X}}}\) and \({\mathbf {B}}_{{\varvec{H}}{\varvec{H}}}\) are zero.
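Sampling from this SEM is straightforward: draw \(Z\sim F_Z\) and solve \(h(y) = z\) for \(y\) by root finding, exploiting the monotonicity of \(h\) in \(y\). A minimal sketch, with a made-up baseline transformation, made-up coefficients and a logistic choice for \(F_Z\):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import logistic

rng = np.random.default_rng(0)

def h(y, x, h_conf, a):
    """Shift transformation in the spirit of Eq. (7): a strictly increasing
    baseline in y minus linear shift terms (all coefficients made up)."""
    return y + 0.3 * y**3 - 0.5 * x - 0.8 * h_conf - 0.4 * a

def sample_Y(x, h_conf, a):
    """Draw Y = h^{-1}(Z | X, H, A) by root finding, using monotonicity of h."""
    z = logistic.rvs(random_state=rng)   # Z ~ F_Z, here standard logistic
    return brentq(lambda y: h(y, x, h_conf, a) - z, -50.0, 50.0)

print(sample_Y(x=1.0, h_conf=0.2, a=-0.5))
```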
Next, we present our main proposal, distributional anchor regression, which achieves TMs that are robust with respect to perturbations on the anchor variables \({\varvec{A}}\).