Appendix 1 Counterfactual Inference
Here we derive Eq. 4 via Pearl's counterfactual inference protocol, which involves three steps: abduction, action, and prediction
[17]. Our model can be represented with the following structural equations over the graph structure in Fig. 2:
$$\begin{aligned} \mathsf {J}&:= \epsilon _{\mathsf {J}}, \quad \mathsf {Z}:= \epsilon _\mathsf {Z}, \quad \mathsf {X}:= \epsilon _\mathsf {X}, \quad \mathsf {T}:= g(\mathsf {J},\mathsf {X},\mathsf {Z},\epsilon _{\mathsf {T}}), \quad \mathsf {Y}:= f(\mathsf {T},\mathsf {X},\mathsf {Z},\epsilon _\mathsf {Y}). \end{aligned}$$
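As an illustration of these structural equations, the following is a minimal simulation sketch in Python. The logistic forms of \(g\) and \(f\), all parameter values, and the convention that \(\mathsf {Y} =1\) whenever \(\mathsf {T} =0\) (one "special form" of \(f\) consistent with the abduction step described below) are assumptions made for exposition, not the paper's exact specification.

```python
# A minimal simulation sketch of the structural equations above. The logistic
# forms of g and f, the parameter values, and the convention Y = 1 whenever
# T = 0 are illustrative assumptions, not the paper's exact specification.
import numpy as np

rng = np.random.default_rng(0)
n_judges = 10
alpha = rng.normal(0.0, 1.0, size=n_judges)         # decision-maker intercepts

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def simulate(n, gamma_x=1.0, gamma_z=1.0, beta_x=1.0, beta_z=1.0):
    j = rng.integers(0, n_judges, size=n)            # J := eps_J (decision-maker id)
    z = rng.normal(size=n)                           # Z := eps_Z (unobserved confounder)
    x = rng.normal(size=n)                           # X := eps_X (observed features)
    eps_t = rng.uniform(size=n)
    t = (eps_t < sigmoid(alpha[j] + gamma_x * x + gamma_z * z)).astype(int)  # T := g(J,X,Z,eps_T)
    eps_y = rng.uniform(size=n)
    y1 = (eps_y < sigmoid(beta_x * x + beta_z * z)).astype(int)              # outcome if T = 1
    # Y := f(T,X,Z,eps_Y); the assumed "special form" of f fixes Y = 1 when T = 0,
    # so the evidence carries no information about eps_Y for negative decisions.
    y = np.where(t == 1, y1, 1)
    return j, x, z, t, y
```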
For cases where \(\mathsf {T} =0\) in the data, we calculate the counterfactual value of \(\mathsf {Y} \) had we instead had \(\mathsf {T} =1\). We assume here that all parameters, functions and distributions are known. In the abduction step we determine \(\mathbf {P}(\epsilon _\mathsf {J}, \epsilon _\mathsf {Z}, \epsilon _\mathsf {X}, \epsilon _{\mathsf {T}},\epsilon _\mathsf {Y} |j,x,\mathsf {T} =0)\), the distribution of the stochastic disturbance terms updated to account for the observed evidence on the decision maker, the observed features and the decision (given the decision \(\mathsf {T} =0\), the disturbances are independent of \(\mathsf {Y} \)). We directly know \(\epsilon _\mathsf {X} =x \) and \(\epsilon _{\mathsf {J}}=j \). Due to the special form of \(f\), the observed evidence is independent of \(\epsilon _\mathsf {Y} \) when \(\mathsf {T} = 0\), so we only need to determine \(\mathbf {P}(\epsilon _\mathsf {Z},\epsilon _{\mathsf {T}}|j,x,\mathsf {T} =0)\). Next, the action step intervenes on \(\mathsf {T} \), setting \(\mathsf {T} =1\). Finally, in the prediction step, we estimate \(\mathsf {Y} \).
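In the notation above, this expectation can be written as follows (we write \(\mathsf {Y} _{\mathsf {T} \leftarrow 1}\) for the counterfactual outcome; the notation of Eq. 4 in the main text may differ):
$$\begin{aligned} \mathbf {E}(\mathsf {Y} _{\mathsf {T} \leftarrow 1}\,|\,j,x,\mathsf {T} =0)&= \int \!\!\!\int \!\!\!\int f(1,x,\epsilon _\mathsf {Z},\epsilon _\mathsf {Y})\, \mathbf {P}(\epsilon _\mathsf {Y})\, \mathbf {P}(\epsilon _\mathsf {Z},\epsilon _{\mathsf {T}}\,|\,j,x,\mathsf {T} =0)\, \mathrm {d}\epsilon _\mathsf {Y} \,\mathrm {d}\epsilon _{\mathsf {T}}\,\mathrm {d}\epsilon _\mathsf {Z} \\ &= \int \mathbf {E}(\mathsf {Y} \,|\,\mathsf {T} =1,x,z)\, \mathbf {P}(z\,|\,j,x,\mathsf {T} =0)\, \mathrm {d}z, \end{aligned}$$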
where we used \(\epsilon _\mathsf {Z} =z \) and integrated out \(\epsilon _{\mathsf {T}} \) and \(\epsilon _\mathsf {Y} \). This gives us the counterfactual expectation of \(\mathsf {Y} \) for a single subject.
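Continuing the illustrative logistic model from the sketch above, the routine below carries out the three steps for a single subject with an observed negative decision. The self-normalised importance-sampling scheme is an expository assumption, not necessarily the estimator used in the paper.

```python
# Abduction, action and prediction for one subject with observed decision T = 0,
# under the illustrative logistic model of the previous sketch.
def counterfactual_outcome_if_t1(j_obs, x_obs, gamma_x=1.0, gamma_z=1.0,
                                 beta_x=1.0, beta_z=1.0, n_samples=100_000):
    # Abduction: draw eps_Z from its prior and re-weight each draw by the
    # likelihood of the observed evidence, P(T = 0 | j, x, z).
    z = rng.normal(size=n_samples)
    w = 1.0 - sigmoid(alpha[j_obs] + gamma_x * x_obs + gamma_z * z)
    w /= w.sum()
    # Action: set T = 1 by intervention; the disturbance distributions are kept.
    # Prediction: average E(Y | T = 1, x, z) over the updated distribution of Z,
    # marginalising out eps_T and eps_Y as in the text.
    return float(np.sum(w * sigmoid(beta_x * x_obs + beta_z * z)))

# Example: approximate the counterfactual expectation for the first subject
# with a negative decision in a simulated data set.
j, x, z, t, y = simulate(1000)
print(counterfactual_outcome_if_t1(j[t == 0][0], x[t == 0][0]))
```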
Appendix 2 On the Priors of the Bayesian Model
The priors for \(\gamma _\mathsf {X},~\beta _\mathsf {X},~\gamma _\mathsf {Z} \) and \(\beta _\mathsf {Z} \) were defined using the gamma-mixture representation of Student's t-distribution with \(\nu =6\) degrees of freedom. The gamma-mixture is obtained by first sampling a precision parameter from \(\varGamma (\nu /2, \nu /2) = \varGamma (3, 3)\) and then drawing the coefficient from a zero-mean Gaussian with that precision. This construction, with the precision parameters \(\eta _\mathsf {Z},~\eta _{\beta _\mathsf {X}}\) and \(\eta _{\gamma _\mathsf {X}}\), is shown below. For vector-valued \(\mathsf {X}\), the components of \(\gamma _\mathsf {X} \) (\(\beta _\mathsf {X} \)) were sampled independently with a joint precision parameter \(\eta _{\gamma _\mathsf {X}}\) (\(\eta _{\beta _\mathsf {X}}\)). The coefficients for the unobserved confounder \(\mathsf {Z}\) were restricted to positive values to ensure identifiability.
$$\begin{aligned} \eta _\mathsf {Z}, \eta _{\beta _\mathsf {X}}, \eta _{\gamma _\mathsf {X}} \sim \varGamma (3, 3), \; \gamma _\mathsf {Z},\beta _\mathsf {Z} \sim N_+(0, \eta _\mathsf {Z} ^{-1}),\; \gamma _\mathsf {X} \sim N(0, \eta _{\gamma _\mathsf {X}}^{-1}),\; \beta _\mathsf {X} \sim N(0, \eta _{\beta _\mathsf {X}}^{-1}) \end{aligned}$$
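As a check on this construction, the short numpy sketch below draws a coefficient via the gamma-mixture; marginally the draws follow a Student's t-distribution with \(\nu =6\) degrees of freedom. The variable names are illustrative, and numpy's gamma sampler takes a scale rather than a rate, so rate 3 corresponds to scale 1/3.

```python
# Gamma-mixture draw of a coefficient: precision ~ Gamma(nu/2, rate = nu/2),
# coefficient ~ N(0, 1/precision); marginally this is Student's t with nu = 6.
import numpy as np

rng = np.random.default_rng(1)
nu = 6
precision = rng.gamma(shape=nu / 2, scale=2 / nu, size=1_000_000)   # eta ~ Gamma(3, 3)
coef = rng.normal(0.0, 1.0 / np.sqrt(precision))                    # e.g. gamma_X | eta
# For gamma_Z and beta_Z the draws would additionally be restricted to
# positive values, e.g. by taking np.abs(coef).
```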
The intercepts \(\alpha _j\) for the decision makers in the data and \(\alpha _\mathsf {Y}\) for the outcome \(\mathsf {Y}\) had hierarchical Gaussian priors with variances \(\sigma _\mathsf {T} ^2\) and \(\sigma _\mathsf {Y} ^2\), respectively; the decision-maker intercepts shared the joint variance parameter \(\sigma _\mathsf {T} ^2\).
$$\begin{aligned} \sigma _\mathsf {T} ^2,~\sigma _\mathsf {Y} ^2 \sim N_+(0, \tau ^2),\quad \alpha _j \sim N(0, \sigma _\mathsf {T} ^2),\quad \alpha _\mathsf {Y} \sim N(0, \sigma _\mathsf {Y} ^2) \end{aligned}$$
The parameters \(\sigma _\mathsf {T} ^2\) and \(\sigma _\mathsf {Y} ^2\) were drawn independently from Gaussian distributions with mean 0 and variance \(\tau ^2=1\), and restricted to the positive real axis.
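For completeness, a prior-sampling sketch of these intercepts in the same style (the number of decision makers and the seed are arbitrary):

```python
# Half-normal hyperpriors on the variances and hierarchical Gaussian intercepts,
# as described above (tau = 1; names are illustrative).
import numpy as np

rng = np.random.default_rng(2)
tau = 1.0
n_judges = 10

sigma2_t = np.abs(rng.normal(0.0, tau))                       # sigma_T^2 ~ N_+(0, tau^2)
sigma2_y = np.abs(rng.normal(0.0, tau))                       # sigma_Y^2 ~ N_+(0, tau^2)
alpha_j = rng.normal(0.0, np.sqrt(sigma2_t), size=n_judges)   # alpha_j ~ N(0, sigma_T^2)
alpha_y = rng.normal(0.0, np.sqrt(sigma2_y))                  # alpha_Y ~ N(0, sigma_Y^2)
```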