The mortality hazard rate
We seek to identify the impact of education level on the mortality risk for the men in our sample of conscripts. However, mortality may be influenced by factors that also determine the education choice. This may render education a selective choice and make it endogenous to mortality later in life. We follow a propensity score method to account for selection on observed characteristics and estimate the effect of education on the mortality rate. Figure 1 provides a graphical illustration of the relationship between cognitive ability, education and mortality later in life using a directed acyclic graph (DAG), where each arrow represents a causal path (Pearl 2000, 2012). It states that early childhood characteristics X, such as parental background and family size, influence the education choice D, the unmeasured childhood (pre-age 18) factors, \(U_0\), and the cognitive ability at age 18, \(Q_{18}\). The latter is also influenced by other childhood factors, which may include early life cognitive ability, and the education followed up to age 18. In our data, we do not observe these childhood factors (\(U_0\)).
We define the treatment effect of moving up one education level in terms of a proportional change in the (mortality) hazard rate. First, we discuss the assumptions, common in the potential outcomes literature that uses propensity score methods, needed to identify the impact of education on the mortality risk. In Sect. 2.2, we extend this to decompose the effect of education on the mortality rate into an effect running through improvement in cognitive ability and an effect running through other pathways. The main difference with standard propensity score methods is that we use potential hazard rates: the hazard rate that would be observed if the individual were untreated, \(\lambda (t|0)\), or treated, \(\lambda (t|1)\). Let \(D_i =1\) denote treatment, i.e. moving up one education level. We observe pre-treatment (educational level) covariates X that influence the education choice.
Assumption 1
(Unconfoundedness) \(\lambda (t|d) \bot \; D|X\) for \(d=0,1\)
where \(\bot \) denotes independence. The unconfoundedness assumption (Rubin 1974; Rosenbaum and Rubin 1983) asserts that, conditional on covariates X, treatment assignment (education level) is independent of the potential outcomes. This assumption requires that all variables that affect both mortality and the education choice are observed. Note that this does not imply that we assume all relevant covariates are observed. Any missing factor is allowed to influence either the outcome or the education choice, but not both. We assess the sensitivity of our estimates to violations of this rather strong unconfoundedness assumption by including an additional simulated binary variable that captures unobservables (Nannicini 2007; Ichino et al. 2008).
The overlap, or common support, assumption requires that the propensity score, the conditional probability of choosing a higher education level given covariates X, is bounded away from zero and one. In our data, we distinguish four (ordered) education levels in line with the contemporary Dutch education system (see Sect. 3). By comparing only adjacent education levels, we remove the overlap problems.
Rosenbaum and Rubin (1983) show that if the potential outcomes are independent of treatment conditional on covariates X, they are also independent of treatment conditional on the propensity score, \(p(x)=\Pr (D=1|X=x)\). Hence if unconfoundedness holds, all biases due to observable covariates can be removed by conditioning on the propensity score (Imbens 2004). The average effects can be estimated by matching or weighting on the propensity score. Here, we use weighting on the propensity score. Inverse probability weighting based on the propensity score creates a pseudo-population in which the education choice is independent of the measured confounders. The pseudo-population is the result of assigning to each individual a weight that is proportional to the inverse of the probability of the education level actually chosen given X, that is, the propensity score for the treated and one minus the propensity score for the controls. Inverse probability weighting (IPW) estimation is usually based on normalized weights that add to unity within the treated and within the control group:
$$\begin{aligned} W_{i} = \biggl [\frac{D_i}{\hat{p}(X_i)}\Biggr / \sum _{j=1}^n \frac{D_j}{\hat{p}(X_j)}\biggr ] + \biggl [\frac{(1-D_i)}{1-\hat{p}(X_i)}\Biggr / \sum _{j=1}^n \frac{1-D_j}{1-\hat{p}(X_j)}\biggr ] \end{aligned}$$
(1)
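As an illustration of (1), the following is a minimal Python sketch of the normalized weights. The function name, the toy treatment indicators and the toy propensity scores are hypothetical and serve only to show the arithmetic; in practice \(\hat{p}(X_i)\) would come from an estimated propensity score model.

```python
import numpy as np

def normalized_ipw_weights(d, p_hat):
    """Normalized inverse probability weights following (1).

    d     : 0/1 treatment indicators D_i
    p_hat : estimated propensity scores, strictly between 0 and 1
    """
    d = np.asarray(d, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    raw_treated = d / p_hat                  # non-zero only for the treated
    raw_control = (1.0 - d) / (1.0 - p_hat)  # non-zero only for the controls
    # Scale each group's raw weights so that they add to unity.
    return raw_treated / raw_treated.sum() + raw_control / raw_control.sum()

# Hypothetical toy data (not taken from the conscription sample)
d = np.array([1, 1, 0, 0, 0])
p_hat = np.array([0.8, 0.4, 0.5, 0.2, 0.3])
w = normalized_ipw_weights(d, p_hat)
```

In this toy example the treated individual with the lower propensity score (0.4) receives twice the weight of the one with the higher score (0.8), since individuals who were unlikely to choose the observed education level are up-weighted.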
In survival analysis, it is standard to compare the (nonparametric) Kaplan–Meier curves for the treated and the controls. The unadjusted survival curves may be misleading due to confounding. Cole and Hernán (2004) describe a method to estimate the IPW adjusted survival curves. Biostatisticians usually focus on Cox regression models and Cole and Hernán (2004) describe how Cox proportional hazard models can be weighted by the inverse propensity score to estimate causal effects of treatments. This method is related to the g-computation algorithm of Robins and Rotnitzky (1992) and Robins et al. (2000).
In economics the interest is often also in the duration dependence of the hazard. The Gompertz hazard, which assumes that the hazard increases exponentially with age, \(\lambda _0(t) = e^{\alpha _0 + \alpha _1 t}\), is known to provide accurate mortality hazards (Gavrilov and Gavrilova 1991). However, it is hardly ever possible to include all relevant factors, either because the researcher does not know all the relevant factors or because it is not possible to measure them. Ignoring such unobserved heterogeneity or frailty may have a huge impact on inference in proportional hazard models, see e.g. Van den Berg (2001). A common solution is to use a Mixed Proportional Hazard (MPH) model, in which it is assumed that all unmeasured factors and measurement error can be captured in a multiplicative random term V. The hazard rate becomes
$$\begin{aligned} \lambda (t|D,V) = V\lambda _{0}(t) \exp (\gamma D), \end{aligned}$$
(2)
The (random) frailty \(V>0\) is time-invariant and independent of the observed characteristics X and treatment D. Note that independence of V and D is crucial; otherwise, Assumption 1 would be violated. So, we assume that some factors influencing the mortality rate are not observed and that these factors do not influence the education choice. In the empirical application, it is assumed that V has a gamma distribution, a common assumption used in the empirical literature.
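To make the combination of a Gompertz baseline and gamma frailty in (2) concrete, the sketch below evaluates the survival function after integrating out V. It assumes, as a normalization that is not stated in the text, that V has unit mean and variance \(\sigma ^2\), in which case the marginal survival function is \(\bigl (1+\sigma ^2 \varLambda _0(t)e^{\gamma D}\bigr )^{-1/\sigma ^2}\) with \(\varLambda _0(t)=e^{\alpha _0}(e^{\alpha _1 t}-1)/\alpha _1\). Function names and parameter values are hypothetical.

```python
import numpy as np

def gompertz_cum_hazard(t, alpha0, alpha1):
    """Integrated Gompertz baseline hazard; requires alpha1 != 0."""
    return np.exp(alpha0) * np.expm1(alpha1 * t) / alpha1

def marginal_survival(t, d, alpha0, alpha1, gamma, sigma2):
    """Survival function of the MPH model (2) with V integrated out.

    Assumes V is gamma distributed with mean 1 and variance sigma2 > 0.
    """
    cum = gompertz_cum_hazard(t, alpha0, alpha1) * np.exp(gamma * d)
    return (1.0 + sigma2 * cum) ** (-1.0 / sigma2)

# Hypothetical parameter values, for illustration only
ages = np.array([0.0, 20.0, 40.0, 60.0])
s_control = marginal_survival(ages, 0, alpha0=-7.0, alpha1=0.09, gamma=-0.2, sigma2=0.5)
s_treated = marginal_survival(ages, 1, alpha0=-7.0, alpha1=0.09, gamma=-0.2, sigma2=0.5)
```

As the frailty variance goes to zero the expression approaches the usual proportional hazard survival function \(\exp \bigl (-\varLambda _0(t)e^{\gamma D}\bigr )\).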
To adjust for confounding, we estimate a standard MPH model that does not include the measured confounders as covariates, using the re-weighted pseudo-population. Fitting a (mixed) proportional hazard model in the pseudo-population is equivalent to fitting a weighted MPH model in the original sample. The parameters of such weighted MPH models can be used to estimate the causal effects of education on mortality in the original sample. The IPW estimator in the (M)PH model is obtained by setting the weighted derivatives of the log-likelihood to zero, i.e. by solving \(L(\theta )=0\) with
$$\begin{aligned} L(\theta ) = \sum _{i=1}^N W_i \biggl [ \delta _i\frac{\partial \log \lambda (t_i|\cdot )}{\partial \theta } - \frac{\partial \varLambda (t_i|\cdot )}{\partial \theta }\biggr ] \end{aligned}$$
(3)
where \(\theta \) is the vector of parameters of the hazard in (2), \(\varLambda (t|\cdot )= \int _0^t \lambda (s|\cdot )\, \mathrm{d}s\) is the integrated hazard, and \(\delta _i\) indicates whether the duration for individual i is censored (\(\delta _i=0\)) or not (\(\delta _i=1\)).Footnote 1
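The weighted score in (3) can be sketched for the simplest case, a Gompertz proportional hazard model without frailty, \(\lambda (t|D)=\exp (\alpha _0+\alpha _1 t+\gamma D)\). Dropping V is a simplification made here only to keep the example short; it is not the specification used in the empirical application. All names and toy values are hypothetical.

```python
import numpy as np

def cum_hazard(theta, t, d):
    """Integrated hazard for lambda(t|D) = exp(a0 + a1*t + g*D), with a1 != 0."""
    a0, a1, g = theta
    return np.exp(a0 + g * d) * np.expm1(a1 * t) / a1

def weighted_loglik(theta, t, d, delta, w):
    """Weighted log-likelihood: sum_i W_i [delta_i log lambda(t_i) - Lambda(t_i)]."""
    a0, a1, g = theta
    log_haz = a0 + a1 * t + g * d
    return np.sum(w * (delta * log_haz - cum_hazard(theta, t, d)))

def weighted_score(theta, t, d, delta, w):
    """Analytic weighted score, the analogue of (3), ordered (a0, a1, g)."""
    a0, a1, g = theta
    cum = cum_hazard(theta, t, d)
    # Derivative of the integrated hazard with respect to a1
    dcum_da1 = np.exp(a0 + g * d) * (
        t * np.exp(a1 * t) / a1 - np.expm1(a1 * t) / a1**2
    )
    return np.array([
        np.sum(w * (delta - cum)),
        np.sum(w * (delta * t - dcum_da1)),
        np.sum(w * d * (delta - cum)),
    ])

# Hypothetical toy data: durations, treatment, event indicator, weights
t = np.array([10.0, 25.0, 40.0, 55.0])
d = np.array([1.0, 0.0, 1.0, 0.0])
delta = np.array([0.0, 1.0, 1.0, 0.0])
w = np.array([0.3, 0.6, 0.7, 0.4])
theta = np.array([-6.0, 0.08, -0.2])
score = weighted_score(theta, t, d, delta, w)
```

The IPW estimate is the value of \(\theta \) at which this score vector is zero; in practice one would pass the negative of `weighted_loglik` to a numerical optimizer rather than solve the equations by hand.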
Mediation analysis for the mortality hazard rate
In this section, we discuss a model in which cognitive ability measured at age 18 mediates the impact of education on mortality. Mediation analysis aims to unravel the underlying causal mechanism into an effect running through changes of an intermediate variable, the mediator, and an effect running through other pathways. The counterfactual notation for average treatment effects can be extended to define causal mediation (see Huber 2014). We are particularly interested in the mediating effect of cognitive ability on mortality. It has been shown that high levels of cognitive ability are positively associated with high education (Ceci 1991; Hansen et al. 2004). Recent research (Falch and Massih 2011; Banks and Mazzonna 2012; Schneeweis et al. 2014; Carlsson et al. 2015; Dahmann 2017) has shown that one additional year of education improves intelligence by up to 0.3 standard deviations, both for the US and for some European countries. We use \(Q_i\) to denote the observed cognitive ability (IQ-score), which is measured around age 18 when the men had their military examination and after they had completed secondary schooling. The mediation model we assume is illustrated by the DAG in Fig. 1.
Traditionally, causal mediation analysis has been formulated within the framework of linear structural models (Baron and Kenny 1986). Recent papers have placed causal mediation analysis within the counterfactual/potential outcomes framework (Imai et al. 2010a, b; Huber 2014). In the previous section, the potential outcome was solely a function of the treatment, i.e. the education choice, but in mediation analysis the potential outcomes also depend on the mediator. Because cognitive ability can be affected by the education attained,Footnote 2 there exist two potential values, \(Q_i(1)\) and \(Q_i(0)\), only one of which will be observed, i.e. \(Q_i=D_i\cdot Q_i(1) + (1-D_i)\cdot Q_i(0)\). For example, if individual i actually attained education level 1, we would observe \(Q_i(1)\) but not \(Q_i(0)\). Next, we use \(\lambda _i\bigl (t|d, q(d)\bigr )\) to denote the potential mortality hazard that would result if education equals d and cognitive ability equals q. For example, in the conscription data, \(\lambda _i\bigl (t|1, 110\bigr )\) represents the mortality hazard that would have been observed if individual i had education level 1 and a measured IQ-score of 110. As before, we only observe one of the multiple potential hazards, \(\lambda _i=\lambda _i\bigl (t|D_i, Q_i(D_i)\bigr )\).
Because we base our treatment effect on (mixed) proportional hazard models, it is again natural to define the mediator effects proportionally. Abbring and Berg (2003) also define, in a different setting with a dynamic treatment, a proportional treatment effect for a duration outcome. In other nonlinear settings, such as count data regression, a proportional treatment effect has been defined (Lee and Kobayashi 2001). We define the average effect of other pathways, depending on treatment status d:
Assumption 2
Proportional decomposition
$$\begin{aligned} \theta (d) = \frac{\mathrm {E}\Bigl [\lambda \bigl (t|1, Q(d)\bigr )\Bigr ]}{\mathrm {E}\Bigl [\lambda \bigl (t|0, Q(d)\bigr )\Bigr ]} \end{aligned}$$
(4)
This framework enables us to disentangle the underlying causal pathway from education to mortality into an effect of education through improvement of cognitive ability and an effect through other pathways. We assume conditional independence (given X) of the treatment and the mediator:
Assumption 3
Sequential ignorability: \(\{\lambda (t|d',q), Q(d)\} \bot D|X\) and \(\lambda (t|d',q) \bot Q|D=d,X\), \(\forall d, d' =0,1\) and q in the support of Q.
The first condition of Assumption 3 implies that, conditional on observed covariates X, no unobserved confounder exists that jointly affects the education choice, the cognitive ability and the mortality. The second condition implies that, conditional on observed covariates X and the education attained, no unobserved confounder exists that jointly affects cognitive ability and mortality. This would imply that X explains all the variation in \(U_0\) or that \(U_0\) does not (directly) affect education, the dashed line in Fig. 1. Huber (2014) and Imai et al. (2010a) make the same assumptions for identification of the direct and indirect effects in a linear model. Assumption 3 is a strong assumption and nonrefutable. We therefore carry out a set of sensitivity analyses to quantify the robustness of our empirical findings to violations of the sequential ignorability assumption, based on an extension of the sensitivity analyses of Nannicini (2007) and Ichino et al. (2008). We focus, in particular, on how the possibility of selection into education based on cognitive ability may influence our results. We also impose a common support restriction for the propensity score that includes the mediator.
In addition, we assume independent censoringFootnote 3 and a proportional mediator effect \(\theta (d)\):
Assumption 4
(Independent censoring) Censoring is, conditional on the treatment D, independent of the covariates X, the outcome T and the mediator Q.
Assumption 5
(Proportional mediator effect) \(\lambda \bigl (t|1,Q(d)\bigr ) = e^{\theta (d)}\lambda \bigl (t|0,Q(d)\bigr )\).
This is equivalent to assuming that the effect of the treatment, D, is not moderated by the value of the mediator. Thus, we assume no interaction effect, \(D\cdot Q\), in the hazard. Note that Assumption 5 does not rule out an MPH model. It only assumes that the unobserved heterogeneity is independent of the treatment D (as before) and the mediator Q. This leads to the following identification theorem for the effect of a treatment on the hazard running through other pathways (holding the mediator constant):
Theorem 1
(Identification of other pathways effect \(\theta (d)\)) Under Assumptions 1–5, the other pathways effect is identified through a weighted MPH regression with weights:
$$\begin{aligned} W(d) = \frac{\Pr (D=d|Q,X)}{\Pr (D=d|X)}\biggl (\frac{D}{\Pr (D=1|Q,X)}+\frac{1-D}{\Pr (D=0|Q,X)}\biggr ) \end{aligned}$$
(5)
with weight W(d) for \(\theta (d)\), for \(d=0,1\).
(See Appendix A for the proof.)
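The weights in (5) only require the two propensity scores \(\Pr (D=1|X)\) and \(\Pr (D=1|Q,X)\). A minimal Python sketch of the computation follows; the function name, the optional within-group normalization flag and the toy propensity scores are hypothetical, and the scores would in practice be fitted values from estimated models.

```python
import numpy as np

def other_pathways_weights(d_obs, p_x, p_qx, d, normalize=False):
    """Weights W(d) from (5) for the other pathways effect theta(d).

    d_obs : observed 0/1 treatment D
    p_x   : Pr(D=1 | X), propensity score without the mediator
    p_qx  : Pr(D=1 | Q, X), propensity score including the mediator
    d     : 0 or 1, the treatment status at which the mediator is held
    """
    d_obs = np.asarray(d_obs, dtype=float)
    p_x = np.asarray(p_x, dtype=float)
    p_qx = np.asarray(p_qx, dtype=float)
    ratio = p_qx / p_x if d == 1 else (1.0 - p_qx) / (1.0 - p_x)
    inverse_prob = d_obs / p_qx + (1.0 - d_obs) / (1.0 - p_qx)
    w = ratio * inverse_prob
    if normalize:
        # Rescale so that weights add to unity within each treatment group.
        for group in (0.0, 1.0):
            mask = d_obs == group
            w[mask] = w[mask] / w[mask].sum()
    return w

# Hypothetical toy propensity scores, for illustration only
d_obs = np.array([1, 1, 0, 0])
p_x = np.array([0.5, 0.5, 0.5, 0.5])
p_qx = np.array([0.8, 0.6, 0.8, 0.4])
w1 = other_pathways_weights(d_obs, p_x, p_qx, d=1)
w1_norm = other_pathways_weights(d_obs, p_x, p_qx, d=1, normalize=True)
```

Note that for \(d=1\) the weight of a treated individual reduces to \(1/\Pr (D=1|X)\), the usual inverse probability weight, so only the controls are re-weighted towards the mediator distribution of the treated.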
The ‘total effect’ of education on the mortality rate, from an IPW estimation in which the mediator is excluded from the propensity score, can be decomposed into an effect of education running through the mediator \(\eta (\cdot )\) and an effect of education running through other pathways \(\theta (\cdot )\) using Assumption 2:
$$\begin{aligned} \frac{\lambda \bigl (t|D=1, Q(1)\bigr )}{\lambda \bigl (t|D=0, Q(0)\bigr )}= & {} \frac{\lambda \bigl (t|D=1, Q(1)\bigr )}{\lambda \bigl (t|D=0, Q(1)\bigr )} \cdot \frac{\lambda \bigl (t|D=0, Q(1)\bigr )}{\lambda \bigl (t|D=0, Q(0)\bigr )} = \exp \Bigl (\theta (1) +\eta (0) \Bigr )\nonumber \\\end{aligned}$$
(6)
$$\begin{aligned}= & {} \frac{\lambda \bigl (t|D=1, Q(1)\bigr )}{\lambda \bigl (t|D=1, Q(0)\bigr )} \cdot \frac{\lambda \bigl (t|D=1, Q(0)\bigr )}{\lambda \bigl (t|D=0, Q(0)\bigr )} = \exp \Bigl ( \eta (1)+ \theta (0) \Bigr )\nonumber \\ \end{aligned}$$
(7)
The effect running through other pathways (holding the mediator constant) can be estimated by solving (3), using W(d) from (5) as weights. The effect running through the mediator can be obtained from the log-difference of the estimated total effect and the estimated effect running through other pathways, using (6) or (7). The first effect represents the effect of education on the mortality hazard while holding cognitive ability constant at the level that would have been realized for the chosen education level d. The second effect represents the effect of education on mortality if one changes cognitive ability from the value that would have been realized for education level 0 to the value that would have been observed for education level 1, while holding the education level at level d.
For estimation, we use normalized versions of the sample weights implied by (5), such that the weights in either the treatment or the control group add up to unity, as advocated earlier. We estimate the additional propensity scores conditional on the pre-treatment covariates and the mediator, \(\Pr (D=1|X_i,Q_i)\), by probit specifications.
A nice feature of Theorem 1 is that it is straightforward to implement and only involves estimation of two propensity scores and plugging them into standard mixed proportional hazard estimation. No parametric restriction is imposed on the model of the mediator. Tchetgen Tchetgen (2013) also defines mediation analysis in (Cox) proportional hazard models. His method, which is also based on proportional decomposition, sequential ignorability, independent censoring and a proportional mediator effect, implies estimating a regression model for the mediator conditional on the treatment and pre-treatment covariates f(Q|D, X), while our method is based on estimating the propensity score (with and without the mediator). In general, it is more difficult to formulate a suitable model for the mediator than for the propensity score. VanderWeele (2011) also derived a mediation estimator for the Cox proportional hazards model. Although his method does not need Assumption 5, a proportional mediator effect, it requires an additional assumption that the outcome is rare over the entire follow-up period.
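The log-difference step implied by (6) can be illustrated with a short Python sketch. The hazard ratios below are invented numbers chosen only to show the arithmetic; they are not estimates from the conscription data, and the function name is hypothetical.

```python
import math

def mediator_log_effect(total_log_effect, other_pathways_log_effect):
    """eta = (log total effect) - theta, as implied by (6) or (7)."""
    return total_log_effect - other_pathways_log_effect

# Hypothetical hazard ratios, for illustration only
total_log_effect = math.log(0.70)  # total effect of moving up one education level
theta_1 = math.log(0.80)           # other pathways effect, mediator held at Q(1)
eta_0 = mediator_log_effect(total_log_effect, theta_1)
mediator_hazard_ratio = math.exp(eta_0)  # equals 0.70 / 0.80
```

With these invented values, a total hazard ratio of 0.70 and an other pathways hazard ratio of 0.80 imply a hazard ratio of 0.875 running through the mediator, because the two effects multiply on the hazard scale and add on the log scale.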