An IPW estimator for mediation effects in hazard models: with an application to schooling, cognitive ability and mortality

Large differences in mortality rates across those with different levels of education are a well-established fact. Cognitive ability may be affected by education so that it becomes a mediating factor in the causal chain. In this paper, we estimate the


Introduction
Traditionally, causal mediation analysis has been formulated within the framework of linear structural models (Baron and Kenny 1986). These models are difficult to extend to inherently nonlinear duration outcomes such as the mixed proportional hazard model. Recent papers have placed causal mediation analysis within the counterfactual/potential outcomes framework (Imai et al. 2010a, b;Huber 2014;Vander Weele 2015) all assuming sequential unconfoundedness. Tchetgen Tchetgen (2013) also introduced a weighting method by for mediation analysis in a Cox proportional hazard model. His method implies estimating a regression model for the mediator conditional on the treatment and pre-treatment covariates, while our method is based on estimating the propensity score (with and without the mediator). In general, it is more difficult to formulate a suitable model for the mediator than for the propensity score.
Our outcome, the age at death is a duration variable and the mortality hazard rate, the instantaneous probability that an individual dies at a certain age conditional on surviving up to that age, is modelled. Accounting for right censoring, when the individual is only known to have survived up to the end of the observation window, and left-truncation, when only those individuals are observed who were alive at a certain time, are easy to handle in hazard models (Van den Berg 2001). A common way to accommodate the presence of observed characteristics is to specify a proportional hazard model, in which the hazard is the product of the baseline hazard, the age dependence, and a log-linear function of covariates. Neglecting confounding in inherently nonlinear models, such as proportional hazard models, leads to biased inference.
Propensity score methods are increasingly used to take account of confounding in observational studies, e.g. see Caliendo and Kopeinig (2008) for a survey. The advantage of the propensity score is that it enables us to summarize the many possible confounding covariates as a single score (Rosenbaum and Rubin 1983). With a duration outcome, right censoring makes inference of differences in means, as is standard in treatment analysis, unreliable. Propensity score methods for hazard models have been introduced for duration data that account for censoring, truncation and dynamic selection issues (Cole and Hernán 2004;Austin 2014). We apply inverse probability weighting (IPW) methods using the propensity score (Hirano et al. 2003), which belongs to the larger class of marginal structural models that account for time-varying confounders when estimating the effect of time-varying covariates (Robins et al. 2000).
Cognitive ability can be considered a principal source of education selection and an endowment that determines success at school. Then, intelligence precedes education in the causal path to health and mortality. However, cognitive ability, at least as measured by standard IQ-tests, is likely to change with the education attained. Recent research (Falch and Massih 2011;Banks and Mazzonna 2012;Schneeweis et al. 2014;Carlsson et al. 2015;Dahmann 2017) has shown that additional education improves cognitive ability. In that case, cognitive ability is a mediator in the causal path from education to health. Ideally, we would have continuous measurement of the (development) of cognitive ability over the life cycle, to account for both the selection and mediation of cognitive ability in the causal path from education to mortality. However, in our data, we only observe cognitive ability at late adolescence when measured intelligence can be either the result of the attained education or a proxy of early childhood intelligence which influences education choice. When cognitive ability is a mediator we can decompose the effect of education on mortality into an effect running through improvement of cognitive ability and an effect through other pathways. An effect of education through improvement of cognitive ability is likely if education raises cognitive ability that aids disease management and in seeking appropriate treatment where necessary. Other possible pathways from education to mortality emerge if higher education leads to improvement in socioeconomic status later in life, such as labour market signals, non-cognitive skills and peer effects, which influence health and mortality.
In our empirical analyses, we use administrative data on Dutch men who were examined for military service in the Netherlands between 1961 and 1965 after completing their secondary schooling. We followed 39,803 men selected from the national birth cohorts [1944][1945][1946][1947]. These examinations are based on yearly listings of all Dutch male citizens aged 18 years in the national population registers. The sampled examination records were linked by Statistics Netherlands to recent national death records (up till the end of 2015). The records include a standardized recording of demographic and socioeconomic characteristics such as education, father's occupation, religion, family size, and birth order, along with a standardized psychometric test battery. The educational level was classified in four categories: primary school, lower vocational education, lower secondary education, and intermediate vocational education, general secondary education, higher non-university and university education.
Under the assumption that cognitive ability is a mediator of the education effect on mortality we also extend the IPW methods to mediation analysis for a (mixed) proportional hazard (MPH) model, the common model for econometric duration analysis. The main methodological contribution of this paper is that we disentangle the total effect of a treatment on a duration into an effect that runs through the mediator and an effect through other pathways. We derive and implement an IPW estimator for such a decomposition of the total effects in MPH models. The estimator identifies causal mechanisms given that a sequential unconfoundedness condition holds. This is a strong assumption and nonrefutable. We therefore carry out a set of sensitivity analyses to quantify the robustness of our empirical findings to violation of the sequential ignorability assumption. We focus, in particular, on how the possibility of selection into education based on cognitive ability may influence our results.
The empirical results show that improving education has hardly any impact on the mortality rate when accounting for cognitive ability. Using the mediation method, we only find a significant effect of education on mortality running through cognitive ability, for the lowest education group that amounts to a 15% reduction in the mortality rate. For the highest education group, we find a significant effect of education on mortality through other pathways of 12%.

The mortality hazard rate
We seek to find the impact of education level on the mortality risk for the men in our sample of conscripts. However, mortality may be influenced by factors that also determine the education choice. This may render education a selective choice and makes it endogenous to mortality later in life. We follow a propensity score method to account for selection on observed characteristics and estimate the effect of education on the mortality rate. Figure 1 provides a graphical illustration of the relationship between cognitive ability, education and mortality later in life using a directed acyclic graph, where each arrow represents a causal path (Pearl 2000(Pearl , 2012. It states that early childhood characteristics X , such as parental background and family size, influence the education choice D, the unmeasured childhood (pre-age 18) factors, U 0 , and the cognitive ability at age 18, Q 18 . The latter is also influenced by other childhood factors, which may include early life cognitive ability, and the education followed up to age 18. In our data, we do not observe these childhood factors (U 0 ).
We define the treatment effect, of moving up one education level, in terms of a proportional change in the (mortality) hazard rate. First, we discuss the assumptions, common in the potential outcomes literature that uses propensity score methods, to identify the impact of education on the mortality risk. In Sect. 2.2, we extend this to decompose the effect of education on the mortality rate into an effect running through improvement in cognitive ability and an effect running through other pathways. The main difference with standard propensity score methods is that we use potential hazard rates, the hazard rate that would be observed if the individual was untreated, λ(t|0), or treated λ(t|1).L e tD i = 1 be the treatment, moving up one education level. We observe pre-treatment (educational level) covariates X that influence the education choice.
Assumption 1 (Unconfoundedness) λ(t|d)⊥ D|X for d = 0, 1 where ⊥ denotes independence. The unconfoundedness assumption (Rubin 1974;Rosenbaum and Rubin 1983) asserts that, conditional on covariates X , treatment assignment (education level) is independent of the potential outcomes. This assumption requires that all variables that affect both the mortality and the education choice are observed. Note that this does not imply that we assume all relevant covariates are observed. Any missing factor is allowed to influence either the outcome or the education choice, not both. We check the robustness of our estimates to this, rather strong, unconfoundedness assumption by assessing to what extent the estimates are robust to violations of this assumption induced by including an additional simulated binary variable to capture unobservables (Nannicini 2007;Ichino et al. 2008).
The overlap, or common support assumption requires that the propensity score, the conditional probability to choose a higher education given covariates X , is bounded away from zero and one. In our data, we distinguish four (ordered) education levels in line with the contemporary Dutch education system (see Sect. 3). By comparing only adjacent education levels, we remove the overlap problems. Rosenbaum and Rubin (1983) show that if the potential outcomes are independent of treatment conditional on covariates X , they are also independent of treatment conditional on the propensity score, p(x) = Pr(D = 1|X = x). Hence if unconfoundedness holds, all biases due to observable covariates can be removed by conditioning on the propensity score (Imbens 2004). The average effects can be estimated by matching or weighting on the propensity score. Here, we use weighting on the propensity score. Inverse probability weighting based on the propensity score creates a pseudopopulation in which the education choice is independent of the measured confounders. The pseudo-population is the result of assigning to each individual a weight that is proportional to the inverse of their propensity score. Inverse probability weighting (IPW) estimation is usually based on normalized weights that add to unity.
In survival analysis, it is standard to compare the (nonparametric) Kaplan-Meier curves for the treated and the controls. The unadjusted survival curves may be misleading due to confounding. Cole and Hernán (2004) describe a method to estimate the IPW adjusted survival curves. Biostatisticians usually focus on Cox regression models and Cole and Hernán (2004) describe how Cox proportional hazard models can be weighted by the inverse propensity score to estimate causal effects of treatments. This method is related to the g-computation algorithm of Robins and Rotnitzky (1992) and Robins et al. (2000).
In economics the interest is often also in the duration dependence of the hazard. The Gompertz hazard, which assumes that the hazard increases exponentially with age, λ 0 (t) = e α 0 +α 1 t , is known to provide accurate mortality hazards (Gavrilov and Gavrilova 1991). However, it is hardly ever possible to include all relevant factors, either because the researcher does not know all the relevant factors or because it is not possible to measure then. Ignoring such unobserved heterogeneity or frailty may have a huge impact on inference in proportional hazard models, see e.g. Van den Berg (2001). A common solution is to use a Mixed Proportional Hazard (MPH) model, in which it is assumed that all unmeasured factors and measurement error can be captured in a multiplicative random term V . The hazard rate becomes The (random) frailty V > 0 is time-invariant and independent of the observed characteristics X and treatment D. Note that independence of V and D is crucial; otherwise, Assumption 1 would be violated. So, we assume that some factors influencing the mortality rate are not observed and that these factors do not influence the education choice. In the empirical application, it is assumed that V has a gamma distribution, a common assumption used in the empirical literature.
To adjust for confounding, we estimate a standard MPH model, that does not include the measured confounders as covariates, using the re-weighted pseudo-population. Fitting a (mixed) proportional hazard model in the pseudo-population is equivalent to fitting a weighted MPH model in the original sample. The parameters of such weighted where θ is the vector of parameters of the hazard in (2), Λ(t|·) = t 0 λ(s|·) ds,t h e integrated hazard and δ indicates whether the duration for individual i is censored δ i = 0 or not. 1

Mediation analysis for the mortality hazard rate
In this section, we discuss a model in which cognitive ability measured at age 18 mediates the impact of education on mortality. Mediation analysis aims to unravel the underlying causal mechanism into an effect running through changes of an intermediate variable, the mediator, and through other pathways. The counterfactual notation for average treatment effects can be extended to define causal mediation (see Huber 2014). We are particularly interested in the mediating effect of cognitive ability on mortality. It has been proven that high levels of cognitive ability is positively associated with high education (Ceci 1991;Hansen et al. 2004). Recent research (Falch and Massih 2011;Banks and Mazzonna 2012;Schneeweis et al. 2014;Carlssonetal.2015;Dahmann 2017) has shown that one additional year of education improves intelligence up to 0.3 standard deviations, both for the US and for some European countries. We use Q i to denote the observed cognitive ability (IQ-score), which is measured around age 18 when the men had their military examination and after they had completed secondary schooling. The mediation model we assume is illustrated by the DAG in Fig. 1.
Traditionally, causal mediation analysis has been formulated with the framework of linear structural models (Baron and Kenny 1986). Recent papers have placed causal mediation analysis within the counterfactual/potential outcomes framework (Imai et al. 2010a, b;Huber 2014). In the previous section, the potential outcome was solely a function of the treatment, e.g. education choice, but in mediation analysis the potential outcomes also depend on the mediator. Because cognitive ability can be affected by the education attained, 2 there exist two potential values, Q i (1) and Q i (0), only one of which will be observed, i.e. Q i = D i · Q i (1) + (1 − D i ) · Q i (0). For example, if individual i actually attained education level 1, we would observe Q i (1) but not Q i (0).Next,weuseλ i t|d, q(d) to denote the potential mortality hazard that would result from education equals d and cognitive ability equals q. For example, in the conscription data, λ i t|1, 110 represents the mortality hazard that would have been observed if individual i had education level 1 and a measured IQ-score of 110. As before, we only observe one of the multiple hazards λ i = λ i t|D i , Q i (D i ) .
Because we base our treatment effect on (mixed) proportional hazard models, it is again natural to define the mediator effects proportionally. Abbring and Berg (2003) also define, in a different setting with a dynamic treatment, a proportional treatment effect for a duration outcome. In other nonlinear settings, such as count data regression, a proportional treatment effect has been defined (Lee and Kobayashi 2001). We define the average effect of other pathways, depending on treatment status d: This framework enables us to disentangle the underlying causal pathway from education to mortality into an effect of education through improvement of cognitive ability and an effect through other pathways. We assume conditional independence (given X ) of the treatment and the mediator: Assumption 3 Sequential ignorablility: {λ(t|d ′ , q), Q(d)}⊥D|X and λ(t|d ′ , q) ⊥Q|D = d, X , ∀d, d ′ = 0, 1 and q in the support of Q.
The first condition of Assumption 3 implies that, conditional on observed covariates X , no unobserved confounder exists that jointly affects the education choice, the cognitive ability and the mortality. The second condition implies that, conditional on observed covariates X and the education attained, no unobserved confounder exists that jointly affects cognitive ability and mortality. This would imply that X explains all the variation in U 0 or that U 0 does not (directly) affect education, the dashed line in Fig. 1. (Huber 2014;Imaietal.2010a) make the same assumptions for identification of the direct and indirect effects in a linear model. Assumption 3 is a strong assumption and nonrefutable. We therefore carry out a set of sensitivity analyses to quantify the robustness of our empirical findings to violation of the sequential ignorability assumption based on an extension of the sensitivity analyses of Nannicini (2007) and Ichino et al. (2008). We focus, in particular, on how the possibility of selection into education based on cognitive ability may influence our results. We also have a common support restriction for the propensity score including the mediator.
In addition, we assume independent censoring 3 and a proportional mediator effect θ(d): Assumption 4 (Independent censoring) Censoring is, conditional on the treatment D, independent of the covariates X , the outcome T and the mediator Q.
This is equivalent to assuming that the effect of the treatment, D, is not moderated by the value of the mediator. Thus, we assume no interaction effect, D · Q, in the hazard. Note that Assumption 5 does not rule out an MPH model. It only assumes that the unobserved heterogeneity is independent of the treatment D (as before) and the mediator Q. This leads to the following identification theorem for the effect of a treatment on the hazard running through other pathways (holding the mediator constant): Theorem 1 (Identification of other pathways effect θ(d)) Under Assumptions 1-5, the other pathways effect is identified through a weighted MPH regression with weights: The 'total effect' of education on the mortality rate, from an IPW estimation in which the mediator is excluded from the propensity score, can be decomposed into an effect of education running through the mediator η(·) and an effect of education running through other pathways θ(·) using assumption 2: The effect running through other pathways (holding the mediator constant) can be estimated solving (3), using W (d) from (5) as weights. The effect running through the mediator can be obtained from the log-difference of the estimated total and the estimated effect running through other pathways, using (6)o r( 7). The first effect represents the effect of education on the mortality hazard while holding cognitive ability constant at the level that would have been realized for chosen education level d. The second effect represents the effect of education on mortality if one changes cognitive ability from the value that would have been realized for education level 0 to the value that would have been observed for education level 1, while holding the education level at level d.
For estimation, we use normalized versions of the sample implied by the weights in (5), such that the weights in either treatment or control groups add up to unity, as advocated earlier. We estimate the additional propensity scores conditional on the pre-treatment covariates and the mediator, Pr(D = 1|X i , Q i ), by probit specifications.
A nice feature of Theorem 1 is that it is straightforward to implement and only involves estimation of two propensity scores and plugging them into standard mixed proportional hazard estimation. No parametric restriction is imposed on the model of the mediator. Tchetgen Tchetgen (2013) also defines mediation analysis in (Cox) proportional hazard models. His method, which is also based on proportional decomposition, sequential ignorability, independent censoring and a proportional mediator effect, implies estimating a regression model for the mediator conditional on the treatment and pre-treatment covariates f (Q|D, X ), while our method is based on estimating the propensity score (with and without the mediator). In general, it is more difficult to formulate a suitable model for the mediator than for the propensity score. Vander Weele (2011) also derived a mediation estimator for the Cox proportional hazards model. Although his method does not need assumption 5, a proportional mediator effect, it requires an additional assumption that the outcome is rare over the entire follow-up period.

3D a t a
Data from a large sample from the nationwide Dutch Military Service Conscription Register for the years 1961-1965 and male birth cohorts 1944-1947 are analysed. All men, except those living in psychiatric institutions or in nursing institutes for the blind or for the deaf-mute, were called to a military service induction exam. The majority attended the conscription examination around age 18. 4 We have information from the military examinations for 45,037 men. The data were described elsewhere, Ekamper et al. (2014), here we provide the main characteristics. These data were linked to the Dutch death register through to the end of 2015 using unique personal identification numbers. Follow-up status was incomplete (due to emigration and other right-censoring events) for 1316 (2.9%) and entirely unknown for 2626 (8.3%) men. 5 The latter were removed from the data. These data allow us to follow a large group of men from age 18 until age 68-72 or until death. At the military examination, a standardized recording of demographic and socioeconomic characteristics such as education, father's occupation, religion, family size, region of birth, and birth order is recorded. We exploit the information on education attained at age 18 and the age at death to investigate the mortality difference while accounting for other factors that influence both educational level and mortality.
The educational level is classified in four categories, 6 (Doornbos and Kromhout 1990): primary school (age 6-12 years); lower vocational education (2 years postprimary school); lower secondary education (4 years post-primary school); and higher education (intermediate vocational education, general secondary education, higher non-university and university education, i.e. at least 6 years post-primary school). For this study, we excluded partly institutionalized conscripts who had attended special schools for those with disabilities or learning difficulties and conscripts who had not completed 6 years of schooling. After exclusion of these 2608 conscripts, 39,803 men remain for analysis.
A standardized psychometric test battery is included: comprising Raven Progressive Matrices, a nonverbal untimed test that requires inductive reasoning about perceptual patterns, the Bennett Mechanical Comprehension test, and tests for Clerical Aptitude, Language Comprehension, Arithmetic and a Global comprehensive score, that combines all five tests. All tests were administered to over 95% of the population who were examined at induction. Scores for all tests were grouped in six levels from 1 (highest) to 6 (lowest). The test scores are highly correlated with Pearson's r values in the range of .63 to .76. Here, we only focus on the scores of the comprehensive test.
Selected demographic and socioeconomic characteristics at the time of military examinations by education level are given in Table 1. First born conscripts tend to have higher education. Father's occupation was classified into five categories: professional and managerial workers; clerical, self-employed and skilled workers; farmers; semiskilled workers including operators, process workers and shop assistants; and labourers and miners. Fathers with unknown occupations were classified separately. Education level is also strongly related to father's occupation; men with the highest education tend to have fathers in professional or managerial occupations. Religion was classified into five categories. The place of birth was categorized in six regions. The combined cognition measure is the Global comprehensive score. Not surprisingly, men with the highest education tend to do best on the comprehensive IQ-test. Our principal measure of health is mortality with ages of death ranging from 18 up to 68-72. The lowest education group has a 70% higher mortality.
The Kaplan-Meier survival curves for the four education categories are shown in Fig. 2 and reflect these mortality differences. Survival increases with the education level and the differences between the education levels increase with age. The curves differ significantly (χ 2 = 180.76 for a log-rank test with 3 degrees of freedom). In subgroup analyses, survival differences comparing adjacent education levels are also statistically significant (χ 2 = 54. 79, 9.97, 29.80). This mortality difference by education is not necessarily due to education per se. It could be that the higher cognitive ability of higher educated people causes the difference. For example, understanding a doctor's advice and adhering to complex treatments may be driven by cognitive ability rather than education. From Table 1, we have seen already that education and IQ are highly correlated. Figure 3 shows that survival also increases with IQ and the differences are statistically significant (χ 2 = 277.72 for a log-rank test with 5 degrees of freedom). For all, except the two lowest, adjacent IQ-levels the differences in the Kaplan-Meier survival curves are significant. Within each education level the Kaplan-Meier curves also differ significantly by IQ-level (not shown here). Next, we investigate the relationship between IQ and educational attainment. The IQ-scores are measured on a six-point ordinal scale. Comparing individuals on the extremes of the education level is not helpful as these individuals differ too much Table 2 Impact of education levels on the mortality rate using a Gompertz-gamma MPH model and its decomposition

Total effect
Other pathways Cognitive ability in many respects. We focus on adjacent education levels only and estimate separate ordered probit models for the IQ-score in relation to the highest education level in each pair and other observed individual characteristics. The results of ordered probit analyses reveal a strong association between education and IQ. 7 Table 2 presents the estimated effect on the mortality hazard of moving up one educational level and its decomposition. We conclude from these analyses that for the lower educated, with only primary education, and for the lower secondary educated obtaining more education reduces their mortality rate (around 25%). Moving from lower vocational education to lower secondary education only reduces the mortality rate by 9%. 8 The last four columns in Table 2 present the decomposition of the effects of education on the mortality rate. The effect of education through other pathways is only significant for the highest education group while holding cognitive ability at the level of those with high education and for the lowest education group while holding cognitive ability at the level of those with primary education. About two-thirds of the mortality reduction for men moving from lower secondary to higher education runs through other pathways, such as, for example, an increase in income. For the lowest education groups, the impact of education on mortality mainly runs through the increase in cognitive ability induced by the additional education. For these men, 90% of the reduction in mortality is explained by the effect running through cognitive ability. Table 3 Double robust estimation of the total effect of education on the mortality rate and its decomposition using an IPW Gompertz-gamma MPH

Total effect
Other pathways Cognitive ability The unadjusted robust estimator includes all the variables used for the propensity score as control variables in the Gompertz-gamma MPH model + p < 0.05; * * p < 0.01

Robustness checks
Throughout, we have assumed that the propensity scores are estimated consistently. Misspecification of the propensity score will generally produce bias. An approach to improve the robustness of the proposed methodology can be obtained using a doubly robust estimator which also includes a regression adjustment. Rotnitzky and Robins (1995) point out that if either the regression adjustment or the propensity score is correctly specified, the resulting estimator will be consistent. Thus, we also estimate doubly robust estimators of the models, including the observed characteristics and the IQ-test both in the propensity score and in the hazard regression, see Table 3. Including regression covariates hardly changes the IPW estimates (compare column 2 of Table 3 and of Table 2). Not surprisingly, including the covariates does change the 'unadjusted' results a little (compare column 1 of Table 3 and of Table 2). The individuals who were removed from the analysis, because their survival status is unknown, may be a selective sample, see Table 8 in "Appendix B". To account for possible sample selection bias, we estimated the propensity score of an individual being removed, using a probit model for each level of education separately. 9 Based on this probability of removal, we impose additional weighting of all observations in our estimation sample using the inverse of the probability of inclusion in the sample and we re-estimate the total effect and its decomposition. The results after imposing this additional weighting show very little difference from the original results, see Table 11 in "Appendix B".
Another issue is that childhood health problems may influence both education choice and mortality later in life. We perform a robustness analysis by adding health indicators to the educational propensity score. Our data are limited and only include health measurements at the military examination, so these can only be used to proxy childhood health. We used indicators for height < 170cm; height > 185cm; overweight(bmi > 25), poor general health; poor hearing; poor sight and poor psy- Table 4 Impact of education on the mortality rate and its decomposition using an IPW Gompertz-gamma MPH, including health at age 18 indicators in the propensity score

Total
Other pathways Cognitive ability chological assessment, and re-estimated the propensity scores, both without IQ, to estimate the total effect of education and, with IQ-measurements to decompose the total effect into an effect running through changes in cognitive ability and an effect running through other pathways. The estimated impact of education on mortality changes slightly when accounting for health problems, see Table 4, but only for the lowest education group.

Sensitivity analyses
The critical assumption in propensity score weighting is that of no selection on unobservables. To test the sensitivity of the estimates to the unconfoundedness assumption, we build on the sensitivity analyses of Nannicini (2007) and Ichino et al. (2008). We extend these analyses to the mixed proportional hazard model. The Ichino et al. (2008) sensitivity analysis assumes that the possible unobserved confounding factors can be summarized in a binary variable, U , and that the unconfoundedness assumption holds conditional on X and U , i.e. λ(t|0) ⊥ D|X, U . Given the values of the probabilities that characterize the distribution of U , we can simulate a value of the unobserved confounding factor for each individual and re-estimate the IPW-MPH. The probabilities of the distribution of U depend on the value of the treatment and the outcome. The Ichino et al. (2008) sensitivity analysis assumes that the potential outcomes are binary, but Nannicini (2007) shows how to extend this to continuous outcomes by imposing a binary transformation. In survival analysis, we have a natural binary transformation, the censoring indicator δ i = 1 if individual i is still alive at the end of the observation period. Then, the distribution of the unobserved binary confounding factor U can be characterized by specifying the probabilities in each of the four groups.
for i, j = 0, 1. A measure of how the different configurations of p ij , chosen to simulate U , translate into associations of U with the outcome is ω, the coefficient of U in a MPH model for the control group (D = 0) using U and X as covariates. Ichino et al. (2008) call this (exponentiated) coefficient the 'outcome effect'. A measure of the effect of U on the relative probability to be assigned to the treatment is ξ , with ξ the coefficient of U in a logit model on the treatment assignment (D = 1) using U and X as covariates. Ichino et al. (2008) call this (exponentiated) coefficient the 'selection effect'.
For identification of the mediation effects, we also impose sequential ignorability (Assumption 2). We therefore also assume that conditional on the binary (unobserved) factor the following two conditions hold (i) {λ(t|d ′ , m), Q(d)}⊥D|X, U and (ii) λ(t|d ′ , q)⊥Q|D = d, X, U for ∀d, d ′ = 0, 1 and q in the support of Q.An e w measure, the mediator effect, is ψ, the coefficient of U in an ordered logit model on the IQ-test values for the control group using U and X as covariates.
The probability values of the distribution for U are chosen so that they mimic the distribution for each included binary variable. For example, consider the probability that an individual in the lowest education group (primary and lower vocational education) is catholic. Then, p 00 is this probability for catholics with primary education who died before the end of the observation period, p 01 is the probability for catholics with primary education who survived till the end, p 10 is the probability for catholics with lower vocational education who died before the end, and p 11 is the probability for catholics with lower vocational education who survived till the end. For each probability configuration of U , we repeat the simulation of U , the estimation of the outcome effect, the selection effect and the IPW-MPH treatment effects M = 100 times and obtain the average of these 100 simulations. The total variance of these averages can be estimated from (see Ichino et al. 2008): where f ∈{ ω, ξ} of each pairwise education comparison,f m is the estimated f in each simulation sample m and s 2 m is its estimated variance. Next, we re-estimate the total effect of education on mortality using an IPW Gompertz-gamma MPH model including U in the propensity score and the decomposition of the effect using an IPW Gompertz-gamma MPH including U and the IQ-measurements in the propensity score.
An issue with our empirical application is that early childhood IQ (one of the possible factors of U 0 in Fig. 1) might be a selection variable, explaining selection into education (rather than a mediation variable). 10 We, therefore, focus on the results of the sensitivity analysis when assuming U mimics the observed distribution of the IQmeasurements, i.e. the observed education choice and censoring probability are equal to the observed education choice and censoring prevalence for individuals with a given IQ level. We find the largest outcome, selection and mediation effects when the distri-bution of U mimics the impact of IQ on education and censoring. 11 Table 5 reports the simulated total effect and its decomposition into an effect running through cognitive ability and an effect running through other pathways including U in the IPW that mimics the distribution of the education choice and mortality for each IQ-level. 12 We find the largest changes in our IPW estimates when U mimics the education-mortality distribution of those with the highest IQ-level. These differences are, however, not statistically significant.
Next, we search for the existence of 'killer'-confounders, i.e. the existence of a set of probabilities p ij such that if U were observed, the estimated effects would be driven to zero. The reason for doing this is to assess the plausibility of the resulting configuration of U and how comparable this is to the distribution of observed confounders. In order to reduce the dimensionality of the characterization of the 'killer'-confounders we follow the suggestion of Nannicini (2007) and fix the probability of Pr(U = 1) to 0.4 and the difference p 11 − p 10 to zero. Now, the simulated confounders U can be fully described by two differences d = p 01 − p 00 and s = p 1. − p 0. , with p i. = Pr(U = 1|D = i) = p i1 ·Pr(δ 1 = 1|D = i)+ p i0 ·Pr(δ 1 = 0|D = i) for i = 0, 1, the fraction of individuals with U = 1 by education level. Nannicini (2007) argues that d is an (inconsistent) measure of the effect of U on the outcome (mortality, censoring probability) for the untreated (lower education level), while s is an (inconsistent) measure of the selection into treatment (higher education level). Both d and s are inconsistent measures because they do not account for the association between U and W , while our outcome effect, Ω, selection effects, ξ and mediation effects ψ, account for this. Table 6 reports the simulated total effect and its decomposition when the distribution of U is defined by d, s with d, s = 0.1,...,0.5. 13 Indeed, by using these 'killer'confounders we do find some large deviations from the original results for the impact of moving from primary to lower vocational education, while the estimates for higher levels of education remain remarkably stable. However, these differences apply for combinations of d and s that lie well away from the values implied by our observed confounders. Note that the largest values for d and s we found when the distribution of U mimics the education-censoring distribution of the observed variables was d = 0.03 and s = 0.06 when using the education-censoring distribution of the highest IQ-level.

Implied gain in life expectancy
From the Gompertz-hazards, we can estimate the median survival age of the recruits and their post-18 life expectancy. The median survival age is the age at which half of the people have died (conditional on survival up to age 18). Assuming that the estimated Gompertz hazard holds, the life expectancy at age t 0 = 18 can be very well approximated by (see Lenart 2014): 11 The results can be found in Table 12 in "Appendix B". 12 The results when U is based on the distribution of eduction choice and mortality for the other included variables can be found in Tables 13 and 14 in "Appendix B". 13 The simulated outcome, selection and mediation effects can be found in Table 15 in "Appendix B". Based on adding U to propensity score with probabilities of U from observed probabilities for each IQ-value + p < 0.05; * * p < 0.01 LE(t 0 ) =−exp e α 0 +α 1 t 0 (α 0 − ln(α 1 ) + α 1 t 0 + 0.5772)/α 1 where 0.5772 is the Euler constant. For the unadjusted Gompertz model, the estimated remaining life expectancies are 59.8 (primary); 62.6 (lower vocational); 63.7 (lower secondary) (64.2 based on last two education groups); and 66.7 (higher), leading to educational gains of 2.8, 1.0 and 2.5 in life expectancy. The median survival ages are 80.1 (primary); 82.9 (lower vocational); 84.1 (lower secondary) (84.6) and 87.1 (higher), thus leading to the same educational gains. In Table 7, we report the gains in life expectancy. The lower panel of Table 7 reports the gains in life expectancy based on the mediation analysis and decomposes the effects of education into an effect running through cognitive ability and an effect running through other pathways. Based on the IPW estimates, we can conclude that if an individual had improved his education from primary to lower vocational he would have gained 2.5 additional years (and his median age also would have improved by 2.5 years), of which 1.8 years are attributable to cognitive ability and 0.7 years to other changes induced by other pathways. If an individual had improved from lower vocational to lower secondary, the gain in life expectancy is 1.0 year (1.1 attributable to cognitive ability and an negative impact of other pathways). The gain in life expectancy if an individual had improved his education from lower secondary to higher education is 2.2 years. For those who attained higher education, this gain in life expectancy is mainly attributable to the other pathways (1.2 years), while for those with lower     Based on adding U to propensity score under the assumption that Pr(U = 1) = 0.4and p 11 − p 10 = 0, the differences d = p 01 − p 00 and s = p 1. − p 0. . + p < 0.05; * * p < 0.01

Discussion
A large literature documents that higher levels of education are positively associated with a longer life. Possible mechanisms include occupational risks, health behaviour, the ability to process information and cognitive ability (Cutler and Lleras-Muney 2008). It is commonly acknowledged that education and cognitive ability are correlated. Cognitive ability may cause differences in educational outcomes, or education may cause differences in cognitive ability. Most of the economics literature on the causal effect of education on health focuses on accounting for endogenous selection into education due to confounding factors, such as cognitive ability, either by exploiting natural experiments in education due to changes in compulsory schooling laws (Mazumder 2012) or by defining a structural model (Conti et al. 2010;Bijwaard et al. 2015). The estimates based on natural experiments find little to no effect of education on health, while the studies based on structural models find that around half of the difference in health by education is due to selection. An alternative perspective is that cognitive ability is part of the causal pathway from education to mortality. For instance, it has been proven that high scores on intelligence tests are positively associated with schooling level (Ceci 1991;Hansen et al. 2004;Carlsson et al. 2015). We assume that IQ measured at age 18 is affected by educational attainment and has a mediating effect on the mortality difference across education groups. We developed an inverse probability weighting (IPW) method for hazard models to estimate the impact of education on the mortality rate. We use conscription data of Dutch men born between 1944 and 1947 who were examined for military service between 1961 and 1965, and linked to national death records, in which we identified four education groups. Using the IPW methods, we estimate, for each adjacent education group, the impact of improving education on the mortality risk. We decompose the impact of education into an effect running through cognitive ability and an effect running through other pathways.
The results show that controlling for cognitive ability, as a mediating factor, leaves only limited evidence of an educational gain in mortality. When accounting for cognitive ability as a mediator in the causal pathway from education to mortality, we find that cognitive ability plays an important role in explaining the educational gradient. For men with primary school, we find that the effect of education running through an increase in cognitive ability significantly reduces the mortality risk (about 15% reduction in the mortality rate), which is equivalent to 1.6 years longer life expectancy. For the middle group, with lower vocational education, cognitive ability explains most of the educational gradient of a 10% reduction in the mortality rate. For the highest education group, only the effect of education running through other pathways, such as income effects of education, is significant (about 13% reduction in the mortality rate), leading to 1.3 additional years of life.
A limitation of our data, based on military entrance examination, is that we only observe men and no information on women is available. Bijwaard et al. (2015) found that educational gains for women appear to be higher than for men, in spite of the higher survival difference of women with lower versus higher education. These findings are based on much smaller numbers than the current study, however, and therefore need to be interpreted with caution. Another issue is that in the 1960s a major change occurred in the education system in the Netherlands and some of the specific education strata in this study no longer exist. In addition, the percentage of people with more than 6 years of post-primary school education is currently much higher compared to the past. These changes are not likely to affect our general conclusion that increased education only has a small effect on survival, but further long term studies will be needed to quantify these effects for contemporary school types. The issue of reverse causality that early childhood health affects educational attainment might distort our analyses (Case et al. 2005;Currie2009). We have limited information about childhood health status. However, adding health measurements from the military examination to the propensity score did not change our conclusion that education has only limited impact on mortality.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Appendices Appendix A: Counting processes and proofs
To prove the consistency and the properties of our estimation strategy, we rely on counting process theory for duration models. In a Mixed Proportional Hazard (MPH) model, the waiting time to some event T has a conditional distribution given observed X , treatment D, mediator Q and unobserved heterogeneity V with hazard rate The cdf and pdf of the distribution of the duration T can be expressed as functions of the hazard rate. The counting process approach has increasingly become the standard framework for analysing duration data, and Andersen et al. (1993) have provided an excellent survey of this literature. Less technical surveys have been given by Klein and Moeschberger (1997), Therneau and Grambsch (2000) and Aalen et al. (2009). The main advantage of this framework is that it allows us to express the duration distribution as a regression model with an error term that is a martingale difference. Regression models with martingale difference errors are the basis for inference in timeseries models with dependent observations. Hence, it is not surprising that inference is much simplified by using a similar representation in duration models.
To start the discussion, we first introduce some notation. A counting process {N (t); t ≥ 0} is a stochastic process describing the number of events in the interval [0, t] as time proceeds. The process contains only jumps of size +1. For single duration data, the event can only occur once, because the units are observed until the event occurs. Therefore, we introduce the observation indicator Y (t) = I (T ≥ t) that equal to 1 if the unit is under observation at time t and zero after the event has occurred. The counting process is governed by its random intensity process Y (t)λ(t), with λ(t) is the hazard in (A.1). If we consider a small interval (t − dt, t] of length dt, then Y (t)λ(t) is the conditional probability that the increment dN (t) = N (t) − N (t − dt) jumps in that interval given all that has happened until just before t. By specifying the intensity as the product of this observation indicator and the hazard rate, we effectively limit the number of occurrences of the event to one. It is essential that the observation indicator only depends on events up to time t.
Usually, some of the observations are right-censoredT = min(T, C r ). By defining the observation indicator as the product of the indicator I (t ≤ T ) and, if necessary, an indicator of the observation plan, we capture when a unit is at risk for the event. A related concept is left-truncation. Left truncation occurs when individuals are only observed conditional on survival up till some duration C l , the age of military examination in our application. In the case of right censoring and left-truncation, the at-risk indicator: Y (t) = I (t ≤ T )I (t ≤ C r )I (t ≥ C l ). We assume that C r , C l and T are conditionally independent given X . The history up to t, Y (t) is assumed to be a left continuous function of t. The history of the whole process also includes the (history of the) the covariates, treatment and mediator. Thus , which can be considered as a regression model with an additive error that is a martingale difference. This equation resembles a time-series regression model. The Doob-Meier decomposition is the key to the derivation of the distribution of the estimators, because the asymptotic behaviour of partial sums of martingales is well known. We begin with a proof of the unbiased of the inverse-probability-weighted Gompertz PH-model estimator given in Eq. (3). This applies for the model in which the hazard function does not include a mediator Q.

Proof of Eq. (3): IPW Gompertz is unbiased
In a parametric PH model, the log-likelihood in counting process notation is (Andersen and Borgan 1985): where λ 0 (t; α) is the baseline hazard with parameters α, e.g. for a Gompertz baseline hazard λ 0 (t; α) = e α 0 +α 1 t . Standard maximum likelihood estimation solves the roots of the derivatives of the log-likelihood: with θ = (α, γ ) ′ and L α (θ ) and L γ (θ ) are the gradients of the log-likelihood w.r.t. α and γ . The IPW version includes the weights W in equation (A.7) and (A.8). Because our main parameter of interest is γ , we only focus on L γ (θ ). First, we derive E WDdN (t) . Redefine the propensity score p(d) = Pr(D i = d|X i ), with d = 0, 1. Note that the integral of the sum is equal to the sum of the integrals. First, we derive E[WDdN (t)] and E[ Y (t)λ 0 (t; α)e γ D WD].
Thus, if we assume the right parametric model this implies that L γ (θ ) has zero mean.

Proof of Eq. (3): IPW Gamma-Gompertz is unbiased
In a MPH model with a parametric baseline hazard and a unit-mean Gamma-distributed unobserved heterogeneity with variance σ 2 , the (unconditional) hazard is: λ(t|D) = λ 0 (t; α)e γ D 1 + σ 2 t 0 λ 0 (s; α)e γ D ds and the likelihood (in counting process notation) is: IPW solves the roots of the weighted derivatives of the log-likelihood. The weighted derivative w.r.t. γ is: To prove (3), we use similar reasoning as above. First, we derive Thus, if we assume the right parametric model for the baseline hazard and a Gamma distribution for the unobserved heterogeneity, (A.13) has mean zero. 15 We now turn our attention to specifications that include a mediator Q and provide a proof for Theorem 1 on the identification of the decomposition.