Estimating heterogeneous policy impacts using causal machine learning: a case study of health insurance reform in Indonesia

Policymakers seeking to target health policies efficiently towards specific population groups need to know which individuals stand to benefit the most from each of these policies. While traditional approaches for subgroup analyses are constrained to only consider a small number of pre-defined subgroups, recently proposed causal machine learning (CML) approaches help explore treatment-effect heterogeneity in a more flexible yet principled way. Causal forests use a generalisation of the random forest algorithm to estimate heterogenous treatment effects both at the individual and the subgroup level. Our paper aims to explore this approach in the setting of health policy evaluation with strong observed confounding, applied specifically to the context of mothers’ health insurance enrolment in Indonesia. Comparing two health insurance schemes (subsidised and contributory) against no insurance, we find beneficial average impacts of enrolment in contributory health insurance on maternal health care utilisation and infant mortality, but no impact of subsidised health insurance. The causal forest algorithm identified significant heterogeneity in the impacts of contributory insurance, not just along socioeconomic variables that we pre-specified (indicating higher benefits for poorer, less educated, and rural women), but also according to some other characteristics not foreseen prior to the analysis, suggesting in particular important geographical impact heterogeneity. Our study demonstrates the power of CML approaches to uncover unexpected heterogeneity in policy impacts. The findings from our evaluation of past health insurance expansions can potentially guide the re-design of the eligibility criteria for subsidised health insurance in Indonesia.


Introduction
Policymakers around the world are implementing health system policies to promote access to essential health care and to meet the health-related Sustainable Development Goals (Sachs 2012). Under limited budgets, maximising the impact of these policies on population health and health inequalities ideally requires evidence that explicitly acknowledges, and captures, how policy impacts might vary within a given population-i.e. the potential treatment effect heterogeneity. A key focus for health policiesand for the present paper-is heterogeneity in terms of observed effect modifiers, i.e. measured covariates that can modify the causal effect of a policy. In the typical setting where health policy impact evaluation is based on observational data, some evidence about effect modifiers can be obtained from subgroup analyses, by comparing the effects of interventions across different population groups, characterised, for instance, by their socio-economic status (Mackenbach 2003). However, impact evaluations of health policies tend not to present such comparisons, due to concerns that subgroup analysis, unless pre-specified, may produce spurious findings (Petticrew et al. 2012). Even when a treatment effect heterogeneity estimation is implemented, it typically involves including ad-hoc interaction terms in the models, thus necessitating parametric assumptions that are unlikely to hold (Hainmueller and Mummolo 2019).
Machine learning (ML) approaches are increasingly proposed as a way to pre-empt the criticism of arbitrariness, by estimating treatment effect heterogeneity based on a systematic exploration of the data (VanderWeele et al. 2019). These methods build on the growing body of methodological literature of 'causal machine learning' where supervised machine learning is used to help estimate causal parameters of interest (van der Laan and Rose 2011; Chernozhukov et al. 2018a). Some approaches aim to identify groups of individuals that have benefited most and least in terms of treatment effects (e.g. Strauss 2011, Chernozkhukov et al. 2018b), while others aim to flexibly capture how treatment effects vary according to observed covariates, by estimating a so-called 'conditional treatment effect function' (CATE) in a data-adaptive manner (e.g. Kunzel et al. 2019, Athey et al. 2019a, Fan et al. 2020. What is common to these approaches is that they combine the flexibility of ML methods with the rigour of semiparametric statistical theory, resulting in valid inferences after data-adaptive estimation.
In this paper, we focus on a particular method, the so-called 'causal forests' , Athey et al. 2019a, Nie and Wager 2021, to estimate individual treatment effects which can be aggregated to provide average causal treatment effects estimates for subgroups of interest. The first major benefit of this approach compared to traditional methods is in that researchers do not need to specify subgroups for stratified analysis or impose parametric assumptions about interactions. The second advantage is that it provides statistical tests to assess whether there is significant heterogeneity in the treatment effects that is explained by observed covariates, along with offering an indication of the variables that are most strongly associated with this heterogeneity.
Previous studies have demonstrated the good performance of the causal forest estimator and its modifications in simulation studies (e.g. Lechner 2018, Knaus et al. 2021, Nie and Wager 2021, Fan et al. 2020. A growing number of studies have been using causal forests for programme evaluation, typically in labour economics (Davis andHeller 2017, Knaus et al. 2020), but also for the evaluation of health interventions that have been randomised (e.g. Scarpa et al. 2019). To our knowledge, the approach has not been applied in the setting of health policy evaluation where there is no randomisation and hence, statistical approaches are required to account for confounding.
In this paper, we demonstrate that the causal forests approach can provide beneficial information for health policymaking decisions aimed at improving overall health and reducing health inequalities. Specifically, we apply causal forests to explore effect heterogeneity in two types of public health insurance programmes in Indonesia: subsidised health insurance targeting the poor and the near-poor, and contributory health insurance for employees of the formal sector. We use the sequential implementation of health insurance that preceded the establishment of the unified National Health Insurance programme (Jamima Kesehatan National (JKN)) in 2014 as a natural experiment to investigate how changes in health insurance status have influenced health outcomes and health care utilisation. We first focus on infant mortality as the health outcome of interest, which is arguably quite sensitive to changes in access to health care services (Currie andGruber 1996, Dow andSchmeer 2003). We would expect health insurance to reduce infant mortality through its effects on health care utilisation by insured mothers, as births attended by a skilled health professional have been found to be a predictor of reduced infant mortality in the neonatal stage (Lawn et al. 2005). We first estimate average causal treatment effects for these outcomes, using data from the Indonesian Family Life Survey (IFLS) (Strauss et al. 2004(Strauss et al. , 2009(Strauss et al. , 2016, a rich and high-quality longitudinal survey of Indonesian individuals and households that allows controlling for observed confounders of the causal relationship between health insurance as the 'treatment' and health care utilisation and health outcomes as the effects. In light of the notable geographical, ethnic, and economic disparities within Indonesia, it is expected that average policy effects mask important within-country heterogeneity in the effect of health insurance programmes. For the optimal targeting of interventions, health policymakers need to know how the impact of health insurance varies across different subgroups, in particular for those groups most vulnerable in terms of disease burden and access to health care (Lagomarsino et al. 2012): mothers with low education, those in the bottom socioeconomic quintiles, and those living in remote, rural communities. We therefore also estimate individual treatment effects using the causal forests approach and aggregate these to estimate subgroup-average treatment effects. We do so both for ex-ante specified subgroups following traditional practice, and via a data-driven ML approach that characterises those variables most associated with heterogeneity. This paper makes two main contributions. First, by evaluating the impact of health insurance on health care utilisation, we demonstrate the value of using novel causal ML methods for health policy evaluation. In particular, this is the first study that uses the causal forest approach in the context of a health policy evaluation to characterise the drivers of treatment heterogeneity. We highlight the challenges of conducting such evaluations, specifically the need to account for confounding due to observed covariates that affect individual participation in the health policy and outcomes, under the strong assumption of no unobserved confounding. Second, we offer novel and policy-relevant empirical evidence on the population-level effects of health insurance coverage, by characterising the heterogeneous impacts of public health insurance expansions on maternal health care utilisation and infant health, focusing on the specific context of Indonesia.
In the following sections, we first present the institutional setting in Indonesia (Sect. 2.1), briefly review the literature on the impact evaluations of health insurance (Sect. 2.2), and present the data used in the study (2.3). Then we describe the methods (Sect. 3), with a dual focus on the theory and practical implementation of causal forests. The results are presented in Sect. 4, followed by a discussion of the findings and future avenues of research (Sect. 5). The online appendices include additional Tables (Appendix  B) and Figures (Appendix C), as well as software code (Appendix D) to implement the causal forests analysis presented.

Institutional setting
With an estimated population of over 270 million in 2019, 1 Indonesia is the fourth most populous country in the world. Total health spending was 3.1% of gross domestic product (GDP) in 2016 (World health statistics 2019), with a relatively small share of total health expenditures being publicly funded (39%) (Mahendradhata et al. 2017 Historically, health insurance in Indonesia was available as contributory schemes, known as Askes and Jamsostek, for those employed in the formal sector and their family dependants (Achadi et al. 2014). 2 For poor households that were not eligible for these health insurance programmes, from 1994 a Health Card programme provided free basic health care at public health facilities (Johar 2009). The Askeskin scheme, established in 2005, was the first national, subsidised health insurance programme, basing eligibility on a combination of geographic and individual-level criteria (Sparrow et al. 2013). The insurance scheme covered a comprehensive package of health services (outpatient care and inpatient care, mobile health services, immunisation and medications), with the premium fully subsidised by central government (Sparrow et al. 2013). The scheme left a large group of households without health coverage, i.e. those not poor enough to be eligible but also not having access to contributory health insurance in the formal sector. In 2008 it was re-organised, and the resulting Jamkesmas scheme expanded the eligible population, targeting the poor and 'near poor', based on a combination of means testing (using 14 assets recorded in a National Poverty Census Survey indicators) and local government eligibility criteria (Harimurti, et al. 2013). However, not all households eligible for the programme possessed a membership card due to perceived stigmatisation from health care providers and concerns about long waiting times (Harimurti et al. 2013). Despite the means testing, a significant "leakage" occurred, resulting in households in higher income 1 3 quantiles also receiving free health insurance (Harimurti, et al. 2013). While in principle, Jamkesmas provided a comprehensive package, in reality, the availability of services was limited, especially in rural areas, thereby contributing to large geographic inequalities in access (Harimurti, et al. 2013). To compensate for the large gaps in insurance status, district governments provided decentralised health care financing schemes offering subsidised health insurance, known as Jamkesda (Sparrow et al. 2017). 3 In 2014, all health insurance schemes were absorbed into a single national health insurance scheme, JKN, aimed at continuing to expand health insurance coverage to the total population, with the original stated objective to achieve universal health coverage by 2019.

Related literature on the impact of health insurance on health outcomes and utilisation
Evidence that health care utilisation increases as a result of providing health insurance is growing (e.g. Trujillo et al. 2010, Yilma et al. 2015. While country-level analyses have found that increasing health coverage through national-level health spending is beneficial for health, particularly within a system of risk-pooling (Moreno-Serra and Smith 2015), evaluations using less aggregated (subnational or individual level data) provide more mixed findings (Acharya et al. 2013, Erlangga et al. 2019a). Few studies found conclusive evidence of a health-improving impact (Wang et al. 2009, Mensah et al., 2010, with some finding either no evidence of a positive impact (e.g. Dow andSchmeer 2003, Chen andJin 2012), or even adverse impacts (Fink et al. 2013). For Indonesia, quantitative impact evaluations of the different stages of health insurance expansions also reveal a mixed picture. Johar (2009) finds no evidence that the Health Card programme increased health care utilisation among the poor, and attributes this finding to inelastic demand amongst the recipients. Evaluations of the Askeskin programme found some increase in financial protection (Aji et al. 2013), but only a modest impact on health care utilisation among the beneficiaries (Sparrow et al. 2013). An evaluation of the early implementation of the JKN programme (between 2007 and 2014) found that while contributory health insurance increased both inpatient and outpatient utilisation, subsidised health insurance only increased inpatient utilisation, and to a smaller extent (Erlangga et. al. 2019b). There are various reasons why impact evaluations of health insurance expansions may not always demonstrate measurable improvements in health outcomes. First, establishing the causal effect of health insurance programmes is challenging due to confounding: those receiving health insurance programmes are systematically different from those not receiving it. Correcting for such confounding requires either exploiting 'natural experiments' through quasi-experimental econometric techniques (Wagstaff 2010) or measuring enough variables to adjust for these differences. In addition, the availability of health insurance may affect specific sub-populations differently. Understanding this differential impact is crucial to inform health policymaking in Indonesia, where a large segment of the population remains uninsured. Erlangga et al. (2019b) looked at different impacts by subgroups, and found that the lowest income quintiles did not benefit from improved in-patient utilisation, with no effects in areas with low density of healthcare facilities. Anindya et al (2020) identified significant impacts of the JKN programme on maternal health care utilisation (skilled birth attendance, institutional deliver, antenatal care visits), and found that mothers from lower socioeconomic quintiles and more deprived regions benefitted more from health insurance.

Data
The IFLS household dataset includes respondents living in 13 out of the 27 Indonesian provinces, initially using the sampling frame of the 1993 national household socioeconomic survey (Survei Sosial Ekonomi Nasional-Susenas) from the Central Bureau of Statistics. 4 The first round was in 1993 (IFLS1), covering 7,224 households. Subsequent rounds were conducted with the same respondents and their new household members in 1997 (IFLS2), late 1998 (IFLS2 + with a 25% subsample), 2000 (IFLS3), 2007/2008(IFLS4) and 2014. In order to exploit temporal variation in the availability of the health insurance schemes, we use the IFLS waves which were collected in the pre-Askeskin period (IFLS 3), in the pre-Jamkesmas period (IFLS4), and in the post-Jamkesmas period, covering the start of the JKN programme up to 2015 (IFLS5). We refer the reader to Appendix B Table 1 for the links between the various policy reforms and survey waves, and the construction of the analytical dataset, which we describe in some detail below. We constructed a birth-level dataset using the complete pregnancy histories available for women aged 15 to 49, including the date of birth of each child, whether the child is still alive, and if not, the age at death. Restricting the recall period to 6 years to minimise recall bias, we collated births between 2002 and 2007 from IFLS4, and between 2008 and 2014 from IFLS5. We use child birth as the unit of analysis throughout the paper.
We defined two treatment groups and a control group by assessing the mothers' insurance status and the type of health insurance in the year of each child birth. The first treated group, referred to as subsidised insurance, consists of births where in the year of the child birth, a mother reported enrolment in one of the following insurance schemes: Health card (2002)(2003)(2004)(2005)(2006)(2007), Askeskin (2005)(2006)(2007), Jamkesmas (2008( -2014( ), Jamkesda (2008( -2014 or JKN (2014), where the years in parentheses indicate the years that a given type of insurance appears in our data sets. The second treatment group-contributory insurance-was defined as births where, in the year of the child birth, a mother reported enrolment in the Askes or Jamsostek or other employer provided insurance programmes, and these insurance types can be found throughout the study years. 5 Finally, 'uninsured' is defined as a birth where the mother has not reported any subsidised or contributory insurance in the year of the given child birth. Uninsured births, again, can be found throughout the study period. Those births for which a mother reports having both subsidised and contributory insurance 1 3 were excluded from the analysis, 6 as such double insurance, while it did occur in practice, was not formally allowed. 7 We emphasise that we do not treat this dataset as longitudinal in the subsequent statistical analysis as individual births are the unit of analysis. We do allow both for women who only had one birth over the study period and for women who had repeated births, and allow these births to enter any of the three groups. We also use the time dimension of our data in several ways. First, due to the nature of the dataset, we are able to tell whether in the year of a given child birth, a woman is reported to be insured or not, and with which insurance type. Second, over time, there is an increasing number of insured women and births, due to the health insurance expansion. Third, we use year (birth cohort) dummies to allow for unobserved common time trends, and province fixed effects to allow for unobserved province level factors (see below).
We linked outcome information on births to insurance information of the mother, as well as her demographic information, her household and community. In line with conventions, the death of a child is classified as "infant death" if the death occurred before the first birthday. Our "skilled birth attendance" variable indicates whether the birth has been attended by either a midwife or a doctor, regardless of place of delivery (both in and out of hospital).
A common concern in the evaluation of health insurance programmes is that individuals self-select into health insurance, based on potential gains unobserved by the researcher (Currie andGruber 1996, Wagstaff 2010). In the Indonesian setting, this problem might manifest in two ways. First, the eligibility assessment was complex, based on geographical and household level criteria that may not be fully captured by the information in our dataset. Second, insurance take-up was ultimately voluntary, leaving the possibility that those who are somewhat better off were less likely to opt for subsidised insurance, due to the perceived stigma and potentially lower quality of services, compared to those obtained in the private sector. To take this into account, we exploited the variation in the expansion of both health insurance schemes, across provinces and over time. Seeking to minimise bias due to selection on unobserved factors that could compromise comparisons of outcomes between insured and uninsured individuals, we adjust for a rich set of household-, individual-and community-level characteristics to approximate the institutional eligibility rules and process of selection into the health insurance schemes. We also control for province-specific effects that capture unobserved confounding factors that are common within provinces and time-invariant (see Sect. 3 for the formal assumptions).
When selecting variables to control for confounding, we focussed on the characteristics of the mothers, households and communities, which contribute to the eligibility and enrolment in the health insurance schemes, and which are considered to also be independently associated with health care utilisation or infant mortality. 8 Following previous studies (Dow andSchmeer 2003, Shrestha 2010), we included mother's education (categorised as primary, senior, secondary, and university), mother's literacy (ability to write a letter in Indonesian), age at birth, sex of the child, birth order of the child, and whether a household was urban or rural. To capture the means-testing eligibility criteria of the subsidised health insurance programmes (Johar 2009), we construct an asset index (O'Donnell et al. 2008), using principal component analysis (PCA) to classify households into wealth quintiles based on asset ownership and household characteristics (see Appendix A for specific variables used in the PCA). We also created a binary variable from the self-reported health of the mother (1 if good or excellent, 0 otherwise). Further indicators of socioeconomic deprivation are considered, in particular we capture participation in three major social assistance programs: a subsidised rice ("Raskin") programme, an unconditional cash transfer programme, and a "poor card" programme. We also added a variable capturing whether the household had been seriously affected by a natural disaster in the preceding five years. We also capture whether community members have access to a village midwife, a birth clinic, a hospital, a public health centre or private health care providers. 9 Indicators for province of residence for the mother, at the time of the survey are also included, in an attempt to control for unobserved confounding at the province level (e.g. in terms of worse access to health care in turn leading to worse outcomes). A year of birth variable seeks to control for time trends affecting changes in infant mortality (e.g. technological innovations in neonatal intensive care), that may have coincided with the gradual expansion of health insurance. For the pre-specified subgroup analysis, we selected three widely used socioeconomic proxies to be able to assess the impact of insurance for those most vulnerable in terms of disease burden and access to health care (Lagomarsino et al. 2012): mothers with low education, those in the bottom socioeconomic quintiles, and those living in remote, rural communities.

Notation and estimands
We are interested in estimating causal effects of a mother being enrolled in one of two health insurance types (subsidised or contributory) versus no health insurance, on one health outcome (infant mortality) and one health care utilisation outcome (skilled birth attendance) for a given birth, henceforth referred to as a unit. We conduct these analyses separately, and use a common notation Y for both outcomes, and W for both health insurance schemes. Denote the potential outcome for a given birth i by Y i (w) , with w ∈ 0,1. The individual treatment effect is the difference between the two potential outcomes, Our main identifying assumption is that of unobserved confounding, requiring thatY(1), Y(0) ⟂ W|X , or that after adjusting for the sufficient variable set X the 1 3 potential outcomes are independent of the observed insurance statusW . Further assuming no interference, consistency, 10 and overlap, 11 we can identify the estimands of interest, the average treatment effect (ATE), the average treatment effect on the treated (ATT), and the average treatment effect on the controls (ATC).
These three estimands answer different policy evaluation questions. The ATE = E[Y(1) − Y(0)] contrasts the potential outcomes in a world where everyone has a given insurance, and where no one has insurance, and takes the average of these causal contrasts over the pooled population of the uninsured and the insured. The ATT, defined as E[Y(1) − Y(0)|W = 1] answers the question: how much did those who had a certain insurance type benefit from having that health insurance, compared to not having insurance? Finally, the ATC, defined as E[Y(1) − Y(0)|W = 0] aims to answer the question: how much the uninsured would have benefitted from having a given insurance type? The ATC also allows us to contrast the impacts of the two insurance types, as the population for whom the benefits are calculated is held constant at the uninsured, representing a large portion of the population in Indonesia in the study period, including subgroups from all socioeconomic quintiles.
Beyond population average treatment effects, there is interest in the conditional average treatment effect (CATE) defined as The CATE can be conceptualised as a function that takes a combination of observed covariates that are assumed to modify the effect of the treatment, at a selected covariate profile x, and outputs a treatment effect that corresponds to this covariate profile. In the context of health insurance, we expect that a range of the observed covariates can modify the treatment effect, beyond the socioeconomic factors listed above. The geographical availability of health services may be one such example.

Estimation of average treatment effects using a parametric double-robust approach
As a starting point we assume a linear predictor for each outcome of interest with identity link: where Y i indicates (a) the survival status of infant i born in year t at 12 months after birth (b) whether the birth was attended by health professional, and the vector X i = (Z mt , Z ht , Z ct , p , t ) includes several components: Z mt denotes the characteristics of the mother (e.g. education), Z ht captures household characteristics (e.g. household asset quintile, social assistance), Z ct captures community level variables (e.g. availability of hospital or birth clinic in the neighbourhood, or availability of a village midwife in the year of (1) The no interference assumption requires that a unit's outcome is not affected by the treatment received by other units (Tchetgen Tchetgen and VanderWeele 2012). The consistency assumption requires that the observed outcome corresponds to the potential outcome under the observed treatment (VanderWeele 2009). 11 The overlap assumption requires that there must be a positive probability to be enrolled in a given health insurance programme, but this probability must be strictly smaller than 1: no covariate combination should fully determine a mother's insurance status. birth), p are the effects of unobserved time-constant factors at the province level, and t is the birth cohort indicator capturing shocks over time. W i is the treatment of interest, i.e. whether in the birth year of childi , the mother had a given health insurance ( W ∈ (0,1)) , is the treatment effect of interest. The residual term i is assumed normally distributed, mean zero, and captures a composite of any unobserved province, community, household, mother and child level shocks. It follows from the previously stated assumptions that W i is uncorrelated with i implying that any unobserved health shock to the mother, or income shock to the household, beyond those captured by the year fixed effects is unrelated to whether a mother is enrolled in health insurance in a given year. The outcome regression, Eq. (1), assumes a homogenous additive treatment, hence cannot be directly interpreted as estimating either one of the ATE, ATT or ATC defined before. Moreover, the model assumes a linear relationship between the outcome and the covariates being correct (Ho et al. 2007), and the resulting regression model relies heavily on extrapolation.
To address these restrictions, we also estimate propensity scores (Rosenbaum and Rubin, 1983), defined as p(X) = (W = 1|X) , estimated via logistic regression including all the sufficient covariates as in [1], 12 for each health insurance status. Because using PS only for confounder adjustment would result in a different limitation-bias stemming from poor overlap-we use the inverse propensity scores to weight linear outcome regression models to construct the so-called Wooldridge double robust (DR) estimator (Wooldridge, 2007), where both the reliance on extrapolation and potential overlap problems are reduced. We implement this method using the teffects command in Stata, and obtain the estimated ATE, ATT and ATC. The unweighted (OLS) regression results are presented in Appendix B Table 4. These more traditional estimates are later contrasted to the causal forest estimates for average treatment effects (see next section). All models use the same set of covariates for confounder adjustment.

Estimation of heterogenous and average treatment effects using causal forests
The ATE, ATT and ATC estimands allow for the causal effects to be different for those insured and uninsured, but do not capture their variation over the observed X covariates. For this we focus on the CATE, (x) . We begin by considering a partially linear model for the outcome of interest, as before, that is: with f (X) an unspecified function, and initially, that , the treatment effect, is constant in X . Following Robinson (1988), we can re-write this model in a "centred" or residualised form as follows where p(X i ) is the propensity score as before, and m X i = E Y i |X i the conditional expectation of the outcome, marginalised over the treatment. The expressions m(.) and p(.) are often referred to as "nuisance functions", and they can be estimated with any prediction algorithm, including ML methods. The causal effect τ can be estimated by solving Eq. (3), and plugging the predictions for m X i and p(X i ) in the following formula: This corresponds to running a regression of the Y-residual on the W-residual. Such "residualising" decreases the sensitivity of the resulting estimator to the errors in the estimates of the nuisance functions (Chernozhukov et al. 2018a). This can be extended to allow for heterogenous treatment effects, assuming a sufficiently small neighbourhood N(x) such that (x) is constant, which allows us to rewrite Eq.(4) as The main challenge for CATE estimation is how to choose N(x) . To solve this, Athey et al. (2019a) propose a generalised random forest approach, which conceptualises these neighbourhoods as a locally weighted set of neighbouring observations for a given value of x . The weights are estimated by performing a modification of the Random Forest algorithm (Breiman, 2001). In short, random forests calculate a predicted outcome for a unit by averaging the outcome of other units that are similar enough in covariates. The group of similar units are referred to as a leaf of a tree, and leaves are decided on by splitting the data based on cut-off values of the predictors, where the predictors to split on and cut-offs are decided so that the resulting splits minimise the prediction error in the sample. To reduce the noise stemming from using individual trees as predictors, this is done many times over bootstrapped samples of the data, and final predictions for each observation are obtained as the average of predictions over the bootstrap samples.
Generalised random forests build on this algorithm, but modify it in important aspects, to ultimately minimise the bias in the estimated CATE. First the outcomes and treatment are residualised as described before. Second, the splits of the data ("the causal trees") are formed by running the local linear regressions (Eq. 3) in each candidate split. Instead of choosing splits to minimise prediction error, they are chosen so that within a leaf, estimated treatment effects are similar (corresponding to homogenous treatment effects within a leaf), while between leaves, they differ (capturing treatment effect heterogeneity across units with differing X values). This procedure is performed on many bootstrap samples, thus forming causal forests. The causal forests are then used to calculate i (x) weights for each observation, based on how frequently an observation was used to estimate the treatment effect at x. The resulting weights are employed in an estimator of the CATE that modifies Eq. (4) as follows: Individual treatment effects ̂ (X i ) can be estimated by evaluating ̂ (x) at the covariate combination of each unit. Average treatment effects can also be obtained, by plugging in the estimated ̂ (X i ) in a variant of the augmented inverse probability weighting estimator (Robins et al. 1994): where the summation is taken over n D , that stands for the sample of the treated, the control or the treated plus control samples, depending on the whether the causal estimand is the ATT the ATC or the ATE, respectively. This formula also provides the subgroup average treatment effects, constraining the summation for units in the subgroups of interest (e.g. women with primary education only).
The causal forests approach, as implemented in the grf R package (Tibshirani et al. 2018), in its simplest form can be run using the causal_forest(X, Y, W) command, where X is the vector of confounders and potential effect modifiers, Y is the outcome, and W is the treatment. In order to improve the performance of the estimation, we follow the approach suggested by Athey and Wager (2019b). First, motivated by the double-machine learning literature (Chernozhukov et al. 2018a, b), this approach relies on an initial residualizing of the treatment and outcome variables (following Robinson 1988, as described in Eq. 3) in order to minimise confounding due to observed covariates. Second, the approach fits two sets of causal forests: first, a "pilot" causal forest, using all confounders as potential effect modifiers, then a final causal forest, using only those variables which were ranked highly in the variables importance analysis. This enables the final forest to make more splits on the most important features, even in situations where the heterogeneity in treatment effects is relatively weak (Athey and Wager (2019b).
The steps taken in this paper are described as follows: 1. Estimate nuisance parameters: Fit regression forests to estimate m X i and the p X i , then calculate residualised outcomes using these quantities (see Eq. 3). We use 500 trees to select the tuning parameters, and 1000 trees to obtain the predictions.

Train and fit causal forests:
a. Train an initial causal forest on 1000 bootstrap samples (with 500 trees to select tuning parameters), using the entire set of covariates for splits. b. Use the output from 2a, and rank variables in terms of variable importance in the initial causal forest (based on count of the proportion of splits on the given variable). Select those with higher than mean variable importance measure. c. Fit a second causal forest, using only those variables selected in Step 2b (with variable importance > mean variable importance). We use 500 trees for tuning, and 3000 trees for predicting ITEs.  (Chernozhukov et al. 2018b). This test assesses whether ̂ (x) captures any further information than simply using the ATE, ̂ to "predict" the individual level treatment effects.
c. Assess the final ranking of the variable importance measure, and form further subgroups based on the top ranked variables, and contrast the differences in the average treatment effects across these subgroups. d. Split individuals into two groups based on their estimated CATEs (below and above median), and describe these groups in a number of key characteristics.
We implement this approach for the skilled birth attendance outcome variable, and fit separate causal forests for the subsidised and contributory health insurance. The covariates used in Steps 1-2 include all variables used in the previous analyses. For the infant mortality outcome, we implement steps 1-3, and report average treatment effects.

Descriptive statistics
While the majority of births recorded in our dataset were not covered by any insurance scheme (Appendix B Table 2), subsidised health insurance saw a steep increase from 2005, while infant mortality decreased and the proportion of births attended by a midwife or physician demonstrated a clear upwards trend (Appendix C Fig. 1). In Table 1, we contrast the observed characteristics of the three groups: births insured by subsidised insurance, births insured by contributory insurance, and births not covered by health insurance in the year of birth, comparing the means and standardised differences for each treatment group to the control group. Most variables display large differences (standardised differences > 10%), with births under subsidised insurance being more likely to be from a rural household and from mothers who are older at birth, less likely to have studied at university and more likely to have only elementary school education, belong to lower wealth quintiles, and receive social assistance programmes, compared to those without subsidised insurance. By contrast, while those mothers with contributory insurance are also somewhat older at the time of birth than the uninsured, they are also more likely to have a university education, and are overrepresented among households within the highest asset index quintiles. A quarter of these mothers received subsidised rice, while only a small fraction received cash transfer (7%) or held a "Poor card" (4%). We interpret these large differences as indicative of a strong confounding of the relationship between health insurance and the outcomes of interest. Table 1 and Appendix Fig. 2 describe the covariate balance achieved after inverse probability weighting using the estimated (logistic regression based) propensity scores for both treatment groups compared to the control group, and contrasts these to the unweighted balance. Using weights that aim to recreate the distribution for the treated (ATE, ATC and ATT weights), the balance improves for each covariate, and standardised differences stay above 10% for only a few covariates, and the ATE weights showing somewhat worse balance than the ATT and ATC weights. Appendix Fig. 3 displays the distributions of the estimated propensity scores. While there is a good overlap between the propensity score distributions for both insurance types, there is a large mass around zero for the uninsured, implying that many of those who did not get the insurance were unlikely to get it based on their observed covariates. Table 2 reports the average treatment estimates for both outcomes, both using the Wooldrige DR (Panel A) and the causal forests (Panel B) approaches, and contrast these to unadjusted estimates that simply compare the means for the treated and control groups. For subsidised health insurance, the unadjusted results for infant mortality are small and insignificant, and even after covariate adjustment using the Wooldrigde DR and the causal forests estimators, there is no evidence that they are different from the null, across all estimands. For the contributory health insurance, there is strong evidence of a large protective (i.e. infant mortality-reducing) insurance effect of around 1-2 percentage points, with the estimated ATE and ATC larger than the ATT, indicating that the uninsured would have benefitted more from the insurance than those who were actually insured. This pattern repeats with the skilled birth attendance outcome, for both insurance types: the benefit in terms of increased access is larger among the untreated than among the treated, while these estimated effects are significant (p < 0.01) for the contributory health insurance, and not significant for the subsidised health insurance, and the causal forest estimates closely correspond to the DR IPW-regression estimates.   Figure 1 presents the distribution of the estimated individual treatment effects, as histograms of the point estimates. The formal test for treatment effect heterogeneity indicates the presence of heterogeneity for the contributory health insurance (p = 0.003), but not for the subsidised health insurance (p = 0.69). Table 3 presents the ranking of covariates in terms of their importance in predicting treatment effect heterogeneity, in terms of utilisation of skilled attendance when giving birth. For the contributory health insurance scheme, these largely overlap with the pre-specified socioeconomic covariates: education, wealth quintiles, and the rurality of the household. The most important variable associated with the estimated heterogeneous effect was the indicator for East Java province: a relatively industrialised region of Indonesia. For the subsidised health insurance scheme, the most influential variables were mother's age and the birth order of the child, followed by being in receipt of cash transfers and possessing a poor card. We then present the CATE among the controls for pre-specified subgroups (Fig. 2a) as well as subgroups constructed based on the variables suggested by the final causal forests variable importance Fig. 2b) (Also see Appendix Table 4). We detect large differences in subgroup ATCs for contributory health insurance corresponding to subgroups  suggested by variable importance: there is a strong trend in terms of wealth quintiles in the estimated subgroup effects, and there are also considerable differences in ATC reported between those with different education levels, and between rural and urban communities. The differences in the subgroup effects, while showing a similar direction, are much less pronounced for the subsidised health insurance, and there is no evidence in support of a subgroup ATCs being different from zero. Among the subgroups suggested by the causal forest variable importance, for the subsidised scheme we found some evidence (p < 0.05) of treatment effect for the subgroup with the third or higher birth order. None of these results were found to be sensitive to the choice of tuning parameters for the causal forest algorithm, which were selected outside of the cross-validation algorithm (number of trees used for tuning, number of trees used for the final Causal Forests). We present the selected tuning parameters in Appendix Table 5. As a final, exploratory analysis, we compare the characteristics of mothers when they are grouped based on the estimated individual level CATEs, using the median value as the cut-off (Appendix Table 6), and using SMDs for the comparison. It appears that mothers who benefitted relatively more from the subsidised health insurance are older, more likely to be in lower wealth quintiles, and more likely to have received cash transfer or rice subsidy, than those in the lower half of the treatment effect distribution. Those benefitting most from contributory health insurance are also more likely to belong to the lower wealth quintiles, less likely to have had higher levels of education, and twice as likely to have received subsidies, compared to those in the lower half of the distribution. There is no difference in the availability of health services among the two groups. To investigate the surprising result of East Java being an important driver of heterogeneity of the impact of contributory health insurance, we followed the suggestion of Semenova et al. (2021) and explored the independent contribution of the East Java variable, by running a linear regression on the double-robust score constructed from the estimated CATEs and the nuisance components, on a selected covariate vector: the prespecified subgroups and the East Java variable. Here we find that for a child being born in East Java, the beneficial effect of contributory health insurance is relatively larger than for the overall control group, even after controlling for the heterogeneity that can be attributed to education, socioeconomic status and rurality.

Discussion
This paper is the first study to characterise the effect heterogeneity of a health policy intervention by employing causal forests, a causal machine learning approach that is quickly gaining popularity in economic and social science research. We highlighted the role of this approach for establishing heterogeneity of policy impacts, and further for suggesting the main observed covariates driving such heterogeneity. Our study also highlights a crucial challenge when using this approach to estimate treatment heterogeneity in an observational framework: the need to adjust for the key observable confounders in the institutional setting of interest. The causal forests algorithm allows to adjust for observed confounding in the first step of the analysis, by using the outcome and treatment residuals (from the corresponding models adjusting for the confounding variables), using flexible machine learning algorithms, in this case random forests. While in this study we use regression forests to estimate these residuals, other supervised learning algorithms, such as ensembling machine learning methods, may also be used for this step. Even when the nuisance parameters are estimated with non-forest-based algorithms, the causal forests method still utilises the extension of the random forest algorithm described earlier, for estimation of the CATEs. Ongoing work (Nie and Wager 2021) further generalises the causal forests approach and allows for any supervised learning algorithms that can solve a loss function designed to target the CATE estimand (the so-called R-learner loss function). Moreover, our analysis demonstrated the ability of the causal forest algorithm to facilitate pre-specified subgroup analysis without having to re-fit propensity score and outcome regression models for the subgroups, but instead taking the estimated individual level CATEs, and plugging them in an augmented IPW estimator for average treatment effects. We then used formal statistical tests to assess the presence of treatment effect heterogeneity. It should be noted that the causal forest approach is not designed to identify and conduct inference on subgroups with the largest (or lowest) treatment effects. It can, however provide indicative evidence on which variables are most strongly associated with heterogeneity, using variable importance measures that characterise the resulting causal forest. We used these variable importance measures to select further variables to assess subgroup effects on. There is ongoing work on developing estimators specifically for group average treatment effects discovered by machine learning (Chernozhukov et al. 2018b).
This paper also contributes to the growing body of evidence on the impact of public health insurance on health outcomes and health care utilisation, by estimating the average and heterogenous treatment effects of two main types of health insurance in Indonesia on infant mortality, and on maternal health care utilisation at the time of delivery. We find that enrolment in contributory health insurance reduced infant mortality on average by 1.0 percentage points (p < 0.05) among those who were insured, corresponding to a sizeable 30% reduction from the average infant mortality rate (i.e. infant deaths per 1000 live births) over the observation period. By contrast, we found no evidence of an effect of subsidised health insurance. Our findings for the health care utilisation outcome may help explain these results: contributory insurance increased the expected probability of having a birth attended by a healthcare professional, while there was no such effect for the subsidised scheme. Our findings mirror the previous evidence that found small to negligible impacts of subsidised health insurance schemes on health services utilisation (Johar 2009, Sparrow et al. 2013, Erlangga et al. 2019a), but they are also consistent with the findings of Anindya et al. (2020) who found that the JKN programme improved the utilisation of skilled birth attendance, on a population that pooled subsidised and contributory recipients.
We delved deeper into this, by examining the heterogeneity in the effects for both insurance schemes. The estimated causal effects on health care utilisation among the uninsured appear to be higher than among the insured (that is, on average, we expect those uninsured in the study period would benefit from being insured more than the expected benefit estimated amongst those who are insured). Indeed, we found that the benefits, in terms of increased access to skilled birth attendance, are relatively higher among the more vulnerable subgroups, reflecting the findings of Anindya et al. (2020). While pre-specified socioeconomic variables ranked high in terms of being associated with treatment effect heterogeneity, we found further variables that according to the variable importance of the causal forest algorithm were more strongly associated with treatment heterogeneity: for example, women residing in certain provinces (e.g. East Java) would have benefited more than other subgroups, had they been insured (contributory vs remaining uninsured). Given East Java has development indicators above the county average (Unicef 2019), this result might hint towards the potential importance of non-health infrastructure in predicting the benefits of a demand side health policy such as health insurance. For subsidised health insurance, there was no strong evidence of a causal treatment effect for any of the pre-specified population subgroups. While we found no evidence of heterogeneity, the variable importance of the CF algorithm suggested the child's birth order may drive treatment heterogeneity, with the resulting subgroup of children who were third born or higher having the highest causal average treatment effect, with a 95% CI that excluded zero.
Our study has some limitations. Due to infant mortality being a rather rare event (approximately 300 events out of 12,000), we could not conduct a subgroup analyses for this outcome. Furthermore, because we use household survey data, covariate information was collected at discrete time points, which were assumed to provide valid baseline measurements for births that occurred closer to or further away from the survey dates. This measurement error can lead to a downward bias in the estimated coefficients.
Our analysis assumed no unobserved confounding, given the measured household and individual characteristics, year and province fixed effects. Previous impact evaluations of health insurance expansions in Indonesia also relied on adjustment for observed confounders: for example, Johar (2009) and Anindya et al. (2020) utilised a propensity score matching approach for the evaluation of the health card and JKN programmes, respectively. Sparrow et al. (2013) combined propensity score matching with differences-in-differences when evaluating the Askeskin programme, and Erlangga et al. (2019a) used the same approach for the evaluation of the JKN programme. By controlling for self-reported health, literacy and socioeconomic factors, we aimed to approximate the process of selection into subsidised health insurance and hence minimise any remaining bias due to selection on unobserved factors that could compromise comparisons of outcomes between individuals in subsidised insurance with those uninsured.
Nonetheless, we cannot fully rule out residual confounding biasing the results-this bias might appear both in the average treatment effect estimates, the estimated CATEs, and through them, in the resulting subgroup average treatment effects. For example, we cannot rule out that for the contributory health insurance evaluation, some of the difference the ATC and the ATT, and between the CATC for the socioeconomic subgroups, may be due to remaining unmeasured confounding. One potential avenue to explore this would be to use methods that can explicitly handle unobserved confounding (instrumental variables (IVs) or regression discontinuity design). Under the strong assumption of an instrument being valid (predictive of health insurance, and meeting the exclusion criteria), we could re-estimate the average and subgroup average treatment effects, and comparing these to the causal forest estimates, potentially discover that in certain population subgroups, unobserved confounding played a relatively larger role. In order for unobserved confounding to explain the differences in the estimated CATEs in the current study, unmeasured confounding should be stronger in the more deprived population subgroups, something we don't have a priori reasons to believe. While for the current study we did not find IV that meet the exclusion criteria, this is an area of further investigation. A second approach, potentially promising to explore the sensitivity of heterogenous treatment effects to unobserved confounding is to employ sensitivity analysis to calculate bounds on the CATEs. This is an innovative, and currently developing area of methodological research (see e.g. Kallus et al. 2019), and is hence outside of the scope of our paper.
Despite these limitations, this paper provides a novel demonstration of the value of causal machine learning for public policy evaluation, in a setting where heterogenous treatments effects have the potential to critically inform policymakers. Based on our work, we can suggest at least two promising avenues for future related research. First, to address concerns of remaining unobserved confounding, Generalised Random Forests could be combined with instrumental variables, where valid instruments are available (Athey et al. 2019a). Second, the estimated individual treatment effects could be used to formulate socalled "optimal policy rules": treatment assignment mechanisms that maximise a pre-specified welfare function set by the decision-maker (Athey and Wager 2020). For the Indonesian context, such optimal policy rules could inform health policymaking in the country by guiding the re-design of the eligibility criteria for subsidised health insurance, which could help address the fiscal challenges brought by the move towards Universal Health Coverage. Beyond the specific case of Indonesia, causal machine learning may be used to help target policy efforts towards where the greatest potential benefits can be realised, thereby helping to pinpoint where adaptation of policy may be needed. In doing so, this could enable researchers to move policy impact evaluations beyond simple binary judgements on whether something 'works' or not, towards matters of for whom policies 'work' and how these can be improved.

Appendix B
See (Tables 4, 5 , 6, 7, 8, 9).     Fig. 3 Trends in the probability of infant mortality, health care utilisation, and in the proportion of births covered by subsidised and contributory insurance Fig. 4 Overview of balance after propensity score weighting: standardised mean differences of covariates involved in the propensity score analysis