Predicting household resilience with machine learning: preliminary cross-country tests

Using a unique cross-country sample from 10 impact evaluations of development projects, we test the out-of-sample performance of machine learning algorithms in predicting non-resilient households, where resilience is a subjective metric defined as the perceived ability to recover from shocks. We report preliminary evidence of the potential of these data-driven techniques to identify the main predictors of household resilience and inform the targeting of resilience-oriented policy interventions.


Introduction
Following a surge of interest in the so-called "prediction policy problems" (Kleinberg et al. 2015), the literature on the use of machine learning (ML) in economics and public policy studies is rapidly expanding (Athey 2018; Athey & Imbens 2018; Kleinberg et al. 2018; Mullainathan & Spiess 2017). In parallel, the notion of resilience, defined as the capacity over time of individuals, households, or communities to withstand a myriad of shocks and stressors, is becoming a central paradigm in the development agenda. Its theoretical underpinnings, as well as different empirical methodologies for its measurement, have lately been validated in several scientific articles belonging to the so-called 'development resilience' literature (see, among many, Barrett & Constas 2014; Brück et al. 2019; Cissé & Barrett 2018; d'Errico et al. 2019; d'Errico et al. 2020; Smith & Frankenberger 2018). This paper lies at the intersection of these two strands of research.
A separate and nascent body of empirical work has started testing the potential of ML in predicting well-being measures. In development economics, ML has lately been applied to predict and map poverty (Blumenstock et al. 2015; Jean et al. 2016; Kshirsagar et al. 2017; McBride & Nichols 2018; Perez et al. 2019; Steele et al. 2017) as well as food security (Hossain et al. 2019; Lentz et al. 2019) outcomes, highlighting the great potential of these predictive tools to address the long-standing problem of the (in)effective targeting of development programmes.
A recent and comprehensive review of the flourishing literature devoted to the conceptualization and measurement of development resilience is carried out by Barrett et al. (2021). The authors first highlight the three main conceptualizations of development resilience: (i) resilience defined as a capacity, e.g. the "capacity that ensures stressors and shocks do not have long-lasting development consequences", that is captured as a latent and multidimensional variable combining observable and unobservable features (Alinovi et al. 2008, 2010; Brück et al. 2019; d'Errico et al. 2020; d'Errico & Di Giuseppe 2018; Smith & Frankenberger 2018); (ii) resilience as a normative condition, i.e. the probability of achieving some minimal standard of living conditional on many observable characteristics and exposure to shocks (Barrett & Constas 2014), which implies that resilience is treated as an outcome in impact evaluation studies (Knippenberg et al. 2019; Upton et al. 2016); (iii) resilience as return to equilibrium in the aftermath of a shock, where the focus is on the ex-post effects of the shocks experienced on some well-being outcomes (Hoddinott 2014; Knippenberg et al. 2019).
Then, they provide an overview of the empirical quantitative literature on resilience, emphasizing several limitations of current approaches involving theoretical, empirical, and data-related constraints. Importantly, among these, Barrett et al. (2021) stress that concerns have been raised about the ability of the most popular methodologies for resilience measurement described above (which do not make use of ML techniques) to accurately predict outcomes out-of-sample. This is a task that ML models, which are built to excel at predicting outcomes (Varian 2014), can, in principle, accomplish. Indeed, scholars have recently called for harnessing the opportunities provided by machine learning algorithmic procedures to identify better predictors of resilience, predict and highlight the presence of vulnerability hotspots (Jones et al. 2021) and, in turn, improve the design of effective early warning mechanisms (McBride et al. 2021). To the best of our knowledge, however, only one paper to date has investigated how ML methods can predict household resilience, namely the contribution by Knippenberg et al. (2019). As part of a broader empirical exercise involving a comparison among different methodologies, the authors of this study apply two ML techniques, namely the Least Absolute Shrinkage and Selection Operator (LASSO) and random forests, to identify the best predictors of a resilience measure based on the Coping Strategy Index of Malawian households.
We build on their pioneering work by providing preliminary cross-country evidence on the potential of ML to improve the study of household resilience as well as the targeting of policy interventions. Importantly, we are interested in accurately predicting resilience status in addition to identifying its best predictors. For this reason, unlike Knippenberg et al. (2019), we tackle resilience prediction as a classification problem rather than a regression one. In addition, we focus on a cross-country context and use a different proxy for household resilience and a broader set of ML algorithmic routines.
Leveraging a large dataset spanning 10 countries and data-driven resilience prediction via ML, we show that: (i) ML algorithms perform well even when studying households from very different contexts and with a limited amount of widely available information; (ii) simpler algorithms perform almost as well as 'black-box' methods (i.e. complex predictive techniques that do not produce an understandable model and are thus characterized by scarce or null explainability) and may be preferable because of their transparency and interpretability.
The results shed light on the predictive potential of ML to both improve the allocation of projects' funding and better target resilience-oriented policy interventions to those most in need, which would, in turn, maximize the beneficial effects of these development policies.
The rest of this paper is structured as follows. Section 2 presents the data and the machine learning approach. Section 3 reports the results of the empirical analysis. Section 4 discusses the main policy implications and concludes.

Data
Our dataset is composed of cross-sectional household-level surveys from micro-level impact evaluations fielded by the International Fund for Agricultural Development (IFAD). These impact assessments evaluated a selection of the Fund's development projects that closed between 2016 and 2018. Among these studies, we only selected those with available comparable resilience metrics and socio-economic characteristics. This led us to a final sample of 10 countries and more than 14,000 household observations. All the data come from cross-sectional surveys, with the partial exception of the PASIDP-I project in Ethiopia. The projects included in our dataset are listed in Table 3 in the Annex.
Concerning the outcome variable, we employ a subjective metric of resilience, i.e. the ability to recover from shocks (ATR). This metric is constructed based on answers to the following question: "To what extent were you and your household able to recover from shock x?". ATR thus represents a self-assessment from the interviewed households and takes the form of a categorical variable which ranges from 1 to 5 according to the following scale:
a. Did not recover (= 1).
b. Recovered to some extent, but worse off than before (= 2).
c. Recovered to the same level as before (= 3).
d. Recovered, and better off than before (= 4).
e. Experienced the shock but was not significantly affected (= 5).
This question is asked repeatedly for a roster of several different shocks x (droughts, floods, crop diseases, etc.) that the households might have experienced in the year prior to the survey. We first average the ATR over all shocks experienced by the household to obtain an average ATR for each household. In the following step, we create a binary outcome variable to discriminate between resilient and non-resilient households. This dummy variable takes value 1 if the average household ATR is below the sample mean and 0 otherwise. A value of 1 thus indicates non-resilient households and a value of 0 resilient ones.
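The two-step construction of the outcome described above can be sketched as follows. This is a minimal illustration on hypothetical toy data, not the code used in the study: the household identifiers, the ATR scores, and the function names are all assumptions introduced for the example.

```python
from statistics import mean

# Hypothetical toy data: household id -> ATR scores (1-5), one per shock experienced.
atr_by_household = {
    "hh1": [1, 2],
    "hh2": [3],
    "hh3": [4, 5, 3],
    "hh4": [2, 2, 3],
}

def household_average_atr(scores_by_household):
    """Step 1: average the ATR over all shocks each household experienced."""
    return {hh: mean(scores) for hh, scores in scores_by_household.items()}

def non_resilient_dummy(avg_atr):
    """Step 2: 1 if the household average ATR is below the sample mean, else 0."""
    sample_mean = mean(avg_atr.values())
    return {hh: int(value < sample_mean) for hh, value in avg_atr.items()}

avg_atr = household_average_atr(atr_by_household)
outcome = non_resilient_dummy(avg_atr)
print(outcome)
```

In this toy sample, the households with average ATR below the sample mean are coded 1 (non-resilient) and the others 0 (resilient), mirroring the coding used for the dummy.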
The use of a binary outcome is dictated by our preference to tackle the resilience prediction problem as a classification one. Our assumption is that discriminations above or below clear cut-offs are more intuitive for practitioners, policymakers and humanitarian agencies that aim at efficiently targeting their policy interventions, and therefore predicting cut-offs rather than continuous values is more useful for practical purposes (Lentz et al. 2019). The choice of a subjective resilience metric is driven by: (i) data availability; (ii) the assumption that households are in the best position to assess the extent of shock impacts on their welfare and their post-shock recovery, as well as existing evidence that self-reported measures of well-being go hand in hand with objective indicators (Knippenberg et al. 2019); (iii) the increasing use in recent studies of subjective approaches and self-evaluations as resilience metrics which represent valid alternatives to objective indicators (Jones & Tanner 2017;Jones & d'Errico 2019).
As far as the features that may predict resilience are concerned, we employ a set of 14 predictors whose list, summary statistics, and details are reported in Table 4 in the Appendix. These are the most relevant variables that were common and comparable across all the surveys in our pooled dataset and include demographic characteristics, income measures, asset-based indices, food security proxies, and shock exposure metrics, represented by the number of shocks experienced and their perceived severity.
Importantly, we do not provide the algorithms with information about the country, region, district, or village of origin of our households, for three reasons: (i) our samples are not nationally representative from a geographic point of view; (ii) these geographic dummies would not provide any useful information for targeting as we aim to scale up these projects in other contexts; (iii) we are interested in providing useful insights based on generalizable socio-economic and demographic characteristics, not in identifying resilience clusters derived from geographically non-representative samples.
Finally, as some variables had a small number of missing observations, and since machine learning algorithms differ in how they handle missing values, we imputed missing values via proximity through a random forest algorithm to make the results comparable across different methods.

Methods
We focus on a purely predictive problem, the prediction of non-resilient households. We are thus interested in minimizing the predictive error on previously unseen data (the so-called 'test error'), not in the causal impact of any of the features.
To this aim, we employ supervised ML techniques. Machine learning is a subfield of artificial intelligence. ML algorithms have been developed in computer science and statistical literature to deal with predictive tasks (Varian 2014). The aim of ML techniques is to minimize the out-of-sample prediction error and generalize well on future data (Athey and Imbens 2019; Mullainathan and Spiess 2017). Supervised ML involves building a statistical model for predicting an output based on one or more inputs (Lantz 2019).
The standard ML routine is to randomly split the original sample into two disjoint sets: the training set, on which ML algorithms are trained, and the testing set, which is used to evaluate the predictive ability of ML models on previously unseen data. This introduces the so-called 'firewall' principle: none of the data involved in generating the prediction function is used to evaluate it (Mullainathan and Spiess 2017). The out-of-sample performance of the model on the unseen held-out data then constitutes a reliable and generalizable measure of the 'true' performance of the models on future data. Following this scheme, we randomly split our dataset into a training set, consisting of 2/3 of the whole sample, and a testing set, composed of the remaining 1/3.
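A random split of this kind can be sketched in a few lines. The function name, the seed, and the 15-element placeholder sample below are assumptions made for illustration; the sketch only shows that the two sets are disjoint and together exhaust the sample, which is what the 'firewall' principle requires.

```python
import random

def train_test_split(rows, train_share=2 / 3, seed=0):
    """Randomly split `rows` into disjoint training and testing sets."""
    shuffled = list(rows)                      # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)      # seeded shuffle for reproducibility
    cut = round(len(shuffled) * train_share)   # size of the training set
    return shuffled[:cut], shuffled[cut:]

# Hypothetical sample of 15 household identifiers.
households = list(range(15))
train, test = train_test_split(households)
print(len(train), len(test))
```

No observation appears in both sets, so the testing set plays no role in fitting or tuning the models.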
We test the performance of five supervised ML algorithms: classification trees; two ensemble methods based on decision trees, namely bootstrap aggregating (bagging) and random forests; k-nearest neighbour (k-NN); and support vector machine (SVM). These techniques are characterized by different degrees of flexibility and complexity, ranging from the simpler classification tree to black-box models such as SVM and random forest. Higher flexibility comes at the cost of a loss of interpretability. With the exception of classification trees, none of the other methods produces readily interpretable, easy-to-explain outputs to understand how the features are related to the class.
Classification trees are based on recursive partitioning, also known as the 'divide and conquer' approach (Lantz 2019). Via recursive binary splitting, the tree is grown by repeatedly splitting the data into smaller and smaller subsets until sufficient within-subset homogeneity or a stopping criterion is reached. As trees can suffer from high variance, i.e. they are quite sensitive to small changes in the training sample and prone to overfitting, we also apply bagging and random forest to our classification problem. These ensemble methods build x trees from x bootstrapped training sets and take a majority vote among the x predictions (Hastie et al. 2009). The difference is that, at each split in the trees, bagging considers all p features as split candidates, whereas random forest randomly subsamples m of the p features as candidates, thus introducing an additional layer of randomness that further decorrelates the trees.
k-NN is a non-parametric method that uses information about an example's k nearest neighbours to classify unlabelled examples. For each observation in the testing set, the algorithm identifies the k closest observations from the training sample and assigns a prediction on the basis of a majority rule, taking as prediction the most frequent outcome among those of the nearest neighbours. Finally, SVM creates a boundary called a hyperplane to divide the multidimensional feature space into homogeneous partitions and is able to model highly complex relationships.
For all our algorithms, we use tenfold cross-validation on the training data to tune key hyperparameters and balance the bias-variance trade-off. The training sample contains 9420 households, of which 47.3% are resilient and 52.7% non-resilient; the testing sample contains 4854 households, of which 47% are resilient and 53% non-resilient.
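To make the tuning step concrete, the sketch below combines the k-NN majority rule with tenfold cross-validation to choose k. It is an illustration under stated assumptions rather than the procedure used in the study: the data are synthetic (two Gaussian clusters), Euclidean distance is assumed, and the candidate values of k, the seeds, and the function names are hypothetical.

```python
import math
import random
from collections import Counter

def knn_predict(train_X, train_y, point, k):
    """Majority vote among the k training observations closest to `point`."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], point))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def cv_accuracy(X, y, k, n_folds=10, seed=0):
    """Mean k-NN accuracy over n_folds cross-validation folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]  # disjoint folds
    scores = []
    for held_out in folds:
        held = set(held_out)
        fit = [i for i in idx if i not in held]        # the other nine folds
        fit_X = [X[i] for i in fit]
        fit_y = [y[i] for i in fit]
        hits = sum(knn_predict(fit_X, fit_y, X[i], k) == y[i] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / n_folds

# Synthetic training data (assumption): two clusters of 50 points, labelled 0 and 1.
rng = random.Random(1)
X = [(rng.gauss(c, 1.0), rng.gauss(c, 1.0)) for c in (0, 3) for _ in range(50)]
y = [label for label in (0, 1) for _ in range(50)]

# Odd values of k avoid tied votes with a binary outcome.
candidate_k = [1, 3, 5, 7, 9]
cv_scores = {k: cv_accuracy(X, y, k) for k in candidate_k}
best_k = max(cv_scores, key=cv_scores.get)
print(best_k, cv_scores[best_k])
```

Because every fold is drawn from the training data only, the testing set remains untouched until the final evaluation.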
After training and tuning the algorithms on the training sample, we evaluate out-of-sample performances in the testing set via confusion matrices in which we compare the predicted and actual values of our binary outcome, resilience status.

Results
The results are reported in Table 1. All the algorithms have an accuracy rate above 72% and an even higher sensitivity. Sensitivity is the proportion of actual positives (non-resilient households, y = 1) correctly identified and is the metric we are most interested in. For all the algorithms, sensitivity is close to 80%. Classification trees perform comparatively well, especially in terms of sensitivity. More complex methods based on decision trees, namely bagging and random forest, perform better than the tree on all the metrics, i.e. specificity (the proportion of actual negative cases, y = 0, correctly identified), sensitivity, and overall accuracy, but not significantly so. As for the other two 'black-box' methods, k-NN performs slightly worse than the tree in terms of the accuracy rate but leads to a higher sensitivity, while SVM performs better than the tree but worse than bagging and random forest. Overall, the random forest is the best-performing algorithm.
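The three metrics follow directly from the cells of a confusion matrix. The sketch below computes them for hypothetical vectors of actual and predicted outcomes (1 = non-resilient, 0 = resilient); the numbers are invented for illustration and are not those in Table 1.

```python
def confusion_metrics(actual, predicted):
    """Accuracy, sensitivity and specificity for a binary outcome coded 0/1."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)  # non-resilient, correctly flagged
    tn = sum(a == 0 and p == 0 for a, p in pairs)  # resilient, correctly flagged
    fp = sum(a == 0 and p == 1 for a, p in pairs)  # resilient, predicted non-resilient
    fn = sum(a == 1 and p == 0 for a, p in pairs)  # non-resilient, missed
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical test-set outcomes and predictions.
actual = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]
metrics = confusion_metrics(actual, predicted)
print(metrics)
```

For targeting purposes, a missed non-resilient household (a false negative) lowers sensitivity, which is why that metric receives the most attention here.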
The classification tree is illustrated in Fig. 1. Five features appear: the (perceived) mean severity of shocks, total gross income, the Household Dietary Diversity Score (HDDS), the agricultural asset index, and household size. Combinations of these five variables produce the tree represented in Fig. 1. For example, if the severity of shocks is higher than 3.4 and household size is lower than 13, the algorithm predicts the household as non-resilient. Conversely, if the perceived severity of shocks is less than 3.4, the resilience status depends on interactions between additional variables other than shock severity, such as income, food security, and agricultural asset index. For instance, if shock severity is less than 3.4, but total gross income is equal to or higher than 1585 dollars per year, the household is predicted as resilient. If income, instead, is lower than this threshold, and the HDDS is lower than 4.5, the household is predicted as non-resilient. This assignment mechanism goes on until all the observations are placed within one of the nodes. While no interpretable output is available for k-NN and SVM, bagging and random forest provide a ranking of the predictors. We report the five most important variables according to these algorithms in Table 2.
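The branches of the classification tree that are spelled out in the text can be written as a simple assignment rule. The sketch below encodes only those branches; every path that the text does not describe (including a severity of exactly 3.4) returns None rather than a guess. The function and argument names are hypothetical.

```python
def predict_from_described_branches(severity, household_size, income, hdds):
    """Apply only the tree branches described in the text.

    Returns 'non-resilient', 'resilient', or None when the household falls
    in a branch that the text does not describe.
    """
    if severity > 3.4:
        if household_size < 13:
            return "non-resilient"
        return None  # larger households: branch not described in the text
    if severity < 3.4:
        if income >= 1585:  # total gross income, dollars per year
            return "resilient"
        if hdds < 4.5:
            return "non-resilient"
        return None  # low income, higher HDDS: further splits not described
    return None  # severity exactly 3.4: boundary not specified in the text

print(predict_from_described_branches(4.0, 6, 900, 5))   # severe shocks, small household
print(predict_from_described_branches(2.0, 6, 2000, 5))  # milder shocks, higher income
```

A rule of this form is what makes the tree attractive for targeting: it can be applied by hand with a handful of survey variables.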
The score assigned to each variable represents the mean decrease in the Gini index, i.e. the total reduction in node impurity attributable to splits on that variable, averaged over all trees. Both bagging and random forest agree with the tree about the predominant importance of the severity of shocks and household income. The agricultural asset index also appears in the top five. Differently from the tree, bagging and random forest assign a high score to total cultivated land and the household asset index, whereas the HDDS and household size rank lower and are not amongst the most important variables. In sum, households experiencing more severe shocks and endowed with low levels of income and assets tend to be predicted as non-resilient. The fact that the inability to withstand shocks is associated with such features is of course not unexpected, but it is remarkable that, based on such a limited amount of information, the algorithms correctly identify up to four-fifths of previously unseen non-resilient households without even knowing the country, region, district, or village of origin of each household. In turn, this makes data-driven resilience prediction via machine learning an appealing tool for targeting and policy purposes, especially in data-scarce environments that are a frequent and recurrent feature of many developing contexts.
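As a reminder of what a Gini-based importance score aggregates, the sketch below computes the Gini impurity of a node and the impurity decrease produced by a single binary split. In a tree ensemble, such decreases are accumulated over the splits made on a given variable and averaged across trees; the labels and function names here are hypothetical and only illustrate the building block.

```python
def gini(labels):
    """Gini impurity of a node holding binary labels coded 0/1."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)      # share of class 1 in the node
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

# Hypothetical node with three non-resilient (1) and three resilient (0) households.
parent = [1, 1, 1, 0, 0, 0]
clean_split = gini_decrease(parent, [1, 1, 1], [0, 0, 0])   # perfectly separating split
noisy_split = gini_decrease(parent, [1, 1, 0], [1, 0, 0])   # weakly separating split
print(clean_split, noisy_split)
```

A variable whose splits separate the classes more cleanly yields larger decreases and therefore ranks higher.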

Implications and conclusions
Can machine learning be leveraged to predict household resilience? As there is empirical evidence demonstrating that the most common resilience measures have limitations in predicting well-being out-of-sample, we deem this a particularly important question.
In this paper, we perform simple and preliminary tests to show that supervised machine learning algorithms can be successfully employed to predict household resilience status as well as identify the main features that drive such predictions. ML techniques were able to correctly classify over three-quarters of the observations and four-fifths of the non-resilient households. We reckon that this is a noteworthy performance, considering that we did not provide the algorithms with the country of origin or other non-generalizable geo-information. The variables we use as features, in fact, are widely available in most of the micro-surveys from developing contexts. The cross-country nature of our dataset provides more external validity to our findings than predictive studies based on a single-country sample.
The implications for policy targeting are evident: policy interventions in the aftermath of covariate shocks such as conflicts, natural disasters, or economic crises could exploit the potential of these techniques to more accurately target non-resilient households based on the features identified and the thresholds indicated as part of the classification algorithm. Specifically, by providing a specific assignment rule determined by the analyses in question, policy implementers can improve the allocation of financial resources by better targeting resilience-enhancement interventions. This would eventually maximize the potential of these development policies to generate beneficial impacts (the so-called 'treatment effects') for the most affected portions of rural populations.
Central to the debate on resilience-enhancing development projects is how to effectively target the less resilient with policies that can boost adaptive capacity. The implications of a simple ML predictive exercise such as the one we have conducted in this study suggest that policy implementers could exploit the potential of ML to improve the allocation of project resources and better target resilience-enhancement interventions to those most in need. This, in turn, would eventually maximize the potential of these development policies to generate beneficial effects for the most affected portions of rural populations. In addition to their potential to 'fine-tune' targeting mechanisms, ML-based predictions of this type can also be employed to refine early warning system development (McBride et al. 2021).
For these policy purposes, ML methods that indeed provide a clear, intuitive, and straightforward assignment mechanism or targeting rule, such as classification trees, may be preferred because of their intrinsic simplicity and resemblance to human decision-making, especially when more complex, black-box methods do not perform significantly better, as was the case in our study.
Our work provides new insights on a key notion in development economics by proposing an empirical approach to tackle the identification of household resilience as a "prediction policy problem" (Kleinberg et al. 2015). While our preliminary evidence provides empirical support to recent calls to leverage ML tools to shortlist variables for targeting purposes and highlight hotspots of vulnerability (Jones et al. 2021;Knippenberg et al. 2019), it is far from being conclusive on the matter.
Further work should compare the performance of ML on subjective resilience with that on objective metrics. More generally, many different resilience approaches and several cut-offs could and should be tested under the ML lens. A comparison of classification approaches with numeric prediction methods, such as regression trees, can also provide valuable insights, especially on the consistency of the best resilience predictors across different models and methodologies. Another crucial test is to check the stability of ML-based prediction accuracy, which can be a weakness of ML models, in tracking resilience outcomes over time, especially by using high-frequency longitudinal data, along the lines of Knippenberg et al. (2019). Finally, it is key to shed light on the effectiveness and accuracy of actual targeting rules of closed resilience-oriented projects through a comparison with ML predictions in rigorous ex-post targeting evaluation exercises. All these key issues are deferred to future research.
'Resilience status' is a binary variable taking value 1 if household average ability to recover is below the sample mean and 0 otherwise. 'Treatment' is a dummy taking value 1 if the household was in the treatment group and 0 otherwise. 'Gender of the household head' is a dummy taking value 1 if the household head is female and 0 otherwise. 'Education level of the household head' is a categorical variable which can take the following values: 0 = no education; 1 = primary education; 2 = secondary education; 3 = higher education. 'Total gross income' and 'Gross crop income' are annual measures expressed in constant 2018 US dollars. 'Total cultivated land' is measured in hectares. The 'Household Dietary Diversity Score' ranges from 0 to 12 and has a reference period of 7 days for all the country samples except China, for which a reference period of 1 day is used. 'Asset Index' and 'Agricultural Asset Index' are standardized measures of assets which range from 0 to 1 and have been generated for each country sample via factor analysis, using exclusively the assets that were common across all the datasets. 'Mean severity of shocks' is the household average, for all shocks, of a self-reported categorical variable indicating the impact of each shock experienced as assessed by the household. The variable ranges from a score of 1 to 5 as follows: No impact = 1; Slight impact = 2; Moderate impact = 3; Strong impact = 4; Worst ever happened = 5.