Abstract
Growing attention in the empirical literature has been paid to the effect of climate shocks and climate change on migration decisions. Previous studies reach differing results and rely on a multitude of traditional empirical approaches. This paper proposes a tree-based Machine Learning (ML) approach to analyze the role of weather shocks in an individual’s intention to migrate in six agriculture-dependent economies: Burkina Faso, Ivory Coast, Mali, Mauritania, Niger, and Senegal. We run several tree-based algorithms (e.g., XGB, Random Forest) within a train-validation-test workflow to build robust, noise-resistant models. We then determine the important features and the direction in which they influence migration intention. This ML-based estimation accounts for weather shocks, captured by the Standardized Precipitation-Evapotranspiration Index (SPEI) at different timescales, and for various socioeconomic covariates. We find that (i) the weather features improve prediction performance, although socioeconomic characteristics have more influence on migration intentions, (ii) country-specific models are necessary, and (iii) the intention to move internationally is influenced more by SPEIs at longer timescales, while the general intention to move (which includes internal moves) responds more to shorter timescales.
Data Availability
The Gallup dataset is proprietary; we cannot make it publicly available due to copyright restrictions.
Notes
The major difference between the logistic regression in the ML approach and the regressions used in [6] is that the former runs a single regression including all features (covariates), while [6] runs multiple regressions (i.e., one per feature).
R-squared can be computed using McFadden’s R\(^{2}\) formula [12]. Bertoli et al. [6] use the R-squared measure implemented in STATA [13]: \(1 - L_M/L_0,\) where \(L_M\) is the log-likelihood of the model and \(L_0\) is the log-likelihood of a null model. A null model learns only from the target attribute, with no predictors.
A dummy variable that represents categorical data.
GADM: the Database of Global Administrative Areas.
There are several ways to configure the year variable: (i) use the integer value for each year, (ii) subtract the minimum year from each year to obtain smaller numbers starting at 0, and (iii) treat the integer as a categorical variable and perform one-hot encoding. Here, we use the last approach.
We used the R package correlationfunnel which is fast and offers visualizations to facilitate this work.
The interviews are conducted in different months for different countries and the month of interview may be different for each year (Fig. 3).
SPEI at 3 months timescale for May 2015 is a function of the sum of the climatic water balance of March, April, and May 2015.
By construction, SPEI has a zero mean and a standard deviation of unity.
To get a SPEI at 12 months timescale with lag 6 for an individual interviewed in May 2015, the SPEI value is the SPEI12 value 6 months ago in November 2014.
We find that the results are similar with permutation feature importance. Refer to Fig. 19 in the Appendix.
For longer timescales (\(\ge \)18 months), see https://climatedataguide.ucar.edu/climate-data/standardized-precipitation-evapotranspiration-index-spei.
The international move’s question (Q2) actually asks people if they want to move permanently to another country.
The economic activity of the countries we consider in this article depends highly on the agricultural sector. Given that irrigation infrastructure is lacking and that agriculture is mainly rainfed, weather conditions contribute greatly to agricultural production and income generation.
References
Beine M, Jeusette L (2018) A meta-analysis of the literature on climate change and migration. J Demogr Econ 1–52. https://doi.org/10.1017/dem.2019.22
Berlemann M, Steinhardt MF (2017) Climate change, natural disasters, and migration-a survey of the empirical evidence. CESifo Econ Stud 63(4):353–385. https://doi.org/10.1093/cesifo/ifx019
Cattaneo C, Beine M, Fröhlich CJ et al (2019) Human migration in the era of climate change. Review of Environmental Economics and Policy 13(2):189–206. https://doi.org/10.1093/reep/rez008
Millock K (2015) Migration and environment. Annu Rev Resour Econ 7(1):35–60. https://doi.org/10.1146/annurev-resource-100814-125031
Black R, Arnell NW, Adger WN et al (2013) Migration, immobility and displacement outcomes following extreme events. Environ Sci Pol 27:32–43. https://doi.org/10.1016/j.envsci.2012.09.001
Bertoli S, Docquier F, Rapoport H et al (2021) Weather shocks and migration intentions in Western Africa: insights from a multilevel analysis. J Econ Geogr https://academic.oup.com/joeg/advance-article-pdf/doi/10.1093/jeg/lbab043/41299221/lbab043.pdf
Gallup (2015) Worldwide research methodology and codebook
Harris I, Osborn TJ, Jones P et al (2020) Version 4 of the CRU TS monthly high-resolution gridded multivariate climate dataset. Sci Data 7(1):1–18. https://doi.org/10.1038/s41597-020-0453-3
Dell M, Jones BF, Olken BA (2014) What do we learn from the weather? The new climate-economy literature. J Econ Lit 52(3):740–98. https://doi.org/10.1257/jel.52.3.740
Tjaden J, Auer D, Laczko F (2019) Linking migration intentions with flows: evidence and potential use. Int Migr 57(1):36–57. https://doi.org/10.1111/imig.12502
Vicente-Serrano SM, Beguería S, López-Moreno JI (2010) A multiscalar drought index sensitive to global warming: the standardized precipitation evapotranspiration index. J Clim 23(7):1696–1718. https://doi.org/10.1175/2009JCLI2909.1
McFadden D et al (1973) Conditional logit analysis of qualitative choice behavior. University of California, Institute of Urban and Regional Development
StataCorp L et al (2007) Stata data analysis and statistical software. Special Edition Release 10:733
Branco P, Torgo L, Ribeiro RP (2016) A survey of predictive modeling on imbalanced domains. ACM Comput Surv 49(2). https://doi.org/10.1145/2907070
Provost F, Fawcett T (2013) Data science for business: what you need to know about data mining and data-analytic thinking. O’Reilly Media, Inc
Athey S, Imbens GW (2019) Machine learning methods that economists should know about. Annu Rev Econom 11(1):685–725. https://doi.org/10.1146/annurev-economics-080217-053433
Mullainathan S, Spiess J (2017) Machine learning: an applied econometric approach. J Econ Perspect 31(2):87–106. https://doi.org/10.1257/jep.31.2.87
Athey S (2018) The impact of machine learning on economics. University of Chicago Press, pp 507–547. https://doi.org/10.7208/chicago/9780226613475.001.0001
Hastie T, Tibshirani R, Friedman JH (2009) The elements of statistical learning: data mining, inference, and prediction, 2nd Edn. Springer Series in Statistics, Springer. https://doi.org/10.1007/978-0-387-84858-7
Breiman L (2001) Random forests. Mach Learn 45(1):5–32. https://doi.org/10.1023/A:1010933404324
Chen T, Guestrin C (2016) Xgboost: a scalable tree boosting system. In: Krishnapuram B, Shah M, Smola AJ et al (eds) Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016. ACM, pp 785–794. https://doi.org/10.1145/2939672.2939785
Snoek J, Rippel O, Swersky K et al (2015) Scalable Bayesian optimization using deep neural networks. In: Bach FR, Blei DM (eds) Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015, JMLR Workshop and Conference Proceedings, vol 37. JMLR.org, pp 2171–2180. https://dl.acm.org/doi/10.5555/3045118.3045349
Fawcett T (2006) An introduction to ROC analysis. Pattern Recogn Lett 27(8):861–874. https://doi.org/10.1016/j.patrec.2005.10.010
Swets JA (1988) Measuring the accuracy of diagnostic systems. Science 240(4857):1285–1293. https://doi.org/10.1126/science.3287615
Breiman L, Friedman J, Stone C et al (1984) Classification and regression trees. The Wadsworth and Brooks-Cole statistics-probability series. Taylor & Francis. https://doi.org/10.1201/9781315139470
Cattaneo C, Peri G (2016) The migration response to increasing temperatures. J Dev Econ 122:127–146. https://doi.org/10.1016/j.jdeveco.2016.05.004
Duan L, Street WN, Liu Y et al (2014) Selecting the right correlation measure for binary data. ACM Trans Knowl Discov Data (TKDD) 9(2):1–28. https://doi.org/10.1145/2637484
Vicente-Serrano SM, Beguería S, López-Moreno JI et al (2010) A new global 0.5 gridded dataset (1901–2006) of a multiscalar drought index: comparison with current drought index datasets based on the palmer drought severity index. J Hydrometeorol 11(4):1033–1043. https://doi.org/10.1175/2010JHM1224.1
Eslamian S (2014) Handbook of engineering hydrology: environmental hydrology and water management. CRC Press. https://doi.org/10.1201/b16766
Wilhite DA, Svoboda MD (2000) Drought early warning systems in the context of drought preparedness and mitigation. Early warning systems for drought preparedness and drought management. Geneva: World Meteorological Organization, pp 1–21
Quinlan JR (1986) Induction of decision trees. Mach Learn 1:81–106. https://doi.org/10.1007/BF00116251
Djamba YK (2003) Gender differences in motivations and intentions to move: Ethiopia and South Africa compared. Genus 59(2):93–111
Athey S (2015) Machine learning and causal inference for policy evaluation. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp 5–6. https://doi.org/10.1145/2783258.2785466
Athey S, Tibshirani J, Wager S et al (2019) Generalized random forests. Ann Stat 47(2):1148–1178. https://doi.org/10.1214/18-AOS1709
Mayda A (2010) International migration: a panel data analysis of the determinants of bilateral flows. J Popul Econ 23:1249–1274. https://doi.org/10.1007/s00148-009-0251-x
Fisher A, Rudin C, Dominici F (2018) Model class reliance: variable importance measures for any machine learning model class, from the Rashomon perspective, vol 68. Preprint at http://arxiv.org/abs/1801.01489
Acknowledgements
We thank Professor Frédéric Docquier for his reviews, suggestions, and earlier discussions of the project. The authors also thank Dr. Shari De Baets for her feedback and comments.
Funding
This research is supported by the ARC Convention on “New approaches to understanding and modeling global migration trends” (convention 18/23-091). This project has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No. 754412. Open access funding provided by the University of Skövde.
Author information
Authors and Affiliations
Corresponding authors
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1 Machine Learning Approaches
1.1 Data Preprocessing
We use the sample dataset in Table 4 as an illustrative example. This dataset has four features: age, household size (“hhsize”), having human network abroad (“mabr”), and the intensity of the drought (“drought”); and one target attribute representing the migration intention (“move”).
The first step is data preprocessing. It cleans up the data by handling missing values and scale- or type-related problems. A scale-related issue occurs when variables are expressed on different scales, for example, year (e.g., [2010, 2016]) and age (e.g., [0, 100]). This problem can bias the ML models’ output and cause implementation inefficiency.
Two types of variables, numerical and categorical, may need preprocessing. Categorical variables contain labels instead of numerical values. Many ML algorithms only support numerical variables, often for the sake of implementation efficiency. Hence, it is recommended to convert categorical variables into numerical ones using one-hot encoding.
Definition
(One-hot encoding) It consists of creating new binary variables for the unique labels in the categorical variable.
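As an illustration, here is a minimal pure-Python sketch of one-hot encoding; the feature name ("mabr") and its labels are taken from the sample dataset, but the values are illustrative.

```python
def one_hot(values):
    """Turn a list of labels into a dict of binary columns, one per unique label."""
    labels = sorted(set(values))
    return {f"is_{lab}": [1 if v == lab else 0 for v in values]
            for lab in labels}

# "mabr" = having a human network abroad (labels illustrative)
mabr = ["yes", "no", "no", "yes"]
encoded = one_hot(mabr)
# encoded["is_yes"] == [1, 0, 0, 1] and encoded["is_no"] == [0, 1, 1, 0]
```

In practice a library routine (e.g., pandas' `get_dummies`) does the same job; the point is that each unique label becomes its own binary column.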
It is well known that numerical inputs on different scales can bias the model output. We overcome this problem by binarizing the numerical variables.
Definition
(Data binarization) It comprises transforming a numerical variable into several binary variables. The binarization workflow is in two steps: (i) split the numerical variable into intervals and create a categorical variable by labeling each range. Then, (ii) use the one-hot encoding method to create the binary variables.
Example
Figures 12 and 13 show examples of binarization and one-hot encoding for the age and SPEI12 variables.
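The two-step binarization workflow can be sketched in pure Python; the age cut-offs below are hypothetical, not necessarily the intervals used in the paper.

```python
def binarize(values, bins):
    """bins: list of (label, lo, hi) half-open intervals [lo, hi).
    Step (i): map each value to its interval label.
    Step (ii): one-hot encode the interval labels."""
    labeled = []
    for v in values:
        for label, lo, hi in bins:
            if lo <= v < hi:
                labeled.append(label)
                break
    labels = sorted({lab for lab, _, _ in bins})
    return {lab: [1 if x == lab else 0 for x in labeled] for lab in labels}

# Hypothetical age intervals
age_bins = [("young", 0, 30), ("middle", 30, 60), ("old", 60, 200)]
cols = binarize([22, 45, 71, 29], age_bins)
# cols["young"] == [1, 0, 0, 1] and cols["old"] == [0, 0, 1, 0]
```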
Generalization is an essential concept in ML. It refers to the ability of a model to correctly classify examples it has not seen. To assess it, the dataset is split into training and test sets during the preprocessing step.
Definition
(Training set and test set) The training set is the part of the dataset used to train the model, and the test set is the held-out part used to test it. Typically, 60 to 90% of the dataset is assigned to the training set and the rest to the test set.
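A minimal sketch of a random train-test split, using an 80% training share (one choice within the 60–90% range mentioned above):

```python
import random

def train_test_split(rows, train_frac=0.8, seed=42):
    """Shuffle row indices, then cut off the first train_frac as the training set."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(len(rows) * train_frac)
    train = [rows[i] for i in idx[:cut]]
    test = [rows[i] for i in idx[cut:]]
    return train, test

rows = list(range(10))
train, test = train_test_split(rows)
# len(train) == 8, len(test) == 2, and together they cover all rows
```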
To have a noise-free and robust model that generalizes well, the training and test sets are extracted iteratively from the dataset. This resampling procedure is called the cross-validation process.
Definition
(Cross-Validation) The cross-validation process consists of randomly splitting the dataset into K roughly equal samples \(S_1, S_2, \cdots , S_K\). Based on these samples, K folds are created, each containing a training and a testing set. At the ith fold, the samples \(S_1, S_2, \cdots , S_K\), excluding \(S_i\), are merged into the training set, and sample \(S_i\) is used as the testing set.
Example
Figure 14 shows an example of the second fold.
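The fold construction above can be sketched as follows (in practice the data are shuffled first; here the indices are kept in order for readability):

```python
def kfold(n, k):
    """Split indices 0..n-1 into K roughly equal samples, then build K
    (train, test) folds: fold i holds out sample S_i for testing."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    samples, start = [], 0
    for s in sizes:
        samples.append(list(range(start, start + s)))
        start += s
    folds = []
    for i in range(k):
        test = samples[i]
        train = [j for p, sample in enumerate(samples) if p != i for j in sample]
        folds.append((train, test))
    return folds

folds = kfold(10, 5)
# fold 2 (index 1) holds out indices [2, 3] for testing
```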
1.2 Tree-Based Approaches
The decision tree method approximates the learning function f using decision trees.
Definition
(Decision tree) A decision tree represents a set of conditions that leads to the classification of instances. Each path from the root to a leaf represents a classification rule.
Example
Figure 15 is an example of a decision tree using the sample dataset.
Decision tree algorithms classify instances by sorting them from the root down to a leaf, which provides the classification. Each node represents a test on a feature, and each branch corresponds to a possible value of that feature.
Example
In the tree in Fig. 15, age is the root node. This node has three branches (young, middle, and old) representing the age values. The leftmost leaf of the tree represents all instances where individuals are young and the drought is harsh; these people have a moving intention (move).
A decision tree is built by selecting the variable at each node that gives the best data split. This split is based on the measure of the impurity rate (obtained by calculating, for example, the entropy or the Gini index) of each variable. The best variable is the one with the lowest impurity rate. Typically, this measure favors splits that allow having the dominant or (strongly) discriminative label over the target attribute.
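The split selection described above can be sketched with the Gini index; the branch counts below are illustrative, not taken from the paper's data.

```python
def gini(counts):
    """Gini impurity of a node given per-class instance counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(splits):
    """splits: list of (n_yes, n_no) per branch; lower means a better split."""
    n = sum(a + b for a, b in splits)
    return sum((a + b) / n * gini((a, b)) for a, b in splits)

# Hypothetical candidate split on "drought":
# harsh branch (3 move, 1 stay), mild branch (1 move, 3 stay)
score = weighted_gini([(3, 1), (1, 3)])
# gini((3, 1)) = 1 - (9/16 + 1/16) = 0.375 per branch, so score = 0.375
```

The variable whose split yields the lowest weighted impurity is chosen at each node.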
It is possible to represent a decision tree as a linear function [17]. This is closer to the way that social scientists represent a model. To do so, we represent each leaf of the tree as a variable (feature) of the linear model. This variable is the product of decisions from the root to the leaf. This model thus contains as many variables as there are leaves in the tree. These variables show how decision trees take into account the nonlinearity of the problem automatically.
Example
Let \(L_1, L_2,\cdots ,L_5\) be the variables of the linear model. These variables represent the leaves of the tree in Fig. 15 (from left to right of the tree). The leftmost leaf variable \(L_1\) is equal to \(L_1 = 1_{\text {age = young}\wedge \text {drought = harsh}}\). The variables \(L_3\) and \(L_5\) are equal to \(L_3 = 1_{\text {age = old}}\), and \(L_5 = 1_{\text {age = middle} \wedge \text {mabr = no}}\). Accordingly, the outcome (y) follows \(y = \beta _1 L_1 + \beta _2 L_2 + \cdots + \beta _5 L_5\), where each coefficient \(\beta _k\) is the prediction attached to leaf k.
As the example shows, building and using decision trees (DT) is straightforward and explainable. In practice, however, they can be inaccurate [19]. Thus, several other tree-based methods have been proposed; Random Forest (RF) [20] and eXtreme Gradient Boosting (XGB) [21] are well known and widely used.
Definition
(Random Forest) Random forest consists of several decision trees that operate together as an ensemble. This ensemble of trees is called a forest. Each tree classifies an instance in the forest, and the class label of this instance is decided by a majority vote. Each tree is built on a randomly selected (with replacement) sample of the dataset and a random number of features.
Example
With the DT example in Fig. 15, instance 1 from our sample dataset is classified with the class label “Yes” (i.e., the individual with instance number 1 has an intention to move). With an RF that contains five trees, we classify this instance with each tree and take the majority class label. Assuming the predictions are {Yes, Yes, Yes, No, No}, RF classifies this instance as Yes.
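The majority vote in this example can be sketched in two lines:

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common class label among the per-tree predictions."""
    return Counter(predictions).most_common(1)[0][0]

# The five hypothetical per-tree predictions from the example above
label = majority_vote(["Yes", "Yes", "Yes", "No", "No"])
# label == "Yes"
```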
Random forest gives the predictions of all trees the same weight. By contrast, XGB does not make this assumption and dynamically assigns a weight to each tree and instance. At each step of the forest construction, a new tree is added to correct the errors made by the existing trees.
When constructing a decision tree, one may wonder how deep it should grow to achieve a better classifier. For a forest, how many trees are needed, and how many features should be selected? In ML, these parameters are typically determined by trying several parameter sets, a process called parameter tuning. In this paper, we use Bayesian Hyperparameter Optimization (BHO) [22].
Definition
(Bayesian Hyperparameter Optimization) It consists of testing the models on several parameter sets and associating each set with a probability of yielding the best performance. A Bayesian (i.e., probabilistic) model is then used to select the most promising parameters.
1.3 Performance Evaluation
In supervised learning, models are evaluated by making one-on-one comparisons between the predicted outcome (\(\hat{y}\)) and the real outcome (y). This is a benefit of ML over parameter estimation, where the estimation is usually based on the assumptions made from the data-generating process to ensure consistency [17].
For comparison, in ML, we typically build a confusion matrix.
Definition
(Confusion matrix) A confusion matrix compares the predicted values to the ground truth. It contains four values: true positive (actual observation “Yes” and predicted “Yes”), false positive (actual observation “No” but predicted “Yes,” a false alarm), true negative (actual observation “No” and predicted “No”), and false negative (actual observation “Yes” but predicted “No”).
Example
Figure 16 shows the predicted move intention using the decision tree (DT) and the confusion matrix comparing these predictions to the observed (actual) move intention.
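The four counts can be computed as follows; the label lists below are illustrative, not the paper's predictions.

```python
def confusion(actual, predicted, positive="Yes"):
    """Return (tp, fp, tn, fn) counts for a binary classification."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    return tp, fp, tn, fn

actual    = ["Yes", "No", "Yes", "No", "Yes"]
predicted = ["Yes", "Yes", "No", "No", "Yes"]
tp, fp, tn, fn = confusion(actual, predicted)
# tp == 2, fp == 1, tn == 1, fn == 1
```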
Based on the confusion matrix, various performance metrics can be computed. The common ones are accuracy, precision, and recall.
Definition
(Accuracy - Precision - Recall) Accuracy is the ratio of correctly predicted observations to the total number of observations. It is an intuitive measure, but it is reliable only when the counts of false positives and false negatives are similar. Precision, instead, is the ratio of correctly predicted positive observations to all predicted positive observations, while recall is the ratio of correctly predicted positive observations to all actual positive observations. The formulas are given in Fig. 16 with the confusion matrix. These measures take values between 0 and 1 (the higher, the better the performance).
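As a sketch, the three metrics follow directly from the confusion-matrix counts (the counts below are illustrative):

```python
def accuracy(tp, fp, tn, fn):
    """Share of all observations predicted correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of actual positives that were predicted positive."""
    return tp / (tp + fn)

tp, fp, tn, fn = 2, 1, 1, 1
# accuracy = 3/5 = 0.6, precision = 2/3, recall = 2/3
```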
Predicted class labels typically involve a user-defined threshold (e.g., 0.5). By convention, a predicted probability less than or equal to the threshold is considered a “No,” and otherwise a “Yes.” Different thresholds lead to different predictions. The Area Under the ROC (Receiver Operating Characteristic) curve (AUC) [23, 24] is another performance metric, used to evaluate performance independently of any classification threshold.
Definition
(ROC and AUC) A ROC curve, a two-dimensional graph, is generated by plotting the false-positive fraction (x-axis) against the true-positive fraction (y-axis) of a model for each possible threshold value. The ROC curve shows how well a model classifies binary outcomes. The AUC (Area under the curve), as its name implies, is the area under the ROC curve. Typically, it is computed when a single value is needed to summarize a model’s performance to undertake comparisons. The AUC value is also between 0 and 1 (the higher, the better performance).
Example
Figure 16 illustrates the ROC curve and the AUC of a decision tree (DT). The AUC of this classifier is 0.89 (i.e., the classifier performs well).
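The AUC can also be computed directly through its rank interpretation: the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one (ties counting one half). A minimal sketch with illustrative scores:

```python
def auc(scores, labels, positive="Yes"):
    """AUC via pairwise comparison of positive vs. negative scores."""
    pos = [s for s, l in zip(scores, labels) if l == positive]
    neg = [s for s, l in zip(scores, labels) if l != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = ["Yes", "Yes", "No", "Yes", "No"]
# 3 positives x 2 negatives = 6 pairs; 5 of 6 are ranked correctly -> AUC = 5/6
```

This pairwise formulation is equivalent to the area under the ROC curve and avoids picking any threshold.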
In this paper, we mainly use AUC and precision to determine which method to focus on.
1.4 Output Interpretation: Feature Importance and Partial Dependence Plots (PDP)
The features X used to estimate f in the equation \(f(X) = y\) are rarely equally relevant. Typically, only a small subset of the features is relevant. Hence, after training the model, the Relative Feature Importance (RFI) method is used to determine the relevant features. RFI was introduced by [25] for tree-based learning methods.
Definition
(RFI) RFI consists of (i) computing, for each internal node of a tree T, the contribution of each feature to the prediction, (ii) summing these contributions for each feature, and (iii) ranking the features accordingly.
To calculate the importance \(I_j\) of the feature at node j in a decision tree (A2), the following elements are needed: the numbers of “Yes” (\(w_j^{Yes}\)) and “No” (\(w_j^{No}\)) instances at node j, the total number of instances there (\(w_j = w_j^{Yes} + w_j^{No}\)), the impurity of node j (e.g., the Gini index \(c_j = 1 - \sum _{i \in \{\text {Yes, No}\}} (w_j^{i}/w_j)^2\)), and the importance of node j (\(n_j = w_j c_j - \sum _{k \in \{\text {children of }j\}}w_k c_k \)).
Example
Figure 17 shows how we compute the elements needed to obtain the importance of the feature age, which results in 0.977 using (A2).
In a single decision tree, the most important feature is clearly the one at the root node. In a forest, (A2) is generalized by averaging the per-tree importances over all T trees: \(I_j^{RF} = \frac{1}{T}\sum _{t=1}^{T} I_j^{(t)}\).
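As an illustration, a minimal sketch of the Gini-based node importance for a single split; the parent/children counts are hypothetical and the normalization in (A2) may differ.

```python
def gini(n_yes, n_no):
    """Gini impurity of a node with n_yes and n_no instances."""
    w = n_yes + n_no
    return 1.0 - (n_yes / w) ** 2 - (n_no / w) ** 2

def node_importance(parent, children):
    """Weighted impurity decrease: n_j = w_j c_j - sum_k w_k c_k.
    parent: (n_yes, n_no); children: list of (n_yes, n_no)."""
    w = sum(parent)
    return w * gini(*parent) - sum(sum(c) * gini(*c) for c in children)

# Hypothetical root node: 8 instances (4 move / 4 stay) split into two pure children
imp = node_importance((4, 4), [(4, 0), (0, 4)])
# parent impurity 0.5 on 8 instances; children are pure, so importance = 4.0
```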
RFI has become widespread and is used for other ML methods as well. To understand how these important features influence the outcome y, one uses Partial Dependence Plots [19, Chap. 14].
Definition
(Partial Dependence) Assume the features \(X = X_1, X_2,\cdots , X_p\), indexed by \(P = \{1, 2,\cdots ,p\}\). Let S and its complement R be subsets of P, i.e., \(S,R \subset P \wedge S\cup R = P \wedge S\cap R = \emptyset \). Assuming that \(f(X) = f(X_S, X_R)\), the partial dependence of f(X) on the features \(X_S\) is \(PD_{S}(X_S) = \mathbb {E}_{X_R}\!\left[ f(X_S, X_R)\right] \approx \frac{1}{N}\sum _{i=1}^{N} f(X_S, x_{iR})\) (A5).
This is a marginal average of f describing the effect of the chosen feature set S on f. It is approximated by averaging, over the N instances of the training set (X), the predictions obtained when the complementary features \(X_R\) take their observed values \(x_{iR}\).
The computation of (A5) requires a pass over the data for each set of joint values of \(X_S\). This can be computationally intensive, so partial dependence is usually not calculated for more than three features. Fortunately, partial dependence on a single feature is often informative enough, and a discrete feature further simplifies the calculation. In practice, for a discrete feature with two class labels “yes” and “no,” we only compute \(PD_{S}(X_S = yes)\) and \(PD_{S}(X_S = no)\).
Example
Figure 18 shows how we compute the partial dependence in DT (Fig. 15) on a feature “mabr” (human network abroad).
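The single-feature case can be sketched as follows: force the feature to each value in every training instance and average the model's predictions. The `predict` function and data below are toy stand-ins, not the paper's model.

```python
def predict(x):
    # Toy model: a network abroad raises the predicted move probability
    return 0.7 if x["mabr"] == "yes" else 0.2

def partial_dependence(data, feature, value, model):
    """Average prediction when `feature` is forced to `value` in all instances."""
    forced = [{**row, feature: value} for row in data]
    return sum(model(row) for row in forced) / len(forced)

data = [{"mabr": "yes"}, {"mabr": "no"}, {"mabr": "no"}, {"mabr": "yes"}]
pd_yes = partial_dependence(data, "mabr", "yes", predict)
pd_no = partial_dependence(data, "mabr", "no", predict)
# pd_yes ~ 0.7 and pd_no ~ 0.2: the plot would show "yes" driving the outcome up
```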
From the different values used to calculate the partial dependence, we can draw a chart with the tested values on the x-axis against the partial dependence output on the y-axis. The plot shows in which direction (towards the label “Yes” or “No”) each feature value drives the outcome y, visualizing the effect of a feature relative to the average effect of the other features.
Appendix 2 Additional Figures
Figure 19 shows male and age as the top influencing features according to permutation feature importance, similar to the results from the Relative Feature Importance (RFI) method. We also observe that the international move is more affected by longer SPEIs (e.g., 18, 24) while the general move is affected by shorter SPEIs (e.g., 2, 3, 12), which aligns with our previous findings. The darker box plots show the uncertainty across permutations. Permutation feature importance measures the increase in a model’s prediction error after a feature’s values are permuted; the permutation breaks the relationship between the feature and the true outcome. A feature is considered “important” if permuting its values increases the model error, since this means the model relies on that feature for prediction. Fisher et al. [36] proposed “model reliance” measures and a model-agnostic permutation feature importance algorithm.
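The permutation procedure just described can be sketched as follows; the model, data, and feature name are toy stand-ins, not the paper's classifiers.

```python
import random

def error_rate(model, X, y):
    """Share of instances the model misclassifies."""
    return sum(model(row) != label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, feature, seed=0):
    """Error increase after shuffling one feature column (breaking its
    link to the outcome); larger values indicate a more important feature."""
    base = error_rate(model, X, y)
    shuffled = [row[feature] for row in X]
    random.Random(seed).shuffle(shuffled)
    X_perm = [{**row, feature: v} for row, v in zip(X, shuffled)]
    return error_rate(model, X_perm, y) - base

def model(row):
    # Toy stand-in classifier that relies entirely on the "mabr" feature
    return "Yes" if row["mabr"] == "yes" else "No"

X = [{"mabr": "yes"}, {"mabr": "no"}, {"mabr": "yes"}, {"mabr": "no"}]
y = ["Yes", "No", "Yes", "No"]
delta = permutation_importance(model, X, y, "mabr")
# delta >= 0: with a base error of 0, shuffling the feature cannot lower the error
```

Repeating the shuffle with different seeds yields the distribution of importances visualized by the box plots in Fig. 19.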
Figure 20 shows the feature importance distributions of the six countries targeting international move over the seven SPEI timescales (i.e., 1, 2, 3, 6, 12, 18, 24) and 49 lags (i.e., 0–48).
Appendix 3 Terminology Comparison
Table 5 compares the common terminology used in social sciences and machine learning.
Appendix 4 GWP Questions
Table 6 describes the World Poll questions used to measure the opinions of the interviewees.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Aoga, J.O.R., Bae, J., Veljanoska, S. et al. Impact of Weather Factors on Migration Intention Using Machine Learning Algorithms. Oper. Res. Forum 5, 8 (2024). https://doi.org/10.1007/s43069-023-00271-y