
Applying machine learning to social datasets: a study of migration in southwestern Bangladesh using random forests

Abstract

As researchers collect large amounts of data in the social sciences through household surveys, challenges may arise in how best to analyze such datasets, especially where motivating theories are unclear or conflicting. New analytical methods may be necessary to extract information from these datasets. Machine learning techniques are promising methods for identifying patterns in large datasets, but have not yet been widely used to identify important variables in social surveys with many questions. To demonstrate the potential of machine learning to analyze large social datasets, we apply machine learning techniques to the study of migration in Bangladesh. The complexity of migration decisions makes them suitable for analysis with machine learning techniques, which enable pattern identification in large datasets with many covariates. In this paper, we apply random forest methods to analyzing a large survey which captures approximately 2000 variables from approximately 1700 households in southwestern Bangladesh. Our analysis ranked the covariates in the dataset in terms of their predictive power for migration decisions. The results identified the most important covariates, but there exists a tradeoff between predictive ability and interpretability. To address this tradeoff, random forests and other machine learning algorithms may be especially useful in combination with more traditional regression methods. To develop insights into how the important variables identified by the random forest algorithm impact migration, we performed a survival analysis of household time to first migration. With this combined analysis, we found that variables related to wealth and household composition are important predictors of migration. Such multi-methods approaches may help to shed light on factors contributing to migration and non-migration.

Introduction

The complexity of processes influencing human migration poses a challenge for researchers who aim to study the interactions between environmental changes and migration (McLeman 2013). Migration in the form of a decision to move or stay is often influenced by a combination of political, social, economic, and environmental drivers, and the dynamics of this combination are unclear and difficult to quantify (Adams and Kay 2019; Black et al. 2011). Environmental migration, which focuses on the influence of environmental conditions and changes on migration, is especially complex, so prediction may be highly uncertain (Gemenne 2011). This uncertainty is further exacerbated by the uncertainty related to future climate and socioeconomic scenarios (Hugo 2011) and the difficulty of predicting human decision-making (Subrahmanian and Kumar 2017). Yet, accurate modeling and prediction of environmental migration are critical for informing future policy and adaptation strategies, especially as the impacts of climate change continue to increase (Stern 2006; Piguet 2022; Hugo 1996; Biermann and Boas 2010; Black et al. 2011; Boas et al. 2019; Ahsan et al. 2011).

Questions remain about how best to model environmental migration and how to obtain appropriate and accurate data to test these models (Neumann and Hilderink 2015). Current work studying environmental migration uses a wide range of methods and models, from strictly conceptual models (Perch-Nielsen et al. 2008; Renaud et al. 2011) to logistic regression (Koubi et al. 2016), multi-variate regression (Hino et al. 2017), other forms of regression modeling (Henry et al. 2003, 2004), and agent-based models (Cai and Oppenheimer 2013; Hassani-Mahmooei and Parris 2012; Kniveton et al. 2011; Silveira et al. 2006; Smith 2014; Klabunde et al. 2015; Thober et al. 2018). Identifying appropriate data sources is an additional challenge to studies of environmental migration, and there is no agreement about what data are best (Tejero et al. 2020). For example, Fussell et al. (2014) advocate for using a combination of population censuses, surveys, and multi-level modeling. Recently, Lu et al. (2016) utilized mobile phone data from more than six million anonymous phone users in Bangladesh to track movement across short time scales. Household surveys have been a common source of data for migration research (Bilsborrow and Henry 2012), and some researchers claim survey data are the most appropriate level for obtaining information about the causes of migration (Neumann and Hilderink 2015).

Several reviews of existing methods and challenges call for the exploration of new methods that can improve prediction and better address nonlinearities in environmental migration (Neumann and Hilderink 2015; Obokata et al. 2014; Piguet 2010). As Obokata et al. (2014) suggest, existing quantitative methods of studying environmental migration often simplify complex variables and limit the number of variables studied. The emergent theory of voluntary non-migration, or the decision to remain in place, further complicates the conceptual understanding of how environmental stress may increase or dampen migration (Adams 2016; Mallick and Schanze 2020). Because of this complexity, researchers who study migration will often use expert judgement or theory to select which variables to assess. Though this approach can be useful to test theoretically motivated hypotheses and provide insights into how specific drivers might impact migration decisions, it does little to identify which variables might be the most important in driving decisions, especially when considering nonlinear interactions among variables.

As researchers continue to collect large amounts of data with household surveys, challenges may arise in how best to analyze such datasets, especially where motivating theories are unclear or conflicting. To advance the study of environmental migration and non-migration, especially as large datasets and surveys become more readily available, new methods will need to be employed (Neumann and Hilderink 2015). This work aims to address this need by applying machine learning, specifically random forests, to social survey data for the study of environmental migration in Bangladesh. Random forest is a machine learning approach that has been shown to perform well in environmental and ecological contexts (Cutler et al. 2007; Prasad et al. 2006). However, reviews of methodologies used in studying environmental migration did not mention machine learning techniques (Piguet 2010), and to our knowledge, our application of random forest methods to the topic of environmental migration is novel.

In this work, we present machine learning as a potential tool for social scientists studying environmental migration and non-migration, and we describe a case study in which we used random forests to determine the importance of each covariate in a large dataset for predicting migration outcomes. Though random forest models are able to identify correlates of migration, there exists a tradeoff between high predictive ability and low interpretability. To address this tradeoff, random forests and other complex machine learning algorithms may be especially useful in combination with more traditional, simpler methods. We conduct a survival analysis of household time to first migration using a subset of important variables identified by the random forest algorithm, which provides deeper insight into how important variables impact migration. This multi-methods approach of random forest models and survival analysis provides a data-driven method for identifying and further investigating key variables that impact migration from social datasets.

Machine learning

Machine learning, broadly, refers to a variety of methods that enable a computer or “machine” to automatically recognize patterns in data and use these patterns to build and refine a statistical model of the data without being explicitly programmed to do so and without theoretical or phenomenological preconceptions about the causal mechanisms that gave rise to the data. Machine learning methods are often categorized as supervised or unsupervised. Supervised methods are used to predict one or more specified dependent variables. Unsupervised methods are used to identify patterns in the data (Jordan and Mitchell 2015). To give examples from common statistical methods, regression analyses are supervised methods and exploratory factor analyses are unsupervised methods. In order to guard against overfitting, machine learning models are trained using a subset of the complete data, known as the training set, while the remaining data, known as the holdout or testing set, is withheld and used for validating the model’s performance after the model is fully trained.

Machine learning techniques can outperform standard regression analysis in predictive ability, especially when studying complex social problems (Hindman 2015). Recently, there has been discussion of broadly incorporating machine learning into the social sciences, especially in the place of traditional regression analysis (Hindman 2015; Mason et al. 2014). However, some machine learning algorithms can be very difficult to interpret due to their complexity, and this complexity makes it difficult to assess how well a machine learning model is likely to apply outside the specific context in which the data was gathered (Buolamwini and Gebru 2018). While a traditional regression results in coefficients that can be easily interpreted, a more complex machine learning model may be a “black box,” making it difficult to draw insights from the model. As the complexity of the model increases, interpretability may decrease, representing a tradeoff between model performance and interpretability (Fig. 1).

Fig. 1

Schematic demonstrating the tradeoff between complexity and interpretability of common machine learning algorithms. For example, ensemble tree-based methods such as random forests are highly complex and sometimes challenging to interpret. Researchers should consider where a method falls on this continuum along with specific research goals when selecting an appropriate algorithm

Where the predictive power of the model is a priority, complex machine learning algorithms may perform very well. Yet, they are a less appropriate tool for theory development or testing specific hypotheses. The greater predictive power that complex models often possess may arise from models reflecting details of the context in which the data set being analyzed was collected, and the models may not transfer as well to other contexts as simpler or theory-driven models would. When the complexity of a model impedes interpretation, it can be difficult to draw on theory or other domain knowledge of the context to evaluate the applicability of a machine learning model to different contexts. Therefore, it is especially important for researchers to carefully consider the goals of their research when selecting a machine learning algorithm, as there is no one-size-fits-all approach.

Nevertheless, machine learning can complement more traditional theory-driven approaches and may have advantages, especially where theory is unclear. Machine learning should be incorporated into social scientists’ toolkits for studying migration because of its ability to identify patterns in complex datasets. We demonstrate one such case study where machine learning, specifically random forest models, is useful in identifying salient variables in a large, complex social survey dataset from Bangladesh.

Case study: migration in Bangladesh

Bangladeshi context

Bangladesh is a country located on the floodplain of the Ganges–Brahmaputra–Jamuna Delta, one of the largest river deltas in the world (Passalacqua et al. 2013). Bangladesh faces environmental vulnerabilities such as flooding and waterlogging, cyclones, and rapid river erosion and accretion (Dewan et al. 2007; Hallegatte 2013; Higgins et al. 2014; Islam and Sado 2000; McGranahan et al. 2007). Bangladesh is also considered one of the most vulnerable countries to climate change (Black et al. 2008; Walsham 2010). Climate change is expected to create additional environmental stress and uncertainty in the future (Ackerly et al. 2015; Auerbach et al. 2015; Benneyworth et al. 2016; Brammer 2014; Nicholls et al. 2007, 2008; Tessler et al. 2015; Xu et al. 2009).

In Bangladesh, migration is common as a method of livelihood diversification and adaptation (Alam et al. 2017; Bryan et al. 2014; Amrith 2013; Black et al. 2005; Martin et al. 2014). Because of the combined complexity of the human and natural systems, it is unclear how patterns of migration are influenced by environmental change and may be influenced in the future. To begin to address these uncertainties, environmental migration has been widely studied in Bangladesh (Afsar 2003; Ahsan et al. 2011; Bell et al. 2021; Call et al. 2017; Carrico and Donato 2019; Chen and Mueller 2018; Donato et al. 2016; Gray and Mueller 2012; Islam 2017; Joarder and Miller 2013). Some studies in Bangladesh have focused on the impacts of extreme weather events such as cyclones or floods on migration (Kartiki 2011; Gray and Mueller 2012; Lu et al. 2016; Mallick and Vogt 2014). For example, Mallick and Vogt assess “disaster-induced population displacement” in the context of the 2009 cyclone Aila in Bangladesh (Mallick and Vogt 2014). They found that male household members tended to migrate towards cities to access livelihood opportunities after the cessation of emergency aid (Mallick and Vogt 2014). In contrast, Gray and Mueller (2012) found that flood events did not influence migration and that crop loss did, but in a complex manner: if crop loss affected only a small number of households in a community, out-migration would decline, whereas if crop loss affected many households, out-migration would rise, with higher status and more affluent households more likely to migrate than lower status and less affluent ones.

Other research has considered slower onset environmental change such as salinity encroachment, temperature change, and changes in precipitation (Call et al. 2017; Carrico and Donato 2019; Chen and Mueller 2018; Perch-Nielsen et al. 2008). Call et al. (2017) studied the impacts of temperature, precipitation, and flooding on temporary migration in a non-coastal area, Matlab, Bangladesh. Their work showed that temporary migration declines immediately after a flood, but quickly recovers, while high temperatures consistently increase temporary migration, and precipitation has a strongly nonlinear effect on migration rates (Call et al. 2017). This work supports other research that has indicated that environmental stress could decrease migration and limit the effectiveness of migration as an adaptation strategy (Adger et al. 2015; Gray and Mueller 2012; Bennett et al. 2011). In a more recent study, Carrico and Donato (2019) find that prolonged dry periods, warm periods, and increases in precipitation in Bangladesh may increase migration, especially for households with agricultural livelihoods.

Even within the literature on environmental migration in Bangladesh, there is disagreement in terms of the potential of migration to be a positive adaptation strategy to environmental stress. Though temporary migration is common in Bangladeshi communities, some authors have asserted that permanent migration due to environmental stress may be a last resort for households whose environment becomes inhospitable, potentially suggesting that voluntary non-migration may be influencing such communities’ decision-making (Penning-Rowsell et al. 2013).

Data

Household survey data used in this analysis was collected in the southwest region of Bangladesh by the Bangladesh Environment and Migration Survey (BEMS) in 2014. The survey contains migration, employment, and livelihood histories on more than 3000 individuals affiliated with 1695 households, which were randomly sampled from nine sites. The survey specifically asks for histories of migration within Bangladesh, to India, and to any other country (Donato et al. 2016). Here, we focus only on each household’s reported migrations internal to Bangladesh. The original dataset consists of 1695 observations of 1997 distinct variables.

The survey asks respondents to recall the total number of migrations that any member of the household has made, without attributing an underlying motivation. This provides the total number of migration trips per household, normalized by total person-years. Person-years were calculated for each member of the household, beginning at age 11, which is the age at which many Bangladeshis begin migrating for livelihood opportunities, until 2014, when the survey was conducted (Donato et al. 2016). Our analysis takes as its dependent variable this number of trips per person-year, which may be interpreted as the annual probability of making a migration. This is represented as a continuous variable at the household level.
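The construction of this dependent variable can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the example household are hypothetical, and the exact convention for counting the year in which a member turns 11 is an assumption rather than something the survey documentation specifies here.

```python
SURVEY_YEAR = 2014  # year of the BEMS survey (from the text)
START_AGE = 11      # age at which person-years begin to accrue (from the text)

def person_years(birth_year, survey_year=SURVEY_YEAR, start_age=START_AGE):
    """Years between a member's 11th birthday year and the survey year.

    Assumption: members younger than the starting age contribute zero.
    """
    return max(0, survey_year - (birth_year + start_age))

def trips_per_person_year(birth_years, total_trips):
    """Household migration rate: total trips divided by total person-years."""
    total = sum(person_years(b) for b in birth_years)
    if total == 0:
        return None  # no member has reached the starting age yet
    return total_trips / total

# Hypothetical household: members born in 1970, 1975, and 2000; 6 reported trips.
rate = trips_per_person_year([1970, 1975, 2000], total_trips=6)
print(rate)  # 6 trips over 33 + 28 + 3 = 64 person-years
```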

Random forest models

Random forest models are an ensemble method of decision trees and represent a subset of machine learning known as tree-based methods. Tree-based methods, including random forests, can be used for the classification of discrete outcome variables, or regression of continuous variables. They are especially powerful tools when there are strong nonlinearities or interactions between variables in the data.

Random forest models work by fitting many decision trees, where each tree uses a random subset of the predictor variables at each split in its decision tree. The final prediction is then calculated by averaging across the outputs of all of the individual decision trees (Hastie et al. 2009, Ch. 15). This allows random forest models to achieve high predictive accuracy without overfitting (James et al. 2013). One strength of random forest models, especially over other “black box” statistical models, is their ability to assess variable importance and account for complex, nonlinear interactions between variables. Random forest models are also able to use combinations of categorical, ordinal, and continuously valued variables as inputs without requiring dummy variables or scaled data. This makes them especially appealing tools for analyzing large social surveys and studying complex challenges such as migration. However, it can be difficult to interpret a random forest model. The ensemble of trees, each with a different subset of predictor variables, makes it impractical, if not impossible, to establish a clear or descriptive relationship between independent and dependent variables. Thus, while these models are powerful, they are very much black boxes in comparison to the ways we can understand and interpret regression or single-tree models. Random forest models were chosen for this analysis because we found in previous work that they outperformed linear regressions and support vector machines in predicting the migration outcome for this data set (Best et al. 2020).
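To make the mechanics concrete, the following is a toy regression forest written in plain Python. It is a minimal sketch, not the randomForest implementation used in this study: the function names, tree depth, and synthetic data are all hypothetical. It also draws a bootstrap sample for each tree, which is standard for random forests but is an addition to the description above.

```python
import random

def sse(ys):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((v - m) ** 2 for v in ys)

def best_split(X, y, features):
    """Best (feature, threshold) among the offered features, or None."""
    best, best_score = None, sse(y)
    for f in features:
        for t in sorted({row[f] for row in X})[:-1]:  # keeps both children non-empty
            left = [v for row, v in zip(X, y) if row[f] <= t]
            right = [v for row, v in zip(X, y) if row[f] > t]
            score = sse(left) + sse(right)
            if score < best_score - 1e-12:
                best, best_score = (f, t), score
    return best

def grow_tree(X, y, mtry, rng, depth=0, max_depth=4, min_size=5):
    """Grow a regression tree; every split sees only `mtry` random features."""
    if depth >= max_depth or len(y) < min_size:
        return sum(y) / len(y)  # leaf: predict the mean outcome
    split = best_split(X, y, rng.sample(range(len(X[0])), mtry))
    if split is None:
        return sum(y) / len(y)
    f, t = split
    go_left = [row[f] <= t for row in X]
    left = grow_tree([r for r, g in zip(X, go_left) if g],
                     [v for v, g in zip(y, go_left) if g],
                     mtry, rng, depth + 1, max_depth, min_size)
    right = grow_tree([r for r, g in zip(X, go_left) if not g],
                      [v for v, g in zip(y, go_left) if not g],
                      mtry, rng, depth + 1, max_depth, min_size)
    return (f, t, left, right)

def predict_tree(node, row):
    """Follow the splits until a leaf (a plain number) is reached."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def fit_forest(X, y, n_trees=20, mtry=2, seed=0):
    """Fit each tree to a bootstrap sample of the rows."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(y)) for _ in y]
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], mtry, rng))
    return forest

def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return sum(predict_tree(tree, row) for tree in forest) / len(forest)

# Synthetic data: the outcome depends non-monotonically on feature 0 only.
data_rng = random.Random(1)
X = [[data_rng.uniform(0, 10) for _ in range(3)] for _ in range(120)]
y = [1.0 if 3 <= row[0] <= 6 else 0.0 for row in X]
forest = fit_forest(X, y)
inside = predict_forest(forest, [4.5, 5.0, 5.0])
outside = predict_forest(forest, [9.0, 5.0, 5.0])
print(inside, outside)
```

In this synthetic example the outcome is high only in the middle of the range of the first feature, a pattern that a model with a single linear term could not represent but that the averaged trees recover.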

Random forests allow us to rank variables by their importance (i.e., their contribution to the overall model performance) (Hastie et al. 2009). For regression random forest models, importance is calculated by node impurity, which is a calculation of how much a split in the decision trees by a specific variable can decrease variance in the outcome (Hastie et al. 2009). Variable importance cannot be meaningfully compared across different datasets or different models, but is useful for comparing the significance of variables within a specific model trained on a specific dataset.
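The quantity behind this importance measure can be illustrated for a single split. The sketch below, with hypothetical function names and made-up outcome values, computes the decrease in the residual sum of squares produced by a split; in a fitted forest, such decreases would be accumulated over every split that uses a given variable and averaged over trees.

```python
def sse(ys):
    """Residual sum of squares around the node mean (the node impurity)."""
    m = sum(ys) / len(ys)
    return sum((v - m) ** 2 for v in ys)

def impurity_decrease(parent, left, right):
    """How much a split reduces impurity relative to the unsplit node."""
    return sse(parent) - (sse(left) + sse(right))

# Made-up outcomes at one node.
parent = [0, 0, 1, 1, 5, 5, 6, 6]

# An informative split separates low outcomes from high outcomes.
informative = impurity_decrease(parent, [0, 0, 1, 1], [5, 5, 6, 6])

# An uninformative split leaves both children as mixed as the parent.
uninformative = impurity_decrease(parent, [0, 1, 5, 6], [0, 1, 5, 6])

print(informative, uninformative)
```

A variable that repeatedly produces splits like the first one accumulates a large importance score, while a variable that only produces splits like the second contributes almost nothing.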

We fit 10 models to the survey data. For each model, we divided the data, randomly assigning 80% of the household responses to a training set and the remaining 20% to a testing data set. We used the randomForest package in R to fit a complete random forest regression model to each of the 10 training data sets and evaluated its out-of-sample predictive performance on the corresponding testing data set (Cutler et al. 2018). The regression models predicted the continuous outcome variable of total internal migration trips per household normalized by person-years (Donato et al. 2016). For each of the 10 models, the parameter for the number of variables randomly sampled at each split was tuned by minimizing the out-of-sample error using the tuneRF function in the randomForest package. After tuning, 10 complete models were fitted using the optimum tuning parameters. Variable importance was ranked and averaged across the 10 complete models, and each model’s predictive performance was assessed using its testing data set (the 20% of data not used for training). A full explanation of the random forest methods and results, including model performance metrics, can be found in Best et al. (2020).
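The data-handling part of this procedure, the repeated 80/20 split and the averaging of importance ranks, can be sketched as follows. The model fitting itself is omitted; the helper names, the variable names, and the importance scores are hypothetical placeholders, not results from the study.

```python
import random

def train_test_split(n, train_frac=0.8, seed=0):
    """Randomly assign row indices to a training and a testing set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(n * train_frac))
    return idx[:cut], idx[cut:]

def average_ranks(importances):
    """Average each variable's importance rank over several fitted models.

    `importances` is a list of dicts {variable: score}, one per model;
    rank 1 is the most important variable in that model.
    """
    totals = {}
    for scores in importances:
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, var in enumerate(ordered, start=1):
            totals[var] = totals.get(var, 0) + rank
    return {var: total / len(importances) for var, total in totals.items()}

# One of the repeated splits for a survey of 1695 households.
train, test = train_test_split(1695, seed=1)

# Placeholder importance scores from three hypothetical model fits.
avg = average_ranks([
    {"var_a": 3.0, "var_b": 2.0, "var_c": 1.0},
    {"var_a": 2.5, "var_b": 3.0, "var_c": 0.5},
    {"var_a": 4.0, "var_b": 1.0, "var_c": 2.0},
])
print(len(train), len(test), avg)
```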

To further validate the ranked variable importance from our random forest models beyond Best et al. (2020), we divided the complete survey dataset into five groups, each consisting of 20% of the data, and conducted a fivefold cross-validation where each fold chose a different one of the five subsets of data to use as the validation set for a random forest model fit to the other four subsets. We then compared the predictive performance of a model using all the variables and another random forest model using just the top 15 variables identified from the training set when fit to the holdout validation set. We found that, across the five models in this fivefold cross-validation, the top 10 variables of importance were consistent, and there was a slight movement in the bottom five variables between models. We found that models using all the variables had a mean RMSE of 2.65, while the models using just the identified top 15 variables had an average RMSE of 2.67. This is consistent with our understanding that the top 15 variables are robust and account for almost all of the model performance across different subsets of the data.
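A minimal sketch of the fold construction and the error metric is given below. It shows only how the five validation sets partition the data and how RMSE is computed; the function names are hypothetical and no model is fit.

```python
import math
import random

def k_fold_indices(n, k=5, seed=0):
    """Return k (training, validation) index pairs that partition n rows."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    splits = []
    for i, validation in enumerate(folds):
        training = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((training, validation))
    return splits

def rmse(observed, predicted):
    """Root mean squared error between observed and predicted outcomes."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

splits = k_fold_indices(1695, k=5)
for training, validation in splits:
    print(len(training), len(validation))
```

In the comparison described above, one model using all variables and one using only the top 15 would each be fit to `training` and scored with `rmse` on `validation`, once per fold.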

Survival analysis

Survival analysis is a technique used to study the occurrence of a discrete event where the time until the event matters (Harrell 2015). The response variable in survival models is time until the event, usually referred to as failure time, survival time, or event time. Survival analysis has been widely used in biomedical research to describe times to a disease event (Bull and Spiegelhalter 1997; Crowley and Hu 1977; Prentice et al. 1981), failure or recovery times in engineering systems (Ansell and Philipps 1997; Barker and Baroud 2014), and binary events in demography and the social sciences, including the timing of a woman’s first child (Teachman 1983) and when people make a first migrant trip (Donato et al. 1992). Survival analysis also allows for some responses to be incomplete, meaning that the event of interest has not occurred within the observed time. Such responses are censored, and responses for which the event of interest did occur within the study time are uncensored.

For the survival analysis, time in person-years from age 11 to first internal migration by the head of the household was used. This generated a discrete-time person-year file that followed the male head of the household. The age of 11 was chosen as the starting point because this is the age at which many Bangladeshi males begin engaging in paid work. For each year from age 11 to the date of the survey, each male head of household received a 1 if they did complete a trip and a 0 if they did not complete a trip. In this way, the individual migration data was divided into censored and uncensored data for a survival model, as some heads have not completed their first migration by the time of the data collection. Only a small minority of heads of household had ever migrated, so 17.3% of the data was uncensored and the remaining 82.7% was censored.
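The person-year file described here can be illustrated with a short sketch. The function and field names are hypothetical, and the example heads of household are invented; the sketch also assumes that any recorded first trip occurs at or after age 11.

```python
def person_year_rows(birth_year, first_trip_year=None,
                     survey_year=2014, start_age=11):
    """Build one row per year from age 11 until first trip or the survey.

    A head with no trip by the survey year is censored: every row has
    migrated = 0. Otherwise the final row has migrated = 1.
    """
    start = birth_year + start_age
    end = first_trip_year if first_trip_year is not None else survey_year
    rows = []
    for year in range(start, end + 1):
        rows.append({
            "year": year,
            "age": year - birth_year,
            "migrated": 1 if year == first_trip_year else 0,
        })
    return rows

# Hypothetical head born in 1990 who first migrated in 2005 (uncensored).
uncensored = person_year_rows(1990, first_trip_year=2005)

# Hypothetical head born in 1995 who had not migrated by 2014 (censored).
censored = person_year_rows(1995)

print(len(uncensored), uncensored[-1])
print(len(censored), censored[-1])
```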

We used Cox proportional hazards models to estimate the survival and hazard functions corresponding to the probability of internal migration and to assess the relative effects of the different covariates (Ansell and Philipps 1997; Harrell 2015). The Cox model is a semi-parametric proportional hazards model: the baseline hazard is left unspecified, while the regression portion of the model is parametric and assumes that covariates are linearly related to the log of the hazard. This approach is ideal when data is not easily fit to a distribution and when the form of the true hazard function is complex. It is also a useful approach when the key question of concern is how covariates impact the hazard, rather than the shape of the hazard itself (Harrell 2015).
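For a single covariate x, the Cox model takes the form h(t | x) = h0(t) exp(beta x), and beta is estimated by maximizing a partial likelihood that does not involve the baseline hazard h0(t). The sketch below evaluates that partial log-likelihood for one covariate. It is an illustration only: the function name and the five observations are invented, ties in event times are handled in the simplest possible way, and a real analysis would rely on an established survival analysis package.

```python
import math

def cox_partial_loglik(beta, times, events, x):
    """Partial log-likelihood of a one-covariate Cox model.

    times  : observed time for each subject (event or censoring time)
    events : 1 if the event occurred at that time, 0 if censored
    x      : covariate value for each subject
    """
    loglik = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored subjects only appear in risk sets
        # Risk set: everyone still under observation at time t_i.
        risk = sum(math.exp(beta * x[j])
                   for j in range(len(times)) if times[j] >= t_i)
        loglik += beta * x[i] - math.log(risk)
    return loglik

# Invented data: subjects with x = 1 tend to experience the event earlier.
times = [2, 4, 5, 7, 9]
events = [1, 0, 1, 1, 0]
x = [1, 0, 1, 0, 0]

ll_null = cox_partial_loglik(0.0, times, events, x)
ll_one = cox_partial_loglik(1.0, times, events, x)
print(ll_null, ll_one)
```

Because the invented subjects with x = 1 migrate earlier, a positive beta fits better than beta = 0, which is the sense in which the fitted coefficient summarizes the direction of a covariate's effect.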

However, Cox models, like most proportional hazards models, can only represent monotonic relationships between covariates and hazard, whereas tree-based models, such as random forests, can represent arbitrarily complex nonlinear and non-monotonic relationships. Thus, if a covariate identified by the random forest models has a non-monotonic relationship to migration, a Cox model will perform poorly with that covariate.

Results

Salient variables from random forest models

The fitted random forest models provided a rank order of variable importance, which was averaged across all 10 models. The results of the variable importance assessment from the random forest model of the survey data have been highlighted in previous work (Best et al. 2020) and are given in Table 1. The 15 most important variables are presented in order of descending model importance. The range of variable importance rank across the 10 models is available in Supplementary Materials (Figure S1).

Table 1 Variables of importance identified by random forest model of migration and original survey questions

Survival analysis

Univariate Cox proportional hazards models were fit for each of the salient variables in Table 1 identified by the random forest models. For each univariate model, the estimated value of the coefficient “Beta” and the estimated hazard ratio (HR) and 95% confidence interval boundaries are presented (Table 2). In Table 2, we also present the concordance statistic, a measure of predictive ability for survival analysis that gives the proportion of pairs of observations in which predictions and outcomes agree (Harrell et al. 1996). While concordance is a common measure of predictive ability in survival analysis, we also present the R² value and the p-value as commonly employed and widely understood measures of model performance (Table 2).
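The concordance statistic can be computed directly from its definition, as in the sketch below. The function name, the risk scores, and the five observations are invented for illustration; counting tied risk scores as half-concordant is a common convention and an assumption here.

```python
def concordance(times, events, risk_scores):
    """Proportion of comparable pairs ordered correctly by the risk score.

    A pair (i, j) is comparable when subject i had the event and did so
    before subject j's observed time. It is concordant when the subject
    with the earlier event also has the higher predicted risk.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # assumption: ties count as half
    return concordant / comparable

times = [2, 4, 5, 7, 9]
events = [1, 0, 1, 1, 0]
risk = [0.9, 0.2, 0.8, 0.1, 0.3]

c_index = concordance(times, events, risk)
c_reversed = concordance(times, events, [-r for r in risk])
print(c_index, c_reversed)
```

A value of 0.5 corresponds to predictions no better than chance and 1.0 to perfectly ordered predictions.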

Table 2 Results of univariate Cox proportional hazards models with each salient variable identified by the random forest models. For each univariate model, the fitted coefficient Beta is presented, along with the hazard ratio (HR) and 95% confidence intervals for HR, the generalized R², concordance statistic, and p-value

The hazard ratio describes how a covariate impacts the hazard (whether it has a positive or negative effect) (Harrell 2015). The hazard ratio for a covariate is calculated by computing the ratio of the hazard for that covariate over the baseline hazard. Therefore, a hazard ratio of 1 indicates that the covariate has no effect on the hazard. A hazard ratio less than 1 means that the covariate reduces the hazard of an event, and a hazard ratio greater than 1 means that the covariate increases the hazard from the baseline.

While we do not employ an arbitrary p-value significance threshold, the variables “Latitude,” “Longitude,” and “Who owns water source” have large p-values that are greater than 0.2 and orders of magnitude greater than the p-values for all other variables. This led us to conclude that these are not useful predictors, so we excluded them from the analysis going forward. Furthermore, the uncertainties in regression coefficients for the variables related to the most recent cyclone, female toilet, and water source were very large, with 95% confidence intervals that include the hazard ratio of 1. This means that we cannot be confident that these variables affect the survival function, so we excluded them from the continued analysis.
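The relationship between a fitted coefficient, its hazard ratio, and the confidence interval check described above can be sketched as follows. The coefficient and standard error values are hypothetical and are not taken from Table 2; the interval uses the usual normal approximation on the log-hazard scale.

```python
import math

def hazard_ratio(beta, se, z=1.96):
    """Hazard ratio exp(beta) with an approximate 95% confidence interval."""
    return (math.exp(beta),
            math.exp(beta - z * se),
            math.exp(beta + z * se))

def ci_includes_one(beta, se, z=1.96):
    """True when the interval is compatible with no effect on the hazard."""
    _, lower, upper = hazard_ratio(beta, se, z)
    return lower <= 1.0 <= upper

# Hypothetical covariate that lowers the hazard, estimated fairly precisely.
hr, lo, hi = hazard_ratio(-0.5, 0.2)
print(round(hr, 3), round(lo, 3), round(hi, 3), ci_includes_one(-0.5, 0.2))

# Hypothetical covariate whose interval spans 1, so its direction is unclear.
print(ci_includes_one(0.1, 0.2))
```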

Next, a series of nested Cox proportional hazards models were developed with the remaining variables by starting with a univariate model and systematically adding an additional significant covariate to the model (Table 3). The hazard ratios for the covariates of the complete model are given in Fig. 2.

Table 3 Nested Cox proportional hazards models of increasing complexity, generalized R², and concordance

Fig. 2

Hazard ratios for the final Cox proportional hazards model. A hazard ratio greater than 1 (to the right of the dashed line) indicates that the variable increases mobility, while a hazard ratio less than 1 (to the left of the dashed line) indicates that the variable decreases mobility

Discussion

Random forests applied to migration

Variables of importance were identified using random forest models to identify patterns in the data (Table 1, Best et al. 2020). No researcher judgement or selection of variables from the large survey dataset was required. This work demonstrates that random forest models can help researchers identify salient variables from large social surveys when studying migration. This is especially useful when dealing with large, complex datasets from social surveys, where it can be challenging to decide which variables are worthwhile for further investigation. In this work, random forest models were able to identify the most important predictors of migration from an original set of approximately 2000 total predictors. Thus, the random forest served as a method of variable reduction which allowed us to conduct our regression analysis with fewer variables and more degrees of freedom.

Variable impact on migration

While random forest models can tell researchers which variables are the most important for predicting the migration outcome, they do little to provide insight into how specific variables impact migration. To dig deeper into the variables identified by the random forest models, survival analysis was implemented, which further illuminates how salient variables related to location, livelihood, and family structure might impact a household’s risk of internal migration in coastal Bangladeshi communities. The univariate Cox proportional hazards models outlined in Table 2 demonstrate that the number of members in a household, the year a business is started, whether or not the household owns a refrigerator, and whether or not the household owns a gas cooker were significant. Latitude, longitude, and variables related to the most recent cyclone did not contribute significantly to the hazard function or reflected too much uncertainty to be reliable.

It is especially surprising that latitude and longitude were not significant covariates given that they were the first and fourth most important variables identified by the previous work using random forest algorithms. It was thought that latitude especially would be significant, because there is a clear gradient of increasing soil salinity from north to south in Bangladesh, and previous studies have suggested that soil salinity is important for driving migration in Bangladesh (Chen and Mueller 2018). It is possible that the random forest algorithm is able to identify nonlinear and non-monotonic patterns in the latitude and longitude data, whereas the Cox proportional hazards model assumes that each predictor has a monotonic, multiplicative effect on the hazard. For example, the random forest algorithm would be able to identify geographic clusters of migration and the Cox proportional hazards model would not.

The best-performing nested model was the complete model, which included year business was started, refrigerator ownership, gas cooker ownership, total members in the household, union, and prepared meal consumption by household head. This final model had a generalized R² value of 0.069 and a concordance of 0.657 (Table 3). It is possible that this value of R² is so low because, again, the covariates are unlikely to follow the simple multiplicative relationship assumed by the Cox proportional hazards model.

Despite the low R² and concordance values, the multivariate Cox proportional hazards model is useful in beginning to understand how these variables influence the underlying risk of migrating. The hazard ratios shown in Fig. 2 quantify these impacts. Hazard ratios to the left of the dotted line in the figure indicate variables that have a negative impact on the overall risk of migration; that is, these variables decrease the underlying hazard. They include total household members, not owning a refrigerator, and not owning a gas or kerosene cooker. Hazard ratios that fall to the right of the dotted line in Fig. 2 indicate variables that have a positive impact on migration, meaning they increase the underlying hazard of migration. These variables are the number of non-workers in the household and prepared meal consumption by the spouse of the household head.
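
The relationship between a fitted Cox coefficient and the hazard ratios discussed above can be sketched as follows. This assumes the dotted reference line in the figure sits at a hazard ratio of 1, as is conventional. The coefficient values and covariate names are invented placeholders, not estimates from the paper.

```python
import math


def hazard_ratio(beta):
    """Hazard ratio for a one-unit increase in a covariate: exp(beta)."""
    return math.exp(beta)


def percent_change_in_hazard(beta):
    """Percent change in the hazard for a one-unit increase in the covariate."""
    return (math.exp(beta) - 1.0) * 100.0


def direction(beta):
    """Describe which side of the HR = 1 reference line a coefficient falls on."""
    hr = hazard_ratio(beta)
    if hr < 1.0:
        return "decreases hazard (left of HR = 1)"
    if hr > 1.0:
        return "increases hazard (right of HR = 1)"
    return "no effect (HR = 1)"


# Placeholder coefficients for illustration only (hypothetical values).
illustrative_betas = {"covariate_a": -0.30, "covariate_b": 0.25}

for name, beta in illustrative_betas.items():
    print(
        f"{name}: HR = {hazard_ratio(beta):.2f}, "
        f"{percent_change_in_hazard(beta):+.0f}% hazard, {direction(beta)}"
    )
```

A negative coefficient therefore corresponds to a hazard ratio below 1 and a lower risk of migration at any given time, while a positive coefficient corresponds to a hazard ratio above 1.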

These results show that the total number of members of a household has a negative impact on migration, while the number of non-workers in the household seems to increase migration by the household head. It is possible that this is in part due to the importance of remittances that migratory members of a household can send home to support their families (Massey 1990). A household with a higher number of non-workers to support may be more dependent on remittances from a migratory head of household. However, it seems that larger households may also create an anchoring effect that keeps the head of household from migrating, perhaps because migrating from the household, even temporarily, would leave the household more vulnerable and economically stressed. Such results could also indicate that having additional household members increases attachment to place and voluntary non-migration (Adams 2016; Mallick and Schanze 2020). This suggests that household size has a complex and possibly non-monotonic effect on probabilities of migration, which reflects household livelihood capacity as well as vulnerability. This also supports existing literature that suggests that migration decisions may be primarily made at the household level (Massey et al. 1993).

Implications for migration / non-migration research

As several researchers have noted, the field of environmental migration has been growing over time, as have the methods employed (Piguet 2022). Despite advancements in the field, there remain important and unanswered questions related to how environmental or climatic change interacts with mobility and immobility (i.e., migration versus non-migration) decisions (Mallick and Schanze 2020). In addition, how do migration and non-migration decisions vary across individuals, households, and contexts? Just as the drivers of environmental migration are acknowledged to be complex and interconnected, the drivers of environmental non-migration (both voluntary and involuntary) must be similarly studied in detail by the field (Mallick and Schanze 2020).

This work, which combines survey data, random forest algorithms, and survival analysis to investigate migration in rural Bangladeshi communities, has several important implications for migration and non-migration research. First, we provide specific insights into drivers of migration and non-migration in Bangladesh. We show that indicators of lower economic resources (not owning a refrigerator and not owning a gas or kerosene cooker) work to reduce mobility, suggesting that much of the non-mobility in our study location may be involuntary and driven by a household’s inability to afford to move. Similarly, number of non-workers in the household increases mobility, which supports the idea that much mobility in the area is primarily motivated by the desire to seek livelihood opportunities outside of the origin community (Bernzen et al. 2019; Biswas et al. 2019).

More broadly, by using multiple machine learning methods in combination, we provide an example of how survey data can be used to provide insights into (non-)migration when the relevant underlying theory is unknown or unclear. Mallick and Schanze (2020) propose that migration and non-migration may be considered on a spectrum of aspirations and capabilities. However, how these aspirations and capabilities may be operationalized in data remains unclear. The methods demonstrated here can be used to identify important variables from existing datasets and then quantitatively show how those variables amplify or dampen mobility. These methods may be applied to different datasets and contexts and would yield context-specific insights into which factors influence (non-)migration.

Conclusion

Machine learning methods can be useful tools for researchers to study environmental migration when theory is not clearly established, as is the case with the emergent theory of voluntary non-migration. Though the specific machine learning algorithm used will vary based on research objectives and data used, this work applies random forest models to a household survey of migration in Bangladesh in order to identify salient variables. An important downside to random forest models is that despite quantifying variable importance, they do not provide insights into how the individual predictors relate to the outcome variable (e.g., does increasing the predictor variable increase or decrease the outcome variable?). Therefore, where theory testing or development is the goal, complex machine learning algorithms such as random forest models may not be useful in isolation. Instead, researchers may use machine learning to direct additional analysis using more traditional regression analysis or, as in this case, survival analysis. This multi-methods analysis provides insights into migration dynamics, but it does not begin to accurately quantify migration risks. Assumptions of linearity in the survival analysis contribute to low predictive power, further demonstrating the strengths and weaknesses of different algorithms and methods.

Future work should continue to develop modeling methods that are able to capture the complex relationship between the many factors that contribute to migration or non-migration decisions. In this process of improving methods, it is likely that no one method will be a clear solution to existing challenges, but methods that draw from the best available computer science methods will likely be important (Neumann and Hilderink 2015; Obokata et al. 2014). In this way, researchers should remain open to investigating new techniques that may be useful, such as more complex machine learning algorithms.

Funding

This work was supported by the National Science Foundation Coupled Human-Natural Systems Grant No. 1716909.

Author information

Corresponding author

Correspondence to Kelsea Best.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the Topical Collection on Environmental Non-Migration: Frameworks, Methods, and Cases

Communicated by Robbert Biesbroek and accepted by Topical Collection Chief Editor Christopher Reyer.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Best, K., Gilligan, J., Baroud, H. et al. Applying machine learning to social datasets: a study of migration in southwestern Bangladesh using random forests. Reg Environ Change 22, 52 (2022). https://doi.org/10.1007/s10113-022-01915-1

  • DOI: https://doi.org/10.1007/s10113-022-01915-1

Keywords

  • Random forests
  • Machine learning
  • Human migration
  • Bangladesh