1 Introduction

Migration has become one of the most salient issues confronting policymakers around the world. The historic adoption of the Global Compact for Safe, Orderly and Regular Migration (GCM)—the first-ever intergovernmental agreement on international migration—and the Global Compact for Refugees in December 2018 and the inclusion of migration-related targets in the 2030 Agenda for Sustainable Development are a clear testament to this. These frameworks have also provided a renewed push to calls from the international community to improve migration statistics globally. The first of the 23 objectives of the GCM is about improving data for evidence-based policy and a more informed public discourse about migration. As a matter of fact, many countries still struggle to report basic facts and figures about migration, which limits their ability to make informed policy decisions and communicate those to the public, but also limits the ability of researchers to contribute to the production of evidence and knowledge on migration.

Migration is a complex phenomenon to measure. Population changes generally happen slowly as fertility and mortality tend to impact population dynamics gradually. However, a country’s population structure might change more rapidly due to migration (Billari, 2022). Migration, and in particular international migration, has become increasingly important in shaping population change, especially in higher-income countries, where fertility is decreasing (Bijak, 2010). The study of migration is affected by many challenges (i.e. availability of data, measurement problems, harmonisation of definitions) (Bilsborrow et al., 1997). Above all, there is a lack of timely and comprehensive data about migrants, combined with the varying measures and definitions of migration used by different countries, which are barriers to accurately estimating international migration (Bijak, 2010; Willekens, 1994, 2019). Despite the best efforts of many researchers and official statistics offices, international migration estimates lack quality due to the limited data available in many countries (Kupiszewska & Nowok, 2008; Poulain et al., 2006; Zlotnik, 1987). Migration is a topic widely discussed in several research fields including demography (Lee, 1966), sociology (Petersen, 1958), political science (Boswell et al., 2011), and economics (Kennan & Walker, 2011). Insufficient availability of quality data on migration can have a high social and political impact, because these inaccuracies might limit the capacity to take evidence-based decisions.

The main data sources used to measure migration are censuses, administrative records, and household surveys, collectively referred to as ‘traditional data sources’. These data sources have limitations related to the definition of migrants (i.e. the discrepancy between the internationally recommended definition and the definitions applied in each country), coverage of the entire migrant population, and the quality of the estimates (especially for administrative records) (Azose & Raftery, 2019; Willekens, 2019). Moreover, traditional data on migration are not promptly and regularly available. There might be a gap of several months or even years between the time the data are collected and the time statistics are released to the public. Timely and granular migration data are needed not only for research purposes but also for informed policy and programmatic decisions related to migration. In times of global crisis, such as the COVID-19 pandemic or the Russian invasion of Ukraine, the need for accurate and timely data becomes particularly urgent, but the capacity to collect data from traditional sources can be significantly reduced (Stielike, 2022).

In the last 25 years, the world has experienced a data revolution (Kashyap, 2021). New data created by human digital interactions have increased dramatically in volume, speed, and availability. The data revolution came not only with the advent of new data sources but also with increased computational power. This, in turn, helped to create more sophisticated models to study social phenomena such as migration. New ‘ready-made’ data from digital sources, commonly referred to as ‘digital trace data’ (Salganik, 2019), have started to be repurposed to answer social science questions.

Cesare et al. (2018) addressed the challenges faced by social scientists when using digital traces. One of the main challenges is related to bias and non-representativeness, as users of social media platforms, for instance, are not representative of the broader population and might not necessarily reveal their true opinions or personal details. Correspondingly, understanding how to measure the bias of these online non-representative sources is critical to infer demographic trends for the wider population (Zagheni & Weber, 2015). Once the biases are quantified, one possible next step is to combine different data sources to extract more information and enhance the existing data. This is an ongoing process in which social scientists have started to combine survey data with digital traces, originally created for marketing, and to repurpose them for scientific research (Alexander et al., 2020; Gendronneau et al., 2019; Rampazzo et al., 2021; Zagheni et al., 2017). The idea of repurposing data is not new to the social sciences (Billari & Zagheni, 2017; Sutherland, 1963; Zagheni & Weber, 2015). For example, John Graunt’s first Life Table (1662) was in fact a reworking of public health data from the Bills of Mortality to infer the size of the population of London at the time (Sutherland, 1963).

New data sources are a gold mine for migration studies because they offer an opportunity to address the lack of information which hinders this field of research. Digital traces (especially social media data) are quick to collect using, for example, Twitter’s or Facebook’s application programming interface (API)Footnote 1 (for a comprehensive overview of digital trace data for migration and mobility, see Bosco et al., 2022). This makes it possible to know, in close to real time, how many users are in a specific location and have recently changed their country of residence or are foreign-born, contributing to ‘nowcasting’ migration (i.e. monitoring trends almost in real time). However, digital traces are not always available to academics and practitioners, as they are mostly owned by businesses and may not be fully and publicly accessible.

This chapter has two objectives. First, it aims to bring examples of how new data sources and methodologies have been used for studying migration and migrant characteristics. Second, it highlights advantages, limitations, and challenges of digital trace data in migration research.

2 New Data in Migration Research

As a statistical concept, international migration has been historically characterised by five building blocks:Footnote 2 (i) legal nationality, (ii) residence, (iii) place of birth, (iv) time, and (v) purpose of stay (Zlotnik, 1987). As these blocks are complexly entwined with each other, statistical systems use one or a combination of them to gather data on international migrants. The United Nations recommends a definition of international migration which explicitly focuses on residence and time (UN, 1998), defining a migrant as a ‘person who moves from their country of usual residence for a period of at least 12 months’. Migrants that stay between 3 and 12 months are considered to be short-term migrants. The intended purpose of the UN’s definition of international migrants is to harmonise data sources worldwide. However, current definitions of migrants vary between countries. While they all depend on the time of stay outside of the country of usual residence, definitions applied at the national level differ (i.e. ‘minimum duration of stay in the destination country required for the change of residence in the origin country’ Kupiszewska and Nowok, 2008, p. 58) (Kupiszewska & Nowok, 2008; Willekens, 1994).

It has been suggested that digital traces can help refine migration theory and modelling. Fiorio et al. (2017) and Fiorio et al. (2021) highlight the potential of using geotagged Twitter data to investigate short-term mobility and long-term migration. Indeed, the definition of an international migrant has become tied up with the increase in the number of individuals living transnational lives (Carling et al., 2021). Digital trace data might help broaden or qualify the distinction between short-term and long-term migrants, adding nuances. However, we need to consider that digital trace data do not follow the same definition as traditional data sources. For example, on Twitter, migrants can be identified through changes in their location over a period of time, while Facebook provides on their Advertising Marketing Platform a variable that can be used to characterise migrants. The Facebook variable is defined as ‘People that used to live in country x and now live in country y’ (Rampazzo et al., 2021), which refers to the concept of residence and usage of the social media. The Facebook migrant definition does not account for the time aspect, which creates problems when comparing official migration statistics and Facebook estimates. In Zagheni et al. (2017), the description of the Facebook migrant variable was ‘Expat from country x’, which highlights that the definition behind this variable may be subject to change.
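
As an illustration of the location-change approach described above for Twitter, the following Python sketch infers a yearly country of residence from geotagged posts and flags a change between consecutive observed years. It is a minimal sketch and not the procedure of any study cited here: the function names, the calendar-year window, and the most-frequent-country rule are assumptions made for illustration, and the sample posts are invented.

```python
from collections import Counter, defaultdict
from datetime import date


def yearly_residence(posts):
    """Map each calendar year to the country seen most often in that year's posts."""
    by_year = defaultdict(Counter)
    for day, country in posts:
        by_year[day.year][country] += 1
    return {
        year: counts.most_common(1)[0][0]
        for year, counts in sorted(by_year.items())
    }


def detect_moves(posts):
    """Return (year, origin, destination) whenever the inferred residence
    differs between two consecutive observed years."""
    residence = yearly_residence(posts)
    years = list(residence)
    return [
        (curr, residence[prev], residence[curr])
        for prev, curr in zip(years, years[1:])
        if residence[prev] != residence[curr]
    ]


# Invented user: posts mostly from Italy in 2019, mostly from Germany in 2020.
posts = [
    (date(2019, 2, 1), "IT"), (date(2019, 6, 9), "IT"), (date(2019, 8, 15), "FR"),
    (date(2020, 3, 3), "DE"), (date(2020, 7, 21), "DE"), (date(2020, 12, 24), "IT"),
]
moves = detect_moves(posts)
```

A calendar-year window loosely mirrors the 12-month criterion of the UN definition, whereas shorter windows would pick up short-term mobility instead. With only a small share of tweets carrying a geo-location, many users would have too few posts per window for such a rule to be reliable.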

The information on the categorisation of migrant users on social media is limited. In the case of Facebook, the evidence comes from internal and external research. Migrant users might be identified not only through self-declared public information (e.g. ‘hometown’) but also through inferred information based on their use of the social media (e.g. user’s IP address) (US SEC Commision, 2018, 2019, 2020). Spyratos et al. (2018) conducted a survey of 114 Facebook users asking them to check whether they were classified by the Facebook Advertising Platform as migrants. The majority of the non-representative sample was classified correctly as an ‘expat’ despite not having self-reported country of birth or of previous residence on Facebook. Moreover, Facebook’s researchers reported using ‘hometown’ as a feature for characterising migrants (Herdağdelen et al., 2016). On Twitter, research studies typically identify migrants through geo-tagging. However, the number of geo-tagged tweets is limited: only 2–3% of tweets are provided with a geo-location (Halford et al., 2018; Leetaru et al., 2013). Fake and duplicate accounts might also be a challenge when studying migrants on social media. For Facebook, the percentages of fake and duplicate accounts are reported every year in US Securities and Exchange Commission documents and are stable at 11% duplicate accounts and 5% fake accounts (US SEC Commision, 2018, 2019, 2020). In addition, possible changes to the algorithms behind the measures provided may affect the continuity of data from these sources. As a case in point, previous work (Palotti et al., 2020; Rampazzo et al., 2021) identified discontinuities in the Facebook data in March 2019, leading to a drop in the global estimates of the number of migrants active on the platform.

Although migrants are not clearly defined in digital trace data, stock estimates of migrant populations seem to be proportionally comparable to traditional data estimates. Zagheni et al. (2017) showed that Facebook Advertising data and American Community Survey data are highly correlated. Moreover, Facebook Advertising data have proved to be faster in capturing out-migration from Puerto Rico in the aftermath of Hurricane Maria. Alexander et al. (2020) show how Facebook Advertising data made it possible to produce monthly estimates of the relocation of Puerto Ricans to the mainland USA, and of subsequent return migration, which traditional data sources were not able to register. The same result is supported by the use of Twitter data (Martín et al., 2020), as well as by monthly Airline Passenger Traffic data used by the US Census Bureau.Footnote 3 The Facebook Advertising Platform could also be used to monitor out-migration from a country experiencing political turbulence, such as Venezuela (Palotti et al., 2020). These examples highlight another important feature of digital trace data: their broad geographic availability. These data can be widely available also in contexts of poor traditional statistics (e.g. low- and middle-income countries); for example, the Facebook migrant variable is available for 17 of the 54 African countries (Rampazzo & Weber, 2020).

Facebook Advertising data have also provided insights on migrant integration in Germany and the USA (Dubois et al., 2018; Stewart et al., 2019). Cultural assimilation was studied through the comparison of interests expressed online by the German population and Arabic-speaking migrants in Germany (Dubois et al., 2018). Results show that Arabic-speaking migrants in Germany are less culturally similar to the German population than European migrants in Germany, but the divide is less pronounced for younger and more educated men. Similarly, cultural integration in the USA was investigated through self-reported musical interests between Mexican first- and second-generation migrants and Anglo and African Americans (Stewart et al., 2019). The comparison between self-reported musical interests highlights that education and language spoken (e.g. English versus Spanish) are key characteristics determining assimilation. However, these studies are affected by limitations linked to self-reported information and ‘black box’ algorithms estimating interests on social media platforms.

Analysis of digital traces can do more than help with estimation of current migration stocks. Non-traditional data sources can also provide insights into migration intentions, migration flows, and more. For example, Google Trends data going back to 2004 has been used to estimate migration intentions and subsequently predict flows to selected destination countries (Böhme et al., 2020). Böhme et al. (2020) complemented Google Trends with survey data to predict migration flows and intentions. Their results are robust, but the authors highlight as a limitation that the predictive power of words chosen might change over time. Moreover, the models had higher performance when focusing on countries where internet usage is high (Böhme et al., 2020).

Wanner (2021) used a similar approach with Google Trends data to study migration flows to Switzerland from France, Italy, Germany, and Spain. They found that Google Trends data can anticipate migration flows to a certain extent when actual migration is decreasing in volume. Avramescu and Wiśniowski (2021) focused on Google Trends searches related to employment and education from Romania directed to the UK, creating a composite indicator in a time series model. They obtained mixed results in terms of predictive power, stressing that knowing the context of the origin and destination countries is important to increase accuracy of the predictions. Despite the challenges, all the authors agree that Google Trends is a powerful source for estimating potential migration.

New opportunities might arise also from consumer data from the retail sector (e.g. from basket analysis). For instance, some studies show how food consumption patterns can shed light on integration aspects (Guidotti et al., 2020; Sîrbu et al., 2021). Moreover, companies such as LinkedIn, Indeed, and Duolingo provide reports on their users that might reflect migration dynamics. LinkedInFootnote 4 and IndeedFootnote 5 reports focus on economic migration, providing insights on the international job market, while DuolingoFootnote 6 featuring the most studied language per country shows, for example, how Swedish is the most popular language in Sweden or that German is the top language studied in the Balkans.

This section has looked at multiple digital data sources and what they can bring to the field of migration studies. Clearly, digital trace data have huge potential given their timeliness and wide geographic availability. However, calibrating new data sources with traditional data, and validating them against those data, is essential for using novel sources effectively in migration analysis and policy. New digital data offer possibilities to study a diverse range of topics, including the scale of migration, intentions to migrate, and integration and cultural assimilation of migrants. Given their wide applicability to often politically sensitive topics, such as migration and human displacement, social scientists should critically reflect on the risks of results being misinterpreted, or, worse, misused, and how unethical uses of the data could harm individuals, particularly those in vulnerable situations, and infringe upon their fundamental rights (Beduschi, 2017). While many of the applications of computational social science to the study of migration are motivated by a potential positive impact on both migrants and the wider society, similar methods could be used to limit freedom and rights of migrants (for a comprehensive analysis of ethical considerations, see Taylor, 2023).

3 New Opportunities in Migration Research

The Digital Revolution has brought not only new data sources but also opportunities to apply new methodologies or augment research possibilities. Modelling migration is necessary because of the lack of quality in migration data from both traditional and digital sources. Digital trace data needs to be calibrated with traditional data. A natural way of combining data sources is through Bayesian models; indeed, Alexander et al. (2020) suggest a framework to combine migration data from multiple sources over time through a Bayesian hierarchical model. One level of the model focuses on adjusting the bias related to non-representative data (e.g. digital trace data) for a ‘gold standard’ given by survey data (e.g. the American Community Survey). Rampazzo et al. (2021) proposed a Bayesian hierarchical model as well. Their model combines traditional and digital data considering both data sources to be biased. Both frameworks stress that digital trace data cannot be a substitute for traditional data sources and that more accurate results can be obtained through their combination, rather than replacement.
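
To make the idea of combining sources concrete, a generic two-source measurement model can be written as follows. This is a hedged illustration only and not the specification used by Alexander et al. (2020) or Rampazzo et al. (2021): the symbols, the log-normal form, and the choice to treat the survey as unbiased are all assumptions introduced here.

```latex
% mu_{it}: unobserved true migrant stock for group i at time t
% y^{S}_{it}: survey-based estimate; y^{D}_{it}: digital trace estimate
\begin{align}
  \log y^{S}_{it} &\sim \mathcal{N}\!\left(\log \mu_{it},\; \sigma_S^2\right), \\
  \log y^{D}_{it} &\sim \mathcal{N}\!\left(\log \mu_{it} + \beta_i,\; \sigma_D^2\right), \\
  \beta_i &\sim \mathcal{N}\!\left(\beta_0,\; \tau^2\right).
\end{align}
```

Here $\beta_i$ is a group-specific bias of the digital source on the log scale, and the hierarchical prior pools information across groups. Treating both sources as biased, as in the second framework described above, would amount to adding an analogous bias term to the survey equation.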

Moreover, social media could also be actively used to recruit survey respondents. Advertisements on social media can be repurposed to recruit survey participants to answer a questionnaire. Facebook and Instagram have been used to recruit survey respondents during the COVID-19 pandemic (Grow et al., 2020), members of LGBTQ+ minorities (Kühne & Zindel, 2020), and migrants (Pötzschke & Braun, 2017; Pötzschke & Weiß, 2021). Recruiting migrant respondents through traditional sampling strategies is notoriously challenging. However, social media advertising platforms such as that offered by Facebook provide the opportunity for non-probabilistic sampling of migrants, through the use of the migration variable.Footnote 7 Pötzschke and Braun (2017) used Facebook to sample Polish migrants in four European countries: Austria, Ireland, Switzerland, and the UK. In the 4 weeks during which the ads were running, a total of 1100 respondents were recruited with a budget of 500 euro. Moreover, Pötzschke and Weiß (2021) used a similar design on Facebook and Instagram to recruit German migrants worldwide; 3800 individuals from 148 countries completed the questionnaire. The advantage of this strategy is that migrant respondents can be recruited worldwide in a timely manner and with modest budgets. However, it is challenging to produce representative results as there is no control over who opts in to the survey. This necessitates techniques such as post-stratification to make the results more representative of the specific migrant population. Similar techniques are also used in traditional surveys (e.g. re-weighting, re-calibration), though with surveys on social media, the lack of probability sampling makes post-stratification a necessity.
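
Post-stratification can be sketched in a few lines of Python. The example below is a minimal illustration under stated assumptions: the strata, the population shares, and the respondents are invented, and the function name `poststratify` is hypothetical. It reweights stratum means from an opt-in sample to known population shares.

```python
def poststratify(sample, population_shares):
    """Post-stratified mean: average the outcome within each stratum,
    then weight each stratum mean by its known population share."""
    totals, counts = {}, {}
    for stratum, outcome in sample:
        totals[stratum] = totals.get(stratum, 0.0) + outcome
        counts[stratum] = counts.get(stratum, 0) + 1
    return sum(population_shares[s] * totals[s] / counts[s] for s in counts)


# Invented opt-in sample: (age group, 1 if the respondent plans to return, else 0).
# Younger respondents are over-represented relative to the assumed population.
sample = [
    ("18-34", 1), ("18-34", 1), ("18-34", 0), ("18-34", 1),
    ("35+", 0), ("35+", 0),
]
population_shares = {"18-34": 0.5, "35+": 0.5}  # assumed known, e.g. from a census

raw_mean = sum(outcome for _, outcome in sample) / len(sample)
adjusted_mean = poststratify(sample, population_shares)
```

The adjustment only corrects for the variables used to define the strata, and it requires every stratum to contain at least one respondent; self-selection on unobserved characteristics remains.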

Narratives around migration are usually investigated through qualitative interviews (Flores, 2017; Rowe et al., 2021). The proliferation of social media has also increased the volume of publicly available text that can be analysed to study general perceptions, narratives, and sentiments on a variety of topics. For instance, Twitter can be used to analyse sentiments towards migrants and migration (Flores, 2017; Rowe et al., 2021). In 2010, the state of Arizona implemented an anti-immigrant law, the effect of which was studied using 250,000 tweets with natural language processing (NLP) techniques and a difference-in-differences design (Flores, 2017). Analysing the content of the tweets, the author stressed that policies have an effect on the perception of migrants, showing that micro-blogging data are an alternative source for public opinion on migrants (Flores, 2017). In Europe as well, analysis of Twitter text data delivered insights on sentiment towards migrants, describing a situation of polarisation of opinion (Rowe et al., 2021). The data provide an opportunity to track population sentiment towards migration in close to real time and monitor shifts over time. Moreover, focusing on the language used on social media, NLP might be useful to identify migrants and study migration flows (Kim et al., 2020).
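
The logic of such a before-and-after comparison with a control group can be illustrated with a minimal Python sketch. It is not a reconstruction of the analysis in Flores (2017): the sentiment scores are invented, the function name `diff_in_diff` is hypothetical, and a real application would use a regression with controls and standard errors.

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Invented average tweet sentiment scores (higher = more positive towards migrants).
treated_pre = [0.10, 0.12, 0.08]     # state that passed the law, before
treated_post = [-0.05, -0.03, -0.07]  # same state, after
control_pre = [0.09, 0.11, 0.10]     # comparison states, before
control_post = [0.07, 0.09, 0.08]    # comparison states, after

effect = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
```

The estimate is interpretable as a policy effect only under the assumption that sentiment in both groups would have followed parallel trends without the law.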

High-intensity (e.g. weekly or monthly) time series are an opportunity to monitor change and create early alert systems for shifting migration patterns. Napierała et al. (2022) proposed a cumulative sum model to detect changes in the trend of asylum applications. The use of flow data and early warning systems could help policymakers in anticipating refugee movements and improve preparedness and management capacities, if handled ethically and responsibly. However, these data and models can also be used to make it more difficult for individuals to exercise their rights under international human rights law. Administrative data sources hold great potential for the study of migration patterns but present specific issues: for instance, their coverage is limited to the extent that people officially register or de-register from countries’ administrative systems; also, administrative records track events (e.g. asylum applications), not individuals, and are affected by issues of double-counting and biases that may affect their usability for official migration statistics. Eurostat data on the number of applications lodged in EU countries (which might also be biased) could be augmented by including digital trace data in the model, potentially increasing the ability to anticipate future trends. This approach is suggested by Carammia et al. (2022) through an adaptive machine learning algorithm which combines data from Google Trends and traditional data sources. Given their frequency, data from social media platforms and Google Trends could indeed contribute to the early identification of shifting trends and, if managed responsibly, to greater capacities of migration policymakers and practitioners to inform adequate and timely measures (Alexander et al., 2020; Martín et al., 2020).
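
A cumulative sum (CUSUM) detector can be sketched as follows. This is the textbook one-sided form, offered as an assumption-laden illustration rather than the specification of Napierała et al. (2022): the weekly counts are invented, and the `target`, `slack`, and `threshold` values are arbitrary choices for the example.

```python
def cusum_alarms(series, target, slack, threshold):
    """One-sided (upward) CUSUM.

    Accumulates how far each observation exceeds target + slack, never
    letting the sum fall below zero, and records an alarm (then restarts)
    when the sum exceeds the threshold.
    """
    alarms, running = [], 0.0
    for index, value in enumerate(series):
        running = max(0.0, running + (value - target - slack))
        if running > threshold:
            alarms.append(index)
            running = 0.0  # restart monitoring after an alarm
    return alarms


# Invented weekly asylum application counts with an upward shift from week 5.
weekly_applications = [100, 98, 103, 101, 99, 120, 125, 130, 128, 135]
alarms = cusum_alarms(weekly_applications, target=100, slack=5, threshold=30)
```

A larger `slack` ignores small fluctuations, while a larger `threshold` trades slower detection for fewer false alarms; both would need to be tuned on historical data.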

Projects like Refugee.Ai and GeoMatchFootnote 8 propose to use data-driven algorithms to assign refugees across countries and improve their integration prospects (Bansak et al., 2018). Providing examples for the USA and Switzerland, Bansak et al. (2018) describe an algorithm based on supervised machine learning and optimal matching which takes into account the refugee characteristics (e.g. age, gender, language, education) and local site characteristics. The authors bring evidence of an improvement in subsequent refugee employment outcomes (from 34 to 48%). Moreover, they suggest that the model is flexible and can focus on different integration metrics to optimise for. The matching system is described also in the context of the UK (Jones & Teytelboym, 2018). Similar systems have been suggested also in Sweden to match refugees and property landlords (Andersson & Ehlers, 2020). Nevertheless, automated decisions should always be accompanied by a human element of review to avoid risks of algorithmic bias and human rights infringements.
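
The matching step can be illustrated with a toy Python example. It assumes that a supervised model has already produced, for each refugee and site, a predicted probability of employment; those scores are invented here, the function name `best_assignment` is hypothetical, and each site is assumed to have exactly one place. The exhaustive search is only workable for very small cases, and a real system would use a dedicated assignment solver.

```python
from itertools import permutations


def best_assignment(scores):
    """Find the one-to-one assignment of refugees (rows) to sites (columns)
    that maximises the sum of predicted employment probabilities."""
    n = len(scores)

    def total(perm):
        return sum(scores[refugee][perm[refugee]] for refugee in range(n))

    best = max(permutations(range(n)), key=total)
    return list(best), total(best)


# Invented predicted employment probabilities: 3 refugees x 3 sites.
scores = [
    [0.30, 0.50, 0.20],
    [0.40, 0.45, 0.10],
    [0.25, 0.35, 0.60],
]
assignment, expected_employed = best_assignment(scores)
```

In this example the second refugee is not sent to the site where their own predicted probability is highest, because the objective is the total across all refugees. The choice of objective is therefore a normative decision and not a purely technical one.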

There is also evidence that computational methods such as machine learning and neural networks might provide insights on migration. Simini et al. (2021) suggested a gravity model with a deep neural network to predict flows of migrants and demonstrated that the model performed better than other models due to its geographic agnosticism. Moreover, convolutional neural networks might lead to new ways of fusing data and handling high-frequency data (Pham et al., 2018).
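
For readers unfamiliar with gravity models, the classical form that such approaches build on can be written in a few lines of Python. This sketch is the traditional baseline and not the deep-learning model of Simini et al. (2021); the parameter names, default exponents, and input values are assumptions chosen for illustration.

```python
def gravity_flow(pop_origin, pop_destination, distance_km,
                 k=1.0, alpha=1.0, beta=1.0, gamma=2.0):
    """Classical gravity model: the flow grows with the populations of
    origin and destination and decays with the distance between them."""
    return k * (pop_origin ** alpha) * (pop_destination ** beta) / (distance_km ** gamma)


# Invented inputs: two regions of 1 and 2 million people, 500 km apart.
flow = gravity_flow(1_000_000, 2_000_000, 500, k=1e-6)
```

In applied work the constant and exponents are estimated from observed flows. A neural network variant replaces this fixed functional form with a learned function of origin and destination features.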

4 The Way Forward

This chapter has demonstrated how the Digital Revolution has provided new data sources and opportunities to researchers. Timely data on migration are important not only for academics but also for policymakers and practitioners to design data-driven policies and programmes. The COVID-19 pandemic has stressed the importance of having timely and accurate mobility data for the study of the diffusion of the virus (Alessandretti, 2022). However, data from digital traces often lack a clear definition of what is being measured. Since such data are obtained from private companies, there may be no information available about the algorithms used to produce migration and mobility estimates, for example, about the specific criteria used to classify migrants. A clearer understanding of the construction of these measures would allow these data sources to be included in models with more precision.

In the future, it would be important to create sustainable systems for safe and secure access to the data. At the moment, much of this research is dependent on application programming interfaces (API), which as attested by Freelon (2018) might be closed suddenly. When APIs are not available, web-scrapingFootnote 9 might be a solution, but terms and conditions of the project as well as ethical implications should be taken into account. Initiatives such as the Big Data for Migration Alliance (BD4M),Footnote 10 convened by IOM’s Global Migration Data Analysis Centre (GMDAC), the EU Commission Knowledge Centre on Migration and Demography (KCMD), and the Governance Lab (GovLab) at New York University, aim to provide a platform for cross-sectoral international dialogue and for guidance on ethical and responsible use of new data sources and methods. Social Science One Footnote 11 tries to create partnerships between academic researchers and businesses. At the moment, it has an active partnership with Facebook, established in April 2018. The initiative is led by Gary King (Harvard University) and Nathaniel Persily (Stanford University). The goal is to give researchers access to Facebook’s micro-level data after having submitted a research proposal. There are significant privacy concerns from this, however, which has created delays in the process. On February 13, 2020, the first Facebook URLs dataset was made available; ‘The dataset itself contains a total of more than 10 trillion numbers that summarize information about 38 million URLs shared more than 100 times publicly on Facebook (between 1/1/2017 and 7/31/2019)’.Footnote 12 A research proposal is needed to apply for access to such datasets; this is the first step in analysing large micro-level datasets from private social media companies. Companies also often control the analysis produced with their data. 
Researchers using companies’ data have to follow strict contracts on its use and seek approval on the results before publication. The Social Science One initiative is interesting in this regard as it comes with pre-approval from Facebook. However, it also highlights the challenges of relying on Facebook-internal teams to prepare the data in a non-transparent manner: recently, Facebook had to acknowledge that, accidentally, half of all of its US users were left out of the provided data.Footnote 13 This essentially invalidated any work done with the data so far, including that of PhD students. To avoid such issues, ultimately caused by a lack of external oversight, researchers are increasingly calling for legally mandated corporate data-sharing programmes to enable outside, independent researchers to analyse and audit the platformsFootnote 14 (Guess et al., 2022).

Overall, the value of new data sources and new models should not be underestimated. However, applications of these tools for research and public policy purposes should follow high ethical and data responsibility standards. New data sources and AI-based technologies could help researchers and policymakers improve prediction abilities and fill information gaps on migrants and migration, but the use of these technologies should be closely scrutinised and comprehensive risk assessments undertaken to ensure migrants’ fundamental rights are safeguarded. The purposes of machine learning- and AI-based applications should be clearly communicated, and participatory approaches that empower migrant communities and ‘data subjects’ more generally should be promoted in research and policy domains, with a view to increasing transparency and public trust in these applications, and to providing guarantees for the protection of individual fundamental rights (Bircan & Korkmaz, 2021; Carammia et al., 2022). Many technologies come with a risk of being used to create ‘digital fortresses’Footnote 15 in which these tools keep out migrants, rather than support them. Hence, social scientists and other researchers should carefully weigh the risks and potential repercussions when using digital traces.