Introduction

International interest in measuring human migration is at an all-time high. The number of people living in a country other than their country of birth reached an estimated 272 million in 2019, an increase of 51 million from 2010 (UN DESA, 2019). The number of people forcibly displaced by conflicts and disasters is at a historic high (UNHCR, 2020). International migration is expected to continue increasing given higher levels of interconnectedness in the world due to improved communication and transport systems, protracted crises producing displacement, and structural changes such as climate change and population growth in certain world regions.

Governments worldwide have a keen interest in anticipating future migration flows and understanding the drivers of migration to plan ahead, allocate funds, attract workers and students, use remittances, facilitate migrant integration, and manage public opinion, among other issues. The increased demand for systematic measurement from policymakers has also manifested itself in two landmark policy frameworks adopted in the last decade: The Global Compact on Migration (GCM) and the Sustainable Development Goals.Footnote 1

In step with the increased salience of migration in policy circles, migration research output has also grown dramatically (Pisarevskaya et al., 2019). Between 1960 and 1980, the number of academic journals on related subjects quadrupled. In 2020, the International Organization for Migration (IOM) identified over 130 migration-related journals publishing more than 2000 journal articles in English, French or Spanish (IOM, 2020).

Policy and scholarly interest both rely on fundamental measurements of migration, i.e. how many people migrate (flows) or have migrated (stocks) within a specific time frame. Yet the popularity and relevance of migration have outpaced substantial improvements in the systematic measurement of migration, especially at the global level. Indeed, the demand for ‘evidence’ has revived long-standing calls for better data on international migration, the lack of which experts have been lamenting for decades (Bilsborrow et al., 1997; Clemens et al., 2009; Laczko, 2016; Lemaitre, 2005; Willekens et al., 2017).Footnote 2

This is the context in which a set of new data sources emerged, providing migration researchers of all disciplines with new opportunities to measure migration. The arrival of “innovative data sources”, often referred to as “Big Data” or “digital trace data”, has been described as a “migration data revolution” (Laczko & Rango, 2014) and bears much potential to complement traditional migration data (Cesare et al., 2018; Sîrbu et al., 2021). At the same time, digital data present a host of new ethical challenges for researchers that are of great concern (Beduschi, 2020; Brayne, 2018; Hayes, 2017; Latonero & Kift, 2018; Leese et al., 2021; Molnar, 2019; Zwitter, 2014).

As new researchers and policymakers flock to the field of migration and the empirical study of migration diversifies, there is a need to explain and review new migration data sources to provide a better understanding of their respective limitations and strengths. This paper reviews eight data sources in terms of their reliability, validity, scope for research, access and ethics. The aim is to familiarize experienced and incoming migration scholars with new approaches. The review should be considered an attempt to contribute towards a broader process of interdisciplinary dialogue and expanding the empirical toolbox in migration studies.

Analytical framework

Defining migration

Before mapping novel data sources, it is important to first define migration for the purpose of this study and clearly delineate the scope of the review.

Important aspects of definitions of migration are space (internal or domestic vs. international/cross-border migration), time (short term vs. long term), type (e.g. labour, irregular, forced, family, education) and form (flows vs. stocks) (Bilsborrow, 2016; UN DESA, 2017). This review takes a broad and inclusive view in line with its aim to describe a menu of novel data sources for diverse groups of migration scholars and research interests. Here, migration will be defined as the changing of residence of an individual within or outside the boundaries of a country for longer than three months. While this definition is broad, it excludes certain types of mobility such as travel for the purpose of recreation, holiday, visits to friends or relatives, business, medical treatment or religious pilgrimage (which usually do not imply a change of residence).Footnote 3 The review considers data sources providing information on both migration flows (i.e. the number of migrants entering and leaving a country (inflow and outflow) over the course of a specific period, for example, one year) and migration stocks (i.e. the total number of migrants present in a given location at a particular point in time) (see Global Migration Group, 2017). The review considers any form and channel of migration including, among others, labour, family, forced and irregular migration and is not restricted geographically. In addition to actual observed migration, the review also considers proxies for migration, such as migration intentions, plans, desires, or aspirations, that are commonly used to predict a future change of residence (Tjaden et al., 2019).Footnote 4 The review does not cover how novel data can be used to research other fields of interest to migration scholars such as integration, the causes of migration, communication, or the impact of migration on society. These fields are not primarily concerned with, and do not rely on, inferring migration flows or stocks from data (directly or indirectly).

Defining “novel” data sources

What are these “new” data sources and what makes them “newer” than the “old” sources? Two popular concepts are helpful to delineate traditional from “innovative” data sources for migration: “big data” and “digital trace data”.

Big data is commonly defined by the “three V’s”: volume, velocity, variety. Volume refers to the magnitude of data. However, there is “little consensus around the fundamental question of how big the data has to be to qualify as ‘big data’” (Gandomi & Haider, 2015: 137). Velocity refers to the rate at which data are generated, which has dramatically increased with the proliferation of digital devices such as smartphones and sensors. Variety refers to the type of data that is being generated. Big data often includes numerical data, text data, images, audio and geo-location data. These Vs are useful to describe many of the sources that are commonly associated with big data such as social media platforms (Facebook, Twitter, Instagram etc.) and online search platforms (Google). These platforms compile millions of records about their users, ranging from location and online activity to demographic user profile information. At the same time, this information is accessible in real time, sometimes even publicly through application programming interfaces (APIs). “Big data” includes social media data but is not limited to it. Google, for example, offers a search service which does not operate as a social platform.

Digital trace data are the “results of social interaction via digital tools and spaces as well as digital records of other culturally relevant materials, such as archived newspapers and Google searches including data from popular social networking sites (such as Facebook or Twitter), personal blogs, collaborative online spaces (such as Wikipedia), and data derived from mobile phone or credit card usage” (Cesare et al., 2018: 1980).

The terms “big data” and “digital trace data” refer largely to the same type of sources. However, while the term “big data” highlights the type of data that is produced, the term “digital trace data” focuses on how the data is produced, i.e. through using digital devices (Cesare et al., 2018; Hughes et al., 2016; Sîrbu et al., 2020).

New data sources are often collected by private companies for the purpose of offering services to customers. In contrast, “traditional” sources of migration data such as censuses, administrative data and surveys are traditionally collected or made available by government agencies or (publicly funded) research institutes.Footnote 5 This has far-reaching ethical and empirical implications which will be discussed in later sections.

Review criteria

The review will discuss new data sources along five domains: (1) reliability: the consistency and reproducibility of migration measurements; (2) validity: the accuracy of migration measures and the extent to which the data capture the concepts that migration researchers intend to measure; (3) scope: the breadth and depth of migration-related research that could be explored based on the respective data source; (4) accessibility: the degree to which data is accessible to researchers; and lastly, (5) ethics: the potential risk of violations of data privacy, consent and data protection principles in the data generation process and the potential risk of (unintended) harm for research subjects as a result of analysis produced based on new data sources (e.g. Beduschi, 2020; Brayne, 2018; Cesare et al., 2018; Hayes, 2017; Latonero & Kift, 2018; Leese et al., 2021; Molnar, 2019; Zwitter, 2014).

Review of innovative data sources to measure migration

Mobile phones: call detail records and GPS data from smartphone operating systems

Mobile phone Call Detail Records (CDR) can track the approximate location of individuals and, as a result, display movements across space by capturing the call signal sent to cell towers for each outgoing and incoming call (Williams et al., 2015). All caller details are anonymized. Some telecommunications providers are amenable to social research as well, and often provide documented and anonymized digital trace data from their customers to researchers interested in analysing these data (e.g. Cesare et al., 2018; Chi et al., 2020).
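
To illustrate how movements might be derived from such records, the following minimal Python sketch assigns each device a modal region per month and flags a change of residence when a new modal region persists for several consecutive observed months. It is an illustration only and not the method of any study cited here: the record layout `(device_id, month, region)`, the function names and the `min_months` threshold (loosely modelled on the three-month criterion used in this review) are all assumptions.

```python
from collections import Counter, defaultdict


def monthly_home_regions(records):
    """Return {device_id: {month: modal_region}}.

    records: iterable of (device_id, 'YYYY-MM', region) tuples, one per call,
    where region is the (hypothetical) area served by the cell tower used.
    """
    counts = defaultdict(lambda: defaultdict(Counter))
    for device, month, region in records:
        counts[device][month][region] += 1
    return {
        device: {month: c.most_common(1)[0][0] for month, c in months.items()}
        for device, months in counts.items()
    }


def detect_moves(home_by_month, min_months=3):
    """Return (month, origin, destination) tuples for one device.

    A move is flagged when the modal region changes and the new region is
    also the modal region in min_months consecutive observed months.
    Months without any calls are simply absent (the "blank spots" of CDR).
    """
    months = sorted(home_by_month)
    moves = []
    current = home_by_month[months[0]] if months else None
    i = 1
    while i < len(months):
        region = home_by_month[months[i]]
        if region != current:
            run = months[i:i + min_months]
            if len(run) == min_months and all(home_by_month[m] == region for m in run):
                moves.append((months[i], current, region))
                current = region
                i += min_months
                continue
        i += 1
    return moves
```

A short visit to another region does not count as a move under this rule, because the device returns to its previous modal region before the threshold is reached. The sketch tracks devices rather than people, so shared phones or changed SIM cards would distort its output.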

Reliability CDR provides reliable measures of migration in terms of consistency over time. Movements are recorded automatically as required by operating the telephone network. An advantage in terms of reliability is that the information on location does not rely on self-reports by individuals, which may be subject to response biases (a common issue in surveys). However, reliability issues may apply when using CDR data from different operators. This is a common issue because most countries have several telecommunication companies. As consumers switch services, measures of movement over time become less reliable. As a result, CDR is often used on narrowly defined locations and limited time frames.

Validity The key disadvantage is that such data refers to mobile devices, not individuals as such. It is possible that individuals will share the same device, or gift it to others. Furthermore, many migrants may change devices and/or SIM cards after migrating to other countries, given that service providers offer deals limited to particular countries. Therefore, most contributions using phone data have analysed mobility within narrow geographic units (cities, regions) rather than movements across borders. Furthermore, CDRs can be biased because locations are only recorded when calls are made, leaving blank spots in the migration process.Footnote 6

Scope While CDR data are usually more helpful for identifying internal (sub-national) migration patterns,Footnote 7 in some cases they can also be used to measure international migration at the sub-regional level, particularly when combined with other sources. For example, CDR have been used to track internal displacement following natural disasters such as the Haiti and Nepal earthquakes (Bengtsson et al., 2011), and the combination of CDR with satellite data can help to map movements between cross-border communities (Hughes et al., 2016). Recent work has leveraged Google Location History data for analysis on migration flows. Google Location History is collected through smartphones that operate the Google Android system and through Google services used through smartphones (e.g. Google Maps or Gmail). Pilot research suggests that this novel source of information could provide information about international migration through ‘fine scale mobility with rare, long distance and international trips’ documented through changes in location by users (Ruktanonchai et al., 2018). Using the same data, Kraemer et al. (2020) described ‘global human mobility patterns, aggregated from over 300 million smartphone users’. According to the authors, the data cover nearly all countries and 65% of earth’s populated surface, including cross-border movements and international migration. The advantage of CDR and location data through smartphone use for measuring migration is the timeliness and detail regarding the location. As such, phone records are particularly useful for studying sudden movements in defined geographic locations. Fast evolving migration situations are difficult to capture with “traditional” data sources such as sample surveys and administrative data, and impossible using censuses.

Without linking mobile phone records to other data sources, CDR provides a limited scope for migration scholars. The only information available is time and location. The type, channel and motivation for a change in location remain unobserved. It thus remains unclear who moved, why people moved, where they wanted to go, who they travelled with, through which channel they travelled, and whether they are likely to stay in their current location. This lack of context information is a key shortcoming compared to “traditional” sample survey research.

Accessibility CDR data is not commonly available to researchers and access depends on the willingness of telecommunication companies to collaborate. Different operators in different countries may need to comply with different data protection legislation, limiting the extent and level of detail of data that can be shared. In addition, access is often tied to large fees.

Ethics CDR data poses serious ethical concerns. When entering a mobile phone contract or installing a smartphone operating system, many users may not be aware that their location data is collected and analysed for various purposes (Beduschi, 2020; Brayne, 2018; Molnar, 2019). Such data uses are often hidden in the fine print. Since telecommunication and smartphone operating system providers are often private companies, there is a lack of transparency about what companies do with the data. In many countries, governments can mandate companies to provide access to data, for example, for the purpose of criminal investigations (Brayne, 2018). In the field of migration, granular CDR data can be used to target humanitarian assistance to specific populations in specific locations; however, it could also be used by authorities to enforce immigration policies, protect borders and identify individuals entering or residing in a country with an irregular status.

Social media

Geo-located social media activity, such as activity on Facebook (Zagheni et al., 2017), Twitter (Chi et al., 2020; Fiori et al., 2017; Martin et al., 2020; UN Global Pulse, 2017; Zagheni et al., 2014), Skype (Kikas et al., 2015), or LinkedIn (State et al., 2014), has been used to infer migration flows and stocks based on the location where users log in or information on location provided by the users themselves through geo-tagged posts or profile information (e.g. nationality or birthplace).

Movement is usually inferred based on changes users make to their self-reported location on the respective platform, or changes in location of log-ins. For example, data from the Facebook advertising platform can yield information on ‘home country’ and country of current residence. This means that Facebook could be used as a ‘real-time census’ to estimate, among other things, the number of users classified by the social media platform as ‘expats’ (users living in a country other than their ‘home country’) at the national or global level at a certain point in time (Zagheni et al., 2017). Using changes in Facebook users’ locations over time, others have identified the increase in the number of Venezuelan migrants in Spain in early 2018, confirmed by official statistics from the Spanish National Statistical Office (Spyratos et al., 2019).

Reliability There are a host of reliability concerns involved in measuring migration using social media data. First, certain segments of the population may be over- or under-represented (for instance, on average, young people are more likely to use Facebook than older people).Footnote 8 Second, even frequent users may choose not to provide information on their past and current location. Certain types of migrants may deliberately avoid providing information on their location on social networks. Third, it is difficult to verify whether changes in location are accurate, given that this information is sometimes self-reported on a voluntary basis. Fourth, the user base of social media providers constantly changes, which complicates analysis of trends over time (see e.g. Cesare et al., 2018).

Validity With many kinds of social media data, there is a lack of transparency on how key measures relevant for migration are generated. For example, there is limited information on how Facebook identifies who is an “expat” or how it labels users as speakers of a different language. This complicates meaningful interpretation of migration patterns observable in the data.

Scope The advantage of geo-located social media data is that, in many countries, certain social media platforms are wildly popular, so that real-time data on large volumes of movements can potentially be accessed. Such data may be particularly useful to study broader migration trends. The level of detail provided by geo-coded social media data is limited in many cases but more extensive compared to CDR data. For example, Facebook provides aggregate-level information on the number of users with specific characteristics such as age, gender, or even education and income proxies as well as a vast range of preferences (measured via users’ “likes” of particular pages). Changes in the number and characteristics of people reported as living in a specific place are used by researchers to infer ‘migration flows’, assuming that changes in the ‘stock’ of people who report living somewhere imply that people moved from countries with lower stock numbers to countries with higher stock numbers. Information on friendship networks across countries, recently made available by Facebook, may be used in the future to forecast cross-country migration trends (Tjaden et al., 2021). Despite the availability of additional characteristics, the data provide no information about the causes, means, or consequences of migration. There are attempts by governments, law enforcement agencies, international organizations and research institutes to monitor the social media activity of migrants before, during and after migration to understand changes in migration patterns (Brenner & Frouws, 2019; Dekker et al., 2018; Sanchez et al., 2018).Footnote 9
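
The logic of inferring flows from changes in stocks can be sketched in a few lines of Python. The example below is a simplified illustration, not the procedure used in the studies cited above: the function name, the dictionary layout keyed by `(home_country, current_country)` and the optional `penetration` adjustment are assumptions made for this sketch.

```python
def stock_changes(stock_t0, stock_t1, penetration=None):
    """Difference in user stocks between two snapshots.

    stock_t0, stock_t1: dicts mapping (home_country, current_country) to the
    number of platform users observed at two points in time.
    penetration: optional dict mapping current_country to the assumed share
    of the resident population using the platform (0 < share <= 1).
    """
    changes = {}
    for key in set(stock_t0) | set(stock_t1):
        diff = stock_t1.get(key, 0) - stock_t0.get(key, 0)
        if penetration is not None:
            # Scale user counts up to the population. This is a strong
            # assumption: migrants are taken to use the platform at the
            # same rate as everyone else in the country of residence.
            diff = diff / penetration[key[1]]
        changes[key] = diff
    return changes
```

A positive difference for a country pair is consistent with a net inflow, but it may equally reflect people joining or leaving the platform, or changes in how the platform classifies users. The result should therefore be read as a rough indication of net change in stocks, not as a measured flow.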

Access Many social media companies offer public APIs to allow access to certain parts of their data to researchers. In many cases (e.g. Facebook, Twitter), access can be obtained at no cost, which is a substantial advantage over traditional sources such as censuses, administrative data and surveys. However, access modalities can change at any given time because data is provided by private companies, rather than taxpayer-funded government or research bodies that are mandated to provide systematic data over time.

Ethics Users of social media are often unaware of the data that is being collected about them and there is a general lack of understanding of how such data is and can be used by companies themselves or third parties (Cesare et al., 2018; Zwitter, 2014). Migration enforcement agencies may use such data for surveillance purposes, which is a particularly serious concern in contexts of irregular migration and forced displacement. Agencies could monitor communication of specific groups or individuals on Twitter and Facebook to identify irregular migrants and track them during their journeys. Companies such as Facebook, however, only allow researchers access to anonymized, aggregate-level data, which limits the possibility of using data to harm individuals. Any information on narrowly defined locations and groups becomes inaccessible if the underlying target population falls below a threshold that risks identifying specific individuals. However, this does not apply to attempts to monitor public communication in social media groups indicating changes in migration patterns. The European Asylum Support Office (EASO) has suspended its efforts to monitor communication of migrants on social media following concerns raised by the EU’s own data protection body.Footnote 10

Email IP addresses

Repeated logins to the same website and IP addresses from e-mail activity have also been used to estimate international mobility patterns and users’ likelihood to move to another country (Zagheni & Weber, 2012). Rather than self-reported location by the user, certain online services such as email providers collect data on where users log into their accounts.

Reliability and validity The same limitations in terms of reliability and validity apply as for social media data. Similar to log-ins to social media, log-ins to email accounts are usually recorded via devices (IP addresses), not necessarily people. For example, it is possible, yet presumably rare, that various people use the same email account, which will distort any aggregate measure of migration.

Scope The scope of potential migration analysis is further reduced in the case of email log-ins given that additional socio-demographic and socio-economic information about the users (which is available for Facebook) is lacking or not publicly accessible.

Access Most email providers also do not provide public APIs that make data available to researchers. Email communication is considered personal and private communication whereas some communication on social media platforms is (intentionally or unintentionally) made public by users.

Ethics Similar to social media data, there are issues concerning consent and data privacy. Users may not be aware that email providers track their location. In other cases, governmental enforcement agencies may mandate companies to share the content of emails of specific individuals for the purpose of criminal investigations or intelligence, which bears the potential for misuse, including in the case of migrants in irregular settings (Brayne, 2018).

Online search data

Online search data has also been used more recently to study migration. Records on Google searches, for example, have been explored to forecast the number of arrivals of asylum-seekers in Europe (Connor, 2017) or internal migration within the U.S. (Lin et al., 2019). Search data generated through Google’s online search platform can be exploited to measure migration intentions and predict subsequent emigration flows (Böhme et al., 2018; UN Global Pulse, 2014). For example, researchers retrieve data on how many times individuals in country A have ‘googled’ a term that the researcher believes to indicate an intention to migrate (to country B), for example, ‘jobs’, ‘visa’, or the name of the destination country.
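
One simple way to check whether search interest precedes movement is to correlate a monthly search index with migration flows recorded some months later. The following Python sketch illustrates this idea under stated assumptions: both series are hypothetical, equally long lists of monthly values (for example, a search index for a chosen term and official arrival counts), and the function names are made up for this illustration. The cited studies use more elaborate models.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def lagged_correlation(search_index, flows, lag):
    """Correlate search interest in month t with flows in month t + lag."""
    if len(search_index) != len(flows):
        raise ValueError("series must cover the same months")
    if lag < 0 or lag > len(flows) - 2:
        raise ValueError("lag leaves fewer than two overlapping months")
    x = search_index[:len(search_index) - lag]  # drop the last `lag` months
    y = flows[lag:]                             # drop the first `lag` months
    return pearson(x, y)
```

Comparing the coefficient across several lags gives a first impression of how far ahead search activity might signal movement. A high correlation does not by itself establish that the people searching are the people who later move.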

Reliability Google searches are recorded consistently and provide high reliability in terms of the measure as such. The main advantage of search data is that Google’s search engine is widely used across the globe and has been successfully used to study other social behaviours (e.g. flu outbreaks). Despite broad coverage, important countries (e.g. China) are missing entirely. Issues of reliability emerge regarding applicability across various country contexts, languages and specific populations. Preliminary research in this area suggests that online searches (e.g. via Google) are related to actual movement at the aggregate level, yet the selection of specific search terms in various country contexts appears to be highly important. Syrians looking for ways to flee to Europe ‘google’ different terms than Canadians looking for a job in the US. The meaning of the same search terms may also vary in different languages. Overall, this means that Google searches may be indicative of migration from a certain country to another country, but are difficult to scale up to multiple migration contexts (see Tjaden et al., 2021).

Validity Online search data has one obvious shortcoming: ‘searching’ is not ‘doing’. Just because someone looks up information on another country or, more explicitly, gathers information on how to move to another country, does not mean that they will actually move. Search data (similarly to survey data on emigration intentions) are a ‘pre-behavioural’ proxy for actual migration. Some studies suggest that intentions are a good predictor for eventual migration (Van Dalen & Henkens, 2013; Tjaden et al., 2019), but research also suggests that the strength of the predictor varies considerably based on where migrants are from and where they want to go (Tjaden et al., 2019).

Scope A major disadvantage of Google search data is the high level of aggregation at which data is made available. Search data is made available at the population level for countries or, in certain countries like the US, for subregions. Search data does not include any additional information about those who show interest in migrating, and thus renders any individual-level analysis impossible.

Access Google search data is freely and publicly accessible via the Google Trends platform and API.

Ethics The potential risk of misuse of data is limited given the high level of aggregation and anonymity of data which the company makes available. Serious concerns would arise when data for specific locations and IP addresses is used to infer individual level migration behaviour. Google itself is analysing individual-level location data to provide targeted advertisements to users who use their search engine. However, there is a lack of transparency in terms of the conditions under which such data may be shared with governments or other third parties. In addition, usual concerns around unawareness among users about the usage of their data apply.

Bibliometric data

Bibliometrics is a field of research that uses statistical methods to systematically analyse publication records (books, articles etc.). One sub-field of bibliometrics, scientometrics, is the analysis of scientific publications. Detailed information about academic output is recorded and made accessible through scientific databases (e.g. Scopus, Web of Science, Google Scholar and others). This information has been used to model the international mobility of academics (Czaika & Orazbayev, 2018; Laudel, 2003; Moed & Halevi, 2014; Sudakova & Tarasyev, 2019; Wang et al., 2019). Changes in researchers’ affiliations to institutions located in different countries indicate migration.
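
The idea of reading migration from affiliation changes can be illustrated with a short Python sketch. It assumes a hypothetical, already disambiguated list of `(author_id, year, country)` records, one per publication. The function name and record layout are assumptions, and the sketch does not reproduce the method of any study cited above.

```python
from collections import defaultdict


def affiliation_moves(publications):
    """Return (author_id, year, origin, destination) for each country change.

    publications: iterable of (author_id, year, country) tuples, where
    country is the country of the affiliation stated on the publication.
    """
    by_author = defaultdict(list)
    for author, year, country in publications:
        by_author[author].append((year, country))

    moves = []
    for author, pubs in by_author.items():
        pubs.sort()  # chronological order (ties broken by country name)
        for (_, c0), (y1, c1) in zip(pubs, pubs[1:]):
            if c0 != c1:
                # The move happened at some point between the two
                # publications; it is dated to the later one.
                moves.append((author, y1, c0, c1))
    return moves
```

Authors with several affiliations in the same year, or with long gaps between publications, would produce spurious or poorly dated moves in this sketch, which is one reason why self-reported affiliation data needs careful cleaning.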

Reliability Measuring migration through changes in affiliations is consistent and reliable. Scientists have an interest in publishing their work in recognized journals and books, institutions have an interest in researchers indicating their home institution, and most research outlets make it mandatory for authors to provide this information. Nevertheless, the measure depends on the accuracy of self-reported affiliations, which can be outdated.

Validity Migration analysis based on bibliometric data has the potential to collect additional context information including socio-demographic characteristics of the professionals (age, gender, ethnic origin, for example, may be inferred based on name recognition algorithms and web scraping individual professionals’ web pages). Additional information about the universities, faculty and chair may be matched with additional effort.

Scope The drawback of this data source is its restriction to a narrowly defined group of professionals (i.e. academics) for whom public access to their affiliation is the norm. However, it may be possible to extend this approach to other groups of professionals where public information on affiliations is common (e.g. athletes, musicians).

Access Bibliographic data has become available through the digitalization of entire libraries, records of publishers, academic journals, and ambitious projects such as Google Books and Google Scholar that aim to record every academic publication. Most academics provide their affiliations publicly to gain visibility and broaden their reach.

Ethics Compared to previously described sources, ethical concerns are limited because the personal information used for analysis is provided voluntarily and knowingly. The population is restricted to regular labour migrants which limits the potential for misuse by authorities.

Remote sensing technologies

Remote sensing is an umbrella term for collecting information about something without making physical contact. In current usage, remote sensing refers to the use of satellite or aircraft-based sensor technologies (e.g. drones). Remote sensing is commonly used in geography, earth sciences, climate research, agricultural studies, wildlife studies, military, and intelligence gathering, but also increasingly for urban planning, tourism, commerce, and various humanitarian applications (Miller et al., 2019). Changes in human activity visible in the images (e.g. settlements, refugee camps, light emissions at night) can be used to infer mobility.

Reliability If applied consistently, the approach to measuring migration using remote sensing technology by averaging physical quantities over pixels can yield reliable migration measures. Algorithms automatically detect changes in visual patterns on satellite or drone images over time. For example, the population size of settlements can be estimated by counting rooftops visible on satellite/drone images. Depending on the proximity and resolution of the image, individuals within certain localities can be identified. Comparing images over time can be used to estimate immigration and emigration into a certain, narrowly defined, location.
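
The rooftop-counting approach reduces to simple arithmetic once structures have been detected in each image. The Python sketch below illustrates this step only. It assumes that rooftop counts per image date are already available from a detection algorithm, and that an average number of persons per structure (`persons_per_roof`, a hypothetical parameter that would have to come from field data) can be applied uniformly.

```python
def population_changes(rooftop_counts, persons_per_roof):
    """Estimate population change between consecutive images.

    rooftop_counts: list of (iso_date, number_of_rooftops) tuples.
    persons_per_roof: assumed average occupancy per structure.
    Returns a list of (start_date, end_date, estimated_change) tuples.
    """
    ordered = sorted(rooftop_counts)  # ISO dates sort chronologically
    return [
        (d0, d1, (n1 - n0) * persons_per_roof)
        for (d0, n0), (d1, n1) in zip(ordered, ordered[1:])
    ]
```

The estimate attributes every change in the number of structures to arrivals or departures. Births, deaths, changes in occupancy and construction without population change are ignored, so the output is at best a rough proxy for net migration into a narrowly defined location.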

Validity The obvious downside of satellite and drone images for measuring migration is that no additional individual-level information about migrants is available: who is moving, from where, to where, and how. By itself, remote sensing provides information on how many tents, rooftops or individuals are present in a certain locality, but no information about what happened when there are fewer dots and shadows the next time new images become available.

Scope There is a rapidly growing number of examples with relevance for migration studies. First, drones and satellite images inform policies and direct aid to refugees. For instance, the United Nations Institute for Training and Research (UNITAR) mapped refugee camps in Jordan and elsewhere with its Operational Satellite Applications Programme.Footnote 11 Civil society organizations such as Human Rights Watch or Amnesty International use satellite imagery to document humanitarian needs of displaced populations at borders or in refugee camps by measuring the growth of settlements.Footnote 12 In this case, satellite images are providing an indication of where aid and assistance are most needed (Bitelli et al., 2017; Quinn et al., 2018; Shatnawi et al., 2020; Tiede et al., 2017).

Satellite imagery also forms a key part of the ‘smart border’ agenda, which attempts to use modern technology to improve border management around the world and track ‘illegal’ crossings. Systems relying on remote sensing were developed “to assist border authorities with more effective surveillance and reliable decision-making support” (Al Fayez et al., 2019). In contrast, civil society organizations use the same technology to monitor deaths and violations of migrants' rights at the maritime borders of the EU.Footnote 13

For the moment, remote sensing appears to be most useful for informing operations on the ground (managing refugee camps, targeting humanitarian assistance, managing borders etc.) and less for research on migration per se. The technology can also be used to monitor slow-onset emigration driven by changes in climate, which can likewise be inferred from images.

Access With improvements in the quality and accessibility of satellite imagery (Popkin, 2018) provided by the European Space Agency, NASA, and others, researchers are also exploring ways to use remote sensing data to measure human migration globally. Public and private bodies offer access to satellite imagery for research purposes and tech companies offer cloud computing power to conduct complex and demanding analyses within minutes.Footnote 14 Depending on the specific data provider, access can be free of charge to research institutes or come with a fee.

Ethics Ethical issues are a key concern for remote sensing technologies because information is collected without the knowledge or consent of individuals. New high-resolution satellite imagery and drone images can identify individuals using face recognition technology. Law enforcement, policing and intelligence agencies use such approaches (Brayne, 2018; Hayes, 2017; Leese et al., 2021; Molnar, 2019), which raises serious concerns regarding the situation in undemocratic countries with low data protection standards and policies aiming to suppress and control groups in society. Drones may also be increasingly operated by companies in addition to governments, which raises concerns over unknown privacy violations by non-governmental actors.

International air travel

At first glance, international air passenger traffic belongs to the realm of tourism and transport studies, not migration (see Sect. 2.1.). However, there have been attempts to use this information to infer migration flows. For example, Gabrielli et al. (2019) used dyadic monthly air passenger traffic between 239 countries and territories worldwide from January 2010 to March 2018 to estimate the number of passengers on commercial flights operated globally. The study explored whether a surplus in travel (increase in travel from A to B but no increase in return travel from B to A within a year) can be linked to migration flows.
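The surplus logic can be sketched in a few lines. This is not the estimation procedure of Gabrielli et al. (2019), only an illustration of comparing outbound and return traffic for a country pair over a year. The passenger counts are invented and the function name `travel_surplus` is hypothetical.

```python
from collections import defaultdict


def travel_surplus(monthly_flows):
    """Return the annual net passenger surplus per ordered country pair.

    monthly_flows: iterable of (origin, destination, passengers) tuples,
    one per month and direction. Only pairs with a positive surplus
    (more outbound than return travel) are reported.
    """
    totals = defaultdict(int)
    for origin, destination, passengers in monthly_flows:
        totals[(origin, destination)] += passengers

    surplus = {}
    for (origin, destination), outbound in totals.items():
        # .get() avoids inserting keys into the dict while iterating over it
        returning = totals.get((destination, origin), 0)
        net = outbound - returning
        if net > 0:
            surplus[(origin, destination)] = net
    return surplus


# Illustrative monthly counts between two hypothetical countries A and B
flows = [
    ("A", "B", 1200), ("B", "A", 1000),
    ("A", "B", 1300), ("B", "A", 1100),
]
result = travel_surplus(flows)
print(result)
```

A positive surplus is only a candidate signal: without information on length of stay or purpose of travel, it cannot be read as a migration flow on its own.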

Reliability Air passenger data is highly standardized and consistent as it is subject to international industry standards.

Validity Passenger data does not measure migration directly and can only be used to infer different types of migration. Since air passenger data does not allow researchers to track individual passengers or specific cohorts on the basis of their date of entry, researchers need to make assumptions about the passengers' length of stay. This is problematic because the publicly available data does not indicate who the passengers are, how long they will stay in the country, on which visa they are travelling etc. In addition, flight passenger data offers a selective picture of global mobility: 44 percent of registered cross-border trips occur through commercial flights, and this proportion increases with the distance between countries (Recchi et al., 2019).

Scope Overall, the data may be used to estimate international migration flows if combined with additional data sources. At the moment, the research is still in its exploratory stage and the methods appear underdeveloped. In the future, this approach may bear the potential to measure the level of visa overstayers between countries, one indicator of irregular migration.

Access Air passenger data is collected by airlines, some of which make it available for purchase. The EU recently made a public and free dataset available.Footnote 15

Ethics Ethical concerns regarding flight data in its currently available form are limited. Flight data is aggregated at the country and month level and anonymized. Currently, misuse to the disadvantage of individuals is unlikely.

Online news data

New advances in technology have made available online news aggregators such as Google News or the Global Database of Events, Language, and Tone (GDELT).Footnote 16 Such platforms monitor the world's news media from nearly every corner of every country in print, broadcast, and web formats. This data has the potential to capture acts of past or prospective migration that were not covered in traditional sources such as administrative data or surveys.

Reliability Migration measures based on online news aggregator data can be considered reliable to the extent that algorithms deriving information on migration apply consistently across all countries and news sources. The issue is that the success of the algorithm in detecting migration may vary by country, by quality of the news outlets, by language and type of migration to be covered. In addition, algorithms may capture the same migration events several times as the same event may have been covered by several news outlets.
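The double-counting problem can be illustrated with a minimal sketch. It assumes, purely for illustration, that two reports describe the same event when they share date, location and event type; real aggregators would need fuzzier matching. The field names and example records are hypothetical and do not reflect the GDELT or Google News schema.

```python
def deduplicate_events(reports):
    """Collapse news reports that describe the same event.

    reports: list of dicts with 'date', 'location', 'event_type' and 'outlet'.
    Returns a dict mapping each unique event key to the sorted list of
    outlets that covered it.
    """
    events = {}
    for report in reports:
        # Normalise the location so trivial spelling differences do not split events
        key = (
            report["date"],
            report["location"].strip().lower(),
            report["event_type"],
        )
        events.setdefault(key, set()).add(report["outlet"])
    return {key: sorted(outlets) for key, outlets in events.items()}


# Illustrative reports: the first two cover the same event
reports = [
    {"date": "2020-03-01", "location": "Town X", "event_type": "displacement", "outlet": "Outlet 1"},
    {"date": "2020-03-01", "location": "town x ", "event_type": "displacement", "outlet": "Outlet 2"},
    {"date": "2020-03-02", "location": "Town Y", "event_type": "displacement", "outlet": "Outlet 1"},
]
events = deduplicate_events(reports)
print(f"{len(reports)} reports describe {len(events)} unique events")
```

Counting reports instead of unique events would overstate migration activity in places with denser media coverage.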

Validity The large volume of news articles required to collect information on migration encourages researchers to use language processing algorithms. The emerging evidence is still unclear on how accurately such algorithms may detect events that actually capture migration.

Scope Approaches are still very recent but several uses of this data are available. Carammia et al. (2020) have used the GDELT database to measure political, social, economic “push factor” events that could motivate people to leave their country. In combination with other data sources, they attempt to forecast displacement and migration with a view to setting up early warning systems currently under development in the EU.Footnote 17 In a similar vein, the Internal Displacement Monitoring Centre (IDMC) uses the GDELT database to track internal displacement.Footnote 18 The IOM is also experimenting with such data to improve analysis of the number of migrants that went missing along their journeys, such as in the IOM’s Missing Migrants Project (Borja & Black, 2021). Apart from eye-witness reports, news articles are the main way to systematically collect data on migrant fatalities and bring this tragic topic to the attention of policymakers.

Access GDELT and Google News can be accessed online free of charge for researchers.

Ethics Ethical concerns arise in countries with low standards of journalism and data privacy. It is possible, for example, that the identity of individual migrants is revealed in a news article and picked up by automated text analysis. In theory, this information could be used by enforcement agencies to press charges in case of irregular migration or used by smugglers for debt collection. Even when interviewees provide consent for their personal information to be used, they may not be aware that their information may enter migration databases. Such abuses are possible with traditional media sources. However, digital applications may exacerbate the problem by providing cheaper, faster and broader access to data.

Discussion and conclusion

This review highlighted both the enormous opportunity of “big data” and “digital trace data” to complement traditional sources of migration data (see Cesare et al., 2018; Hilbert, 2016; Hughes et al., 2016; Laczko & Rango, 2014; Rango & Vespe, 2017; Sîrbu et al., 2021) and the main challenges and risks associated with such data. Several broader conclusions can be drawn from the above discussion of the eight sources (see Table 1 for a summary):

Table 1 Overview of measurement approaches

The main advantages of digital data sources for migration scholars are captured by the first two V’s of the “3 V’s definition” of big data introduced in Sect. 3.2: Volume and velocity. In some cases, digital data sources provide information on millions of individuals in almost real-time. This provides migration researchers with the possibility to explore migration trends where administrative data sources, surveys and censuses (sources traditionally used for inferring migration) are not available (such as in many low-income contexts), not accessible or too slow (for example in contexts of displacement and forced migration that are unfolding rapidly). Another key advantage is granularity: digital data sources provide “high-resolution” information. Digital data sources, especially approaches leveraging remote sensing technologies or mobile phones, often allow researchers to zoom into migration events at the sub-national, sub-regional or even local level. Nationally representative survey data, by contrast, often lacks sufficient sample size to disaggregate to the level of regions, districts or cities.

Unlike “traditional” sources, new data sources often make no distinction in terms of the “legal residence status” of individuals. Anyone using a digital service is included in the data. As a result, the volume, granularity and status-agnostic nature of many digital data sources offer new opportunities to collect information on “hard-to-reach” populations such as recent migrants, displaced or forced migrants and irregular migrants, who are often excluded from ‘official’ data sources such as population registers or surveys (Cesare et al., 2018; Massey & Capoferro, 2004; Reichel & Morales, 2017). Lastly, a key advantage to migration scholars is the fact that many data sources are accessible online and free of charge.

The review has also highlighted major challenges associated with using digital data sources for inferring migration. ‘Digital trace’ data is largely collected by private companies who offer services to their users and use user data to target advertisements or sell data to third parties. These data are not designed for research purposes. This has important implications both on ethical grounds (Beduschi, 2020; Brayne, 2018; Cesare et al., 2018; Hayes, 2017; Latonero & Kift, 2018; Leese et al., 2021; Molnar, 2019; Sîrbu et al., 2020; Zwitter, 2014) and on empirical grounds in terms of reliability and validity of migration measurement (Cesare et al., 2018; Ruel et al., 2016; Sîrbu et al., 2020).

First, there are severe ethical concerns regarding the use of digital data sources, including the limited awareness of users regarding the extent of and purposes for which their data is being used and the risk of harm for individual migrants in cases where information is used by law enforcement, border management, intelligence agencies or smugglers (Brayne, 2018; Hayes, 2017; Molnar, 2019). The first rule that a researcher must follow is to acknowledge that data are people and can do harm (Sîrbu et al., 2021). Data may include information on particularly vulnerable groups such as refugees, irregular migrants, and persons displaced by disasters. Some may be persecuted by authorities in origin and destination countries. Researchers must ensure ethical standards for data use that protect vulnerable groups from identification and possible discrimination (Cesare et al., 2018).

Violations of data privacy and protection standards are especially concerning in undemocratic countries with low data protection standards, limited rule-of-law, and a lack of democratic norms. In extreme cases, new technologies can enable “digital authoritarianism” and “Orwellian state surveillance” (Dragu & Lupu, 2020). Three examples illustrate the extent of the risks. First, China’s social credit system leverages smartphone location data, social media communication, travel records, purchase records, camera data and facial recognition, among others, in combination with various administrative records to assign a social credit score to its citizens. Low scores could be used to prevent access to a passport or visa needed to leave the country.Footnote 19 Second, after the departure of US troops, the Taliban have reportedly considered using US-made digital identity technology to persecute Afghans who have worked with the international coalition. Backed by millions in donor funding, Afghanistan’s National Statistics and Information Authority launched a digital biometric identity card including fingerprints, iris scans and a photograph, as well as voter registration databases.Footnote 20 Third, in 2018, Bangladesh shared data on hundreds of thousands of Rohingya refugees collected by UNHCR with Myanmar, which were then used to facilitate potential repatriation.Footnote 21

Unlike states, researchers often do not have access to individual-level information from digital sources. However, they must be aware of the potential harms for individuals and groups when using digital data sources and must review the data providers’ data protection and privacy standards. Researchers are advised to seek ethical approval by a scientific committee when dealing with digital data and migration.

Apart from ethics, the review has highlighted many empirical challenges. Reliability and validity of digital data sources for inferring migration must be considered when engaging in research. It is often not transparent how exactly key measures of migration are generated (e.g. Facebook, Google). There are also concerns regarding “generalizability” of digital data as the user base of various digital services is often selective and does not represent the general population at large (Cesare et al., 2018; Sîrbu et al., 2020). Digital data is often made available at highly aggregated levels. This severely limits the analytical potential for migration scholars who are often interested in measuring migration at the individual (micro) level. Moreover, digital data often remains “thin”, offering very little additional information beyond time and location, such as socio-demographic or socio-economic characteristics of migrants, the reasons and channels of migration or the expected duration of stay. This further limits its analytical use, especially if the data is not combined with other data sources. Traditional surveys usually do not face these issues (but many others) as they are tailored for answering specific research questions.

Lastly, many populations are excluded from a variety of digital data sources despite technological advances worldwide. For data relying on digital traces such as social media or online searches, a comprehensive picture of global mobility is still a long way off, as smartphone user penetration reached only 38.5 percent in 2020 and half the world population is still offline today.Footnote 22

Looking ahead, several trends are already unfolding. First, new breakthroughs in measuring migration will stem from a combination of different sources (e.g. Alexander et al., 2020; Sîrbu et al., 2020; Snijders et al., 2012). Given the ‘representativeness’ issue of digital trace data, traditional data is needed for “ground truthing” (i.e. cross-validating data by comparing it with other ‘official’ data sources). To address the “thinness” of digital data, combining data provides large opportunities to add richness to the analysis. One example is using social media, online search, or mobile phone data to locate migration events or patterns and then target surveys in certain geographies or adjust administrative data collection accordingly (Alexander et al., 2020).

The second trend is further convergence and integration of academic disciplines around the issue of measuring migration (e.g. Miller et al., 2019). Different disciplines such as earth sciences, climate research, security studies, tourism and transport studies, computer science, sociology, economics, demography, ethnography, library sciences and political science bring different tools, methodologies and technologies to the table which will likely see the field become even more interdisciplinary. As a result, more interdisciplinary dialogue is needed to advance the field.

As the field is changing at a dizzying speed, this paper attempted to provide a brief overview and reflection on the main new digital data sources. The aim of this review was to provide incoming migration researchers with a menu of options and seasoned researchers with an update on new approaches. The information provided assists researchers in making difficult trade-offs when approaching their research question and gives policy analysts a broad understanding of the limitations of the data they use.

The review has two obvious limitations. First, given the complex and rapidly growing field, it seems near impossible to cover every existing approach and to cover the literature in all its breadth. The review is focused on the main approaches without any claim to be exhaustive. Second, the field is rapidly evolving. This means that new research will have become available already by the time of publication.

The toolbox for migration researchers will become bigger, more diverse, but also more powerful due to new opportunities of digital data. Despite the ‘gold rush’ on big and digital data, the review also cautioned migration scholars in view of the many ethical and empirical obstacles for inferring migration based on digital data sources. The review aimed to contribute to a balanced understanding of these new data sources to facilitate knowledge accumulation and interdisciplinary dialogue in the field of migration studies.