Introduction

Accurate, reliable, and timely estimates of migration indicators, such as flows and stocks, are crucial for understanding population dynamics and demographic change, for designing effective economic, social and health policies, and for supporting migrants and their families. However, data on migration from traditional sources, such as censuses, surveys or administrative registers, are often insufficient. Even when these sources exist, the available data may lack the granularity required to understand migration trends, or may not be released quickly enough to monitor changes in trends.

As migration flows can change substantially over a short period of time, for example in response to a natural disaster, war or conflict, relying on outdated data is often not sufficient.

Timely and reliable information about migration stocks is important not only to understand migration patterns. It is key also to monitor fertility, population health and mortality. Even when accurate data on births and deaths exist, often demographers are faced with large uncertainty in population counts, which, in various disaggregations, form the denominators for standard demographic rates. Most of the uncertainty is driven by lack of appropriate information on how migration stocks change over time and space.

As a consequence of data availability issues, we need to consider how non-traditional data can be leveraged to complement existing sources in order to improve estimates and predictions of migration indicators over time. Previous work has explored the use of data such as call detail records (Blumenstock 2012; Pestre et al. 2020), air traffic data (Gabrielli et al. 2019), tax file records (Engels and Healy 1981) and other sources like billing addresses or school enrollment (Foulkes and Newbold 2008) to estimate migration. Additionally, an increasingly large body of work has investigated the use of social media data, from websites such as Twitter (Zagheni et al. 2014), Facebook (Zagheni et al. 2017) and LinkedIn (State et al. 2014). Provided that the data can be obtained in a reliable, timely, and ethical way, information about the users of social media websites is potentially an incredibly rich demographic data source.

Data on these populations are essentially collected in real time, and while individual-level information is usually restricted, many of the social media websites provide a certain amount of aggregate-level information through their advertising platforms (Cesare et al. 2018). In particular, Facebook’s Advertising platform allows information to be extracted on the relative size of groups by key demographics such as age, sex, location of residence and country of origin, thereby potentially acting as a measure of the relative size of migrant groups in a particular country.

While these data have clear potential for use in demographic research, with respect to timeliness and the size of the sample being considered, there are some notable issues that need to be overcome. In particular, for any given population subgroup of interest, the corresponding users of Facebook or any other social media platform are unlikely to be a representative sample. An additional challenge is to use these data in a way that meaningfully combines new ‘signal’ or information about migration trends with existing knowledge on probable migration trends from historical data sources.

This paper proposes a statistical framework to combine social media data from Facebook with traditional survey data from the American Community Survey (ACS), in order to produce timely ‘nowcasts’ of migrant stocks by state in the United States. The framework consists of a Bayesian hierarchical model which incorporates bias adjustment of the Facebook data, a demographic time series approach to account for historical trends, and a geographic pooling component which allows information about the age structure of migrants to be shared across space. The model also accounts for the different types of uncertainty that are likely to be present in Facebook and traditional survey data. The resulting model produces estimates and short-term projections of migrant stocks by US state of destination and country of origin, and is shown to outperform valid modeling alternatives.

The remainder of the paper is structured as follows. First, we briefly discuss previous demographic research which incorporates social media. Then we outline the data sources used, and in particular how the Facebook data were collected. The "Model" section discusses the model set-up, assumptions, and computation. We then present results for Mexican, Indian and German migrants by US state, and validate model performance against reasonable alternatives. Finally, the strengths and limitations of the model are discussed, together with avenues for future research.

Background

The lack of good-quality data on migration is a global problem, with data sparsity issues prevalent in both developed and developing countries (Landau and Achiume 2017). This has prompted scholars to investigate the use of other types of data to monitor migration trends. In particular, with the rise of social media use around the world, new data that have potential for demographic research have emerged.

Scholars began using social media and web data to estimate and track demographic indicators over time in the early 2010s. The earliest papers illustrated how geo-located data from email services and web-based applications such as Twitter, Google Latitude, Foursquare or Yahoo! can be used (Ferrari et al. 2011; Noulas et al. 2011; Zagheni and Weber 2012). Initial research focused on evaluating spatial mobility of populations at a city or regional level. For example, Ferrari et al. (2011) used Twitter data to study patterns of urban movement in New York. In the first effort to tackle global trends, Zagheni and Weber (2012) linked the geographic locations of IP addresses of Yahoo! emails to the user’s self-reported demographic data to estimate age- and sex-specific migration flows in a large number of countries around the world.

Recent efforts have focused on using data from social media and networking websites such as Twitter, LinkedIn and, more recently, Facebook and Instagram. These websites provide public access to an Application Programming Interface (API), which makes it possible to send requests and receive responses from these websites for data like tweet hashtag counts, the number of jobs in a certain industry, or number of cell phone users in a particular area. Researchers have utilized these APIs to extract publicly available demographic and location data for use in social research, in particular to study outcomes such as migration (Yildiz et al. 2017; Zagheni et al. 2017), fertility (Rampazzo et al. 2018), gender equality (Fatehkia et al. 2018; Garcia et al. 2018), and health (Araujo et al. 2017). For instance, Garcia et al. (2018) used Facebook data to create an index of the internet gender divide in 217 countries, showing that this indicator encapsulated gender equality indices in education, health and economic opportunity. Yildiz et al. (2017) used a combination of geo-located tweets and image recognition software to obtain estimates of internal migration in England.

In work relevant to this paper, Zagheni et al. (2017) presented a proof of concept for estimating migration stocks in the United States by age, sex and state, using Facebook’s Advertising Platform. More recently, Alexander et al. (2019) used the same type of data to track changes in migrants over time, in the context of estimating out-migration from Puerto Rico following Hurricane Maria in September 2017.

The main gap in the literature is related to the lack of a suitable statistical model for combining ‘traditional’ data sources on migrants—in the form of censuses, nationally representative surveys, or other vital statistics—with migration information from social media data. The goal of this paper is thus to develop a probabilistic framework that allows representative and historical time series to be combined, in a sound statistical framework, with non-representative—but timely—sources from social media.

Data

Facebook Advertising Data

Facebook for Business has developed a targeted advertising platform, called Ads Manager, that provides a graphical user interface to allow advertisers to micro-target specific audiences. Demographic characteristics that can be targeted include information directly reported by Facebook users, such as age or sex, and information indirectly inferred from using Facebook’s platform or affiliated websites, such as location and behavioral interests. Before launching an advertisement, an advertiser can select a variety of characteristics (e.g., Australians living in California, who are female, and aged 30–35) and get an estimate of the ‘potential reach’ (monthly active users) to this subgroup. These estimates can be obtained, in a programmatic way, for a variety of different migrant groups (i.e., by age and sex), which Facebook refers to as expatriates (‘expats’).

We use the estimates of potential reach by expat group, age and sex to track sizes of migration stocks over time. These estimates can be obtained before the launch of an advertisement, and as such are obtained free of charge. We use the Ads Manager back-end application, Facebook’s Marketing API, to extract estimates of potential reach over time programmatically with the Python module pySocialWatcher (Araujo et al. 2017). With pySocialWatcher, we collected data across 11 age groups (10 UN age groups from 15–19 to 60–65; an 11th group for the entire available Facebook population of 13–65 was also used) and three gender groups (female, male, and total population). Data were collected using Amazon Web Services (AWS) EC2 Instance servers.

As part of a broader project on using social media in demographic research, we started data collection in January 2017 and collect a new wave of data every 2–3 months. For each wave of data collection we obtained state-level estimates of all Facebook users (by age, sex, and gender) as well as state-level estimates of 90 expat groups.

American Community Survey

The American Community Survey (ACS) is an annual survey of the U.S. Census Bureau, designed to supplement the decennial census. Based on the long-form version of the census, the ACS collects information on topics including population, housing, employment and education from a nationally representative sample. Data on migrant stocks can be readily obtained from the ACS. In particular, in every year of the ACS, the survey has contained a question asking the birthplace of the person; if it is inside the United States, the state is recorded, and if it is outside the United States, the country is recorded. This birthplace variable is recorded as a three digit code to indicate the US state or country of birth. In addition to the birthplace variable, the ACS has information on current state of residence. Thus, we can tabulate the number of migrants from a particular country living in a particular state by looking at the combination of these two variables. From a modeling perspective, we are interested in the proportion of migrants from a particular origin of the total population by 5-year age groups (15–19, 20–24,..., 50–54) in each state.
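The tabulation described above can be sketched in a few lines. This is a minimal illustration only: the record layout, the field names (`state`, `age_group`, `birthplace`, `weight`) and the string codes are hypothetical stand-ins, not the actual IPUMS variable names or the three-digit birthplace codes, and the use of a person-level survey weight is an assumption.

```python
from collections import defaultdict


def migrant_proportions(records, origin_code):
    """Weighted share of people born in `origin_code` within each (state, age group)."""
    total = defaultdict(float)
    migrants = defaultdict(float)
    for rec in records:
        key = (rec["state"], rec["age_group"])
        total[key] += rec["weight"]
        if rec["birthplace"] == origin_code:
            migrants[key] += rec["weight"]
    return {key: migrants[key] / total[key] for key in total}


# Toy person records; all field names and codes are hypothetical.
records = [
    {"state": "CA", "age_group": "30-34", "birthplace": "MEX", "weight": 120.0},
    {"state": "CA", "age_group": "30-34", "birthplace": "CA", "weight": 280.0},
    {"state": "TX", "age_group": "30-34", "birthplace": "MEX", "weight": 50.0},
    {"state": "TX", "age_group": "30-34", "birthplace": "TX", "weight": 150.0},
]
props = migrant_proportions(records, "MEX")
```

Each entry of `props` plays the role of one observed proportion for a single origin group, state and age group in one survey year.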

We calculated the migrant stock proportions using the 1-year ACS for each year between 2001 and 2017 using micro-data available through the Integrated Public Use Microdata (IPUMS) US project (Ruggles et al. 2000). Standard errors around the calculated proportions based on sampling variation were calculated based on ACS accuracy guidelines (US Census Bureau 2020) and using the delta method.

Model

We are considering two data sources of migration trends in the US: data from Facebook’s Advertising Platform, and the ACS. The overall goal of the modeling strategy is to combine information from both these sources to produce estimates of current and future migrant stocks. To do this, the model should have three main characteristics. Firstly, we want to adjust for biases in Facebook data to effectively use up-to-date information on migration patterns from this source. Secondly, we want to be able to incorporate longer time series of information from the ACS. Finally, the data should be combined in a probabilistic way, in order to objectively weigh information from both sources. We propose a Bayesian hierarchical model which achieves these goals. In this section we describe the model in detail.

For a particular migrant group, define \(\rho _{x,t,s}\) to be the proportion of migrants in the total population in age group \(x\) at time \(t\) and in state \(s\). This quantity \(\rho _{x,t,s}\) is the main parameter of interest to be estimated. We have observations of this proportion, which will be denoted \(p_{x,t,s}\). The observed proportions are either from Facebook (\(p_{x,t,s}^{FB}\)) or from the ACS (\(p^{ACS}_{x,t,s}\)). The \(p_{x,t,s}\) are observed, and are assumed to be noisy measurements of the latent proportions \(\rho _{x,t,s}\), with some associated error.

Facebook Bias Adjustment

The first goal is to adjust the Facebook data to account for the non-representativeness of the Facebook user population. Previous research from Zagheni et al. (2017) showed that, while the bias in the Facebook migrant data is substantial, it is also relatively systematic by age and migrant group and can be modeled.

Following their approach, we introduce a regression model which relates the proportions of migrants in Facebook, \(p_{x,t,s}^{FB}\), to the proportions in the ACS in a similar time period, \(p_{x,t,s}^{ACS}\), plus a series of age and state variables. In particular, for a particular migrant group, express \(p_{x,t,s}^{ACS}\), on a log scale, as

$$\begin{aligned} \log p_{x,t,s}^{ACS} = \alpha _0 + \alpha _1 \log p_{x,t,s}^{FB} + \beta \mathbf {X} + \varepsilon _{FB} \end{aligned}$$
(1)

where \({\mathbf {X}}\) is a covariate matrix containing an indicator variable for each age group (15–19, 20–24,..., 50–54) and each of the 50 states plus Washington D.C. This means that we estimate a fixed effect for each age group and state. In addition, we assume that the error is i.i.d. and that

$$\begin{aligned} \varepsilon _{FB} \sim N(0, \sigma ^2_{FB}). \end{aligned}$$
(2)

Estimates of the coefficients \(\alpha _0\), \(\alpha _1\) and the vector of \(\beta\)’s are obtained using the first wave of the Facebook data and the 2016 ACS data. Once obtained, these coefficient estimates are then used to adjust subsequent waves of Facebook data, i.e. we calculate

$$\begin{aligned} \log p_{x,t,s}^{*} = {\hat{\alpha }}_0 + {\hat{\alpha }}_1 \log p_{x,t,s}^{FB} + {\hat{\beta }} \mathbf {X} \end{aligned}$$
(3)

where \(\log p_{x,t,s}^{*}\) is a ‘bias-adjusted’ version of the Facebook data. This is taken to be our ‘best guess’ of what the migrant stocks in group \(x,t,s\) are, based on the Facebook data alone. Note that an estimate of \(\sigma ^2_{FB}\) is also obtained, that is, the variance of the error terms, which becomes important in the final model (see the "Bringing It All Together" section).
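The two-step logic of Eqs. 1–3 (fit once, then reuse the fitted coefficients on later waves) can be sketched as follows. This is an illustration under stated assumptions, not the estimation used in the paper: the data are simulated, ordinary least squares stands in for the Bayesian fit, the grid is reduced to 3 age groups and 4 states, and the first age group and state are dropped from the indicators so that the design matrix has full rank.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in data on a small grid (hypothetical sizes).
n_age, n_state = 3, 4
age = np.repeat(np.arange(n_age), n_state)
state = np.tile(np.arange(n_state), n_age)
log_p_fb = rng.normal(-4.0, 0.5, size=age.size)


def design(log_p_fb, age, state):
    """Intercept, log Facebook proportion, and age/state indicators (first level dropped)."""
    age_dummies = (age[:, None] == np.arange(1, n_age)).astype(float)
    state_dummies = (state[:, None] == np.arange(1, n_state)).astype(float)
    return np.column_stack([np.ones_like(log_p_fb), log_p_fb, age_dummies, state_dummies])


# Fake "ACS" log proportions generated from made-up coefficients plus noise.
true_coef = np.array([0.3, 0.9, 0.2, 0.4, -0.1, 0.1, 0.25])
X = design(log_p_fb, age, state)
log_p_acs = X @ true_coef + rng.normal(0.0, 0.05, size=age.size)

# Eq. 1: regress log ACS proportions on log Facebook proportions and fixed effects.
coef, *_ = np.linalg.lstsq(X, log_p_acs, rcond=None)
resid = log_p_acs - X @ coef
sigma2_fb = resid @ resid / (X.shape[0] - X.shape[1])  # residual variance estimate

# Eq. 3: apply the fitted coefficients to a later (here artificially shifted) wave.
log_p_fb_new = log_p_fb + 0.1
log_p_star = design(log_p_fb_new, age, state) @ coef
```

Here `log_p_star` corresponds to the bias-adjusted log proportion and `sigma2_fb` to the residual variance carried forward into the data model.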

Time Series Modeling of ACS Using Principal Components

In addition to using data from Facebook, we also want to incorporate the relatively long historical time series of information on migrant stocks obtained from the ACS. A reasonable short-term forecast based on ACS should model historical trends and project them forward.

There are many different time series models that could be used in this context. Perhaps the simplest approach would be to project forward a moving average of the time series for each age group and state combination. Alternatively, we could use a classical Box–Jenkins approach and model the time series of migrant stocks in each age group and state separately using an appropriately specified ARIMA model. However, these methods would not place any constraints on the age structure of migration. Given this demographic context, we expect that the age distribution of migration displays strong patterns and changes in a relatively regular way over time. This is because of regularities in the age at migration as well as historical trends which include different waves of migrants, who also age over time. As such, we chose to incorporate this prior knowledge into our model through a principal components approach.

Principal component-based models have a long history in demographic modeling, with the most well-known example being the Lee–Carter mortality model (Lee and Carter 1992). The idea is that a set of age-specific demographic rates observed over time can be expressed as a combination of a series of characteristic patterns (or principal components). The Lee–Carter approach uses the mean age-specific mortality schedule and first principal component, which is interpreted as age-specific contributions to mortality change over time. This model can easily be extended to include higher-order principal components, which various researchers have done.

Apart from the Lee–Carter model and variants (e.g. Li et al. (2004), Lee (2000), Renshaw and Haberman (2006)), principal component models were recently used to estimate and forecast mortality (e.g. Alexander et al. (2017)), fertility (e.g. Schmertmann et al. (2014)) and overall population (Wiśniowski et al. (2015)). Here, we extend this idea to parsimoniously estimate and project migration stocks by age and state.

Model Overview

Age-specific migration schedules are decomposed into independent age and time components. The time component is then projected forward as a time series, taking autocorrelated error into account. We propose a log-linear model for \(p_{x,t,s}\):

$$\begin{aligned} \log p_{x,t,s}^{ACS} = \beta _{t,s,1} Z_{x,1} + \beta _{t,s,2} Z_{x,2} + \varepsilon _{x,t,s} \end{aligned}$$
(4)

where \(Z_{x,1}\) and \(Z_{x,2}\) are the first and second ‘principal components’, \(\beta _{t,s,1}\) and \(\beta _{t,s,2}\) are state- and time-specific coefficients to be estimated, and \(\varepsilon _{x,t,s}\) is an error term. The principal components are obtained via Singular Value Decomposition (SVD), as outlined in the next section. To obtain estimates of \(\beta _{t,s,1}\) and \(\beta _{t,s,2}\), we impose some smoothing over time and pooling of information across space, as outlined in the "Sharing Information Across Time and Space" section. Finally, as discussed in the "Auto-Correlated Error" section, we place a time series model on the error term, \(\varepsilon _{x,t,s}\), to account for autocorrelation.

Obtaining the Principal Components

The principal component terms \(Z_{x,1}\) and \(Z_{x,2}\) aim to capture the main sources of systematic variation in migration patterns across age. They are obtained by first creating a matrix of (logged) historical age-specific migration schedules based on ACS data from 2001 to 2016. Singular Value Decomposition (SVD) is then performed on this matrix to obtain principal components of the age-specific migration. In particular, let \(\mathbf {X}\) be an \(N \times G\) matrix of log-migration stock rates, where \(N\) is the number of state-years and \(G\) is the number of age groups. In this case, we had \(N = 51\) (50 states plus DC) \(\times 16\) years \(= 816\) observations of \(G = 9\) age groups (15–19, 20–24,..., 50–54). The SVD of \(\mathbf{X}\) is

$$\begin{aligned} \mathbf{X} = \mathbf{UDV'}, \end{aligned}$$
(5)

where \(\mathbf{U}\) is an \(N \times N\) matrix, \(\mathbf{D}\) is an \(N \times G\) matrix and \(\mathbf{V}\) is a \(G \times G\) matrix. The first two columns of \(\mathbf{V}\) (the first two right-singular vectors of \(\mathbf{X}\)) are \(Z_{x,1}\) and \(Z_{x,2}\).

For example, Fig. 1 shows the resulting \(Z_{.,1}\) and \(Z_{.,2}\) for the Mexican migrant group in the US. These were obtained via the following steps:

  1. Calculate \(p_{x,t,s}^{ACS}\), i.e. the proportion of migrants in age group \(x\), year \(t\) and state \(s\) for each age group, year and state in ACS 2001–2016.

  2. Create \(\mathbf {X}\) where each element is \(\log p_{x,t,s}^{ACS}\), every row is a state-year and every column is an age group.

  3. Perform SVD on \(\mathbf{X}\) and extract the first two columns of \(\mathbf{V}\).
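The steps above can be sketched with NumPy. The matrix here is simulated (a made-up baseline age profile plus state-year level shifts and noise), so the resulting components are purely illustrative. The sketch uses the thin SVD, in which \(\mathbf{U}\) is \(N \times G\) rather than \(N \times N\); this gives the same \(\mathbf{V}\) as the full decomposition.

```python
import numpy as np

rng = np.random.default_rng(1)
N, G = 816, 9  # state-years and age groups, as in the text

# Simulated log proportions: hypothetical baseline profile + level shifts + noise.
baseline = np.linspace(-5.0, -3.0, G)
X = baseline + rng.normal(0.0, 0.5, size=(N, 1)) + rng.normal(0.0, 0.05, size=(N, G))

# Thin SVD: U is N x G, d holds the G singular values, Vt is V transposed (G x G).
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# The first two columns of V are the first two rows of Vt.
Z1, Z2 = Vt[0], Vt[1]

# Share of total squared variation captured by each of the first two components.
share = d[:2] ** 2 / np.sum(d ** 2)

# Rank-2 approximation of X built from the first two components only.
X_rank2 = (U[:, :2] * d[:2]) @ Vt[:2]
```

Note that the sign of each singular vector is arbitrary, so `Z1` and `Z2` may come out flipped relative to a plotted version; the fitted values are unaffected because the coefficients flip sign accordingly.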

Fig. 1 Principal Components for Mexico

The principal components shown in Fig. 1 can be interpreted as a baseline migration age schedule (\(Z_{.,1}\)) and age-specific contributions to change over time (\(Z_{.,2}\)). In the model, the coefficient on \(Z_{.,1}\) (\(\beta _{t,s,1}\)) moves the overall level of Mexican migrants up or down, depending on the year and state. The coefficient on \(Z_{.,2}\) allows the age distribution to shift to older or younger ages. For \(Z_{.,2}\), the sign changes from negative to positive at age 35. This means that the larger and more positive the value of \(\beta _{t,s,2}\), the older the migrant age distribution.

Sharing Information Across Time and Space

The model specified in Eq. 4 requires the estimation of two coefficients, \(\beta _{t,s,1}\) and \(\beta _{t,s,2}\), for each time \(t\) and state \(s\). One option would be to estimate each of these coefficients separately for every year and state. However, we would like to incorporate the knowledge that trends in migration over time are likely to exhibit relatively regular patterns. In addition, for the coefficient on the second principal component, which allows the age distribution of migrants to shift to the left or right, we would like to share information about the patterns in migration across geographic space.

The coefficient on the first principal component, \(\beta _{t,s,1}\), is modeled as a random walk, i.e.

$$\begin{aligned} \beta _{t,s,1} \sim N(\beta _{t-1,s,1}, \sigma _{\beta _1}^2) \end{aligned}$$
(6)

This allows for information about the level of migration within each state to be smoothed over time. The random walk structure allows for the estimate in the current time period, \(\beta _{t,s,1}\), to be partially informed by the previous period.

For the coefficient on the second principal component, we place the following hierarchical structure on the \(\beta\)’s:

$$\begin{aligned} \beta _{t,s,2} \sim N(\Phi _{t}, \sigma _{\beta }^2) \end{aligned}$$
(7)
$$\begin{aligned} \Phi _{t} \sim N(\Phi _{t-1}, \sigma _{\Phi }^2) \end{aligned}$$
(8)

The \(\Phi _{t}\) term represents essentially a national mean; as such the \(\beta _{t,s,2}\)’s are a draw from a national distribution with some mean and variance. In this way, information about how the age distribution is ageing over time is shared across states. The more information about migration there is available for a particular state (i.e., the larger the migrant population), the less the estimate of \(\beta _{t,s,2}\) is influenced by the overall mean. Conversely, states with smaller migrant populations where the trends over time are less clear from the data are partially informed by patterns in larger states.
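As a stylized illustration of this partial pooling (a simplification, not the full model), suppose the data for state \(s\) alone would give an estimate \({\hat{\beta }}_{t,s,2}\) with known variance \(v_s\), and treat \(\Phi _t\) and \(\sigma _{\beta }^2\) as known. The symbols \({\hat{\beta }}_{t,s,2}\), \(v_s\) and \(w_s\) are introduced here for illustration only. Under these assumptions, the standard Normal–Normal result gives the conditional posterior mean

$$\begin{aligned} E(\beta _{t,s,2} \mid {\hat{\beta }}_{t,s,2}, \Phi _t) = w_s {\hat{\beta }}_{t,s,2} + (1 - w_s) \Phi _t, \qquad w_s = \frac{\sigma _{\beta }^2}{\sigma _{\beta }^2 + v_s}. \end{aligned}$$

A state with a large migrant population has a small \(v_s\), so \(w_s\) is close to 1 and the estimate stays near the state's own data. A state with a small migrant population has a large \(v_s\), so \(w_s\) is close to 0 and the estimate is pulled toward the national mean \(\Phi _t\).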

Note that the geographical hierarchical structure is not present on the first coefficient, as this represents an overall level of migration. Pooling information across space about the level of migration would artificially increase migrant proportions in smaller states.

Auto-Correlated Error

The final piece of the time series model is the error term \(\varepsilon _{x,t,s}\). This term is included in the model to allow for extra variation in migration age schedules that is not otherwise picked up by the principal components. We expect the extra variation to be autocorrelated, and as such we model the error term as an AR(1) process:

$$\begin{aligned} \varepsilon _{x,t,s} \sim N(\rho _{x,s}\varepsilon _{x,t-1,s}, \sigma _{\varepsilon }^2) \end{aligned}$$
(9)

where \(\rho _{x,s} \in [0,1]\).

Projection

The model described above is fit to ACS data from 2001 to 2016. However, estimates in more recent years can easily be obtained by projecting the time series aspects of this model forward. In particular, for time \(t+1\):

  • Obtain an estimate for \(\beta _{t+1,s,1}\) from \(\beta _{t+1,s,1} \sim N(\beta _{t,s,1}, \sigma ^2_{\beta _1})\).

  • Obtain an estimate for \(\beta _{t+1,s,2}\) from \(\beta _{t+1,s,2} \sim N(\Phi _{t+1}, \sigma ^2_{\beta })\) and \(\Phi _{t+1} \sim N(\Phi _{t}, \sigma _{\Phi }^2)\).

  • Obtain an estimate for \(\varepsilon _{x,t+1,s}\) from \(\varepsilon _{x,t+1,s} \sim N(\rho _{x,s}\varepsilon _{x,t,s}, \sigma _{\varepsilon }^2).\)

  • Calculate \(\log p_{x,t+1,s}^{ACS}\) based on Eq. 4.
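The projection steps above can be sketched as a single simulation draw for one state. This is a hedged illustration: the function name, argument names and all numerical values are hypothetical, and in a full Bayesian fit the same calculation would be repeated for every posterior sample of the parameters rather than for fixed values.

```python
import numpy as np


def project_one_year(beta1_t, phi_t, eps_t, Z1, Z2, rho,
                     sd_beta1, sd_beta2, sd_phi, sd_eps, rng):
    """One simulated draw of log p at t+1 for one state (eps_t, rho: one entry per age group)."""
    beta1_next = rng.normal(beta1_t, sd_beta1)   # random walk on the level coefficient
    phi_next = rng.normal(phi_t, sd_phi)         # random walk on the national mean
    beta2_next = rng.normal(phi_next, sd_beta2)  # state coefficient drawn around the national mean
    eps_next = rng.normal(rho * eps_t, sd_eps)   # AR(1) error, separately by age group
    return beta1_next * Z1 + beta2_next * Z2 + eps_next  # Eq. 4 at time t+1


rng = np.random.default_rng(3)
Z1 = np.array([-0.30, -0.35, -0.40])  # made-up components for 3 age groups
Z2 = np.array([-0.60, 0.10, 0.70])
rho = np.full(3, 0.8)

draws = np.array([
    project_one_year(10.0, 0.5, np.zeros(3), Z1, Z2, rho, 0.1, 0.1, 0.05, 0.02, rng)
    for _ in range(1000)
])
log_p_next = np.median(draws, axis=0)  # point projection by age group
```

Projecting more than one year ahead would repeat the same draw sequentially, feeding each simulated state into the next step.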

Bringing It All Together

The "Facebook Bias Adjustment" and "Time Series Modeling of ACS Using Principal Components" sections described two ways to obtain current ‘nowcasts’ of migrant stocks. One option would be to take the most recent data obtained from Facebook, adjust using the bias-adjustment model, and take the resulting estimate as our nowcast. Another option would be to project forward the ACS model to the time period of interest. Ideally, we would like to incorporate both sources into our final estimate. A simple way to do this would be to average the two resulting estimates. However, we would like to weigh the estimates from both sources more objectively, taking different sorts of uncertainty into consideration.

Our solution is to combine both models into one framework, and use the results from both methods as data points for our ‘best estimate’ nowcast. This is illustrated in Fig. 2. Facebook inputs are calibrated with the ACS via the adjustment model. ACS data are used to obtain principal components based on past migration data. The modeling structure allows for information exchange over time and across geographic space. The key piece of the combined model, which has yet to be explained, is the data model (or likelihood), which allows data from the different sources to have different associated error.

Fig. 2 Modeling framework

Data Model

As outlined above, we observe migrant proportions \(p_{x,t,s}\) from either Facebook or the ACS. The data model assumes

$$\begin{aligned} \log p_{x,t,s} \sim N(\log \rho _{x,t,s}, \sigma _p^2) \end{aligned}$$

i.e. the log of the observed proportion is assumed to have mean \(\log \rho _{x,t,s}\) and variance \(\sigma _p^2\), where \(\sigma _p^2\) depends on the data source:

$$\begin{aligned} \sigma _p^2 = {\left\{ \begin{array}{ll} \sigma _s^2,&{} \text {if ACS} \\ \sigma _s^2 + \sigma _{FB}^2 + \sigma _{ns}^2, &{} \text {if Facebook} \end{array}\right. } \\ \end{aligned}$$

Here, \(\sigma ^2_s\) refers to sampling error, and is assumed to be present in both ACS and Facebook data. For the ACS data, sampling errors are calculated based on guidelines from the US Census Bureau (2020). For Facebook data, the sampling error is calculated using the Normal approximation to the binomial distribution, i.e.

$$\begin{aligned} \sigma ^2_s = \frac{p_{x,t,s} \cdot (1-p_{x,t,s})}{N_{x,t,s}^{FB}} \end{aligned}$$

where \(N_{x,t,s}^{FB}\) is the total size of the Facebook population in subgroup \(x,t,s\). The sampling errors were then transformed to be on the log scale using the delta method.
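As a worked illustration of this calculation, the first-order delta method gives \(\text{Var}(\log p) \approx \text{Var}(p)/p^2\), which for a binomial proportion reduces to \((1-p)/(pN)\). The function name and the numbers below are hypothetical.

```python
def fb_log_sampling_variance(p, n):
    """Delta-method variance of log(p) for a binomial proportion p based on n users."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    var_p = p * (1.0 - p) / n  # sampling variance on the proportion scale
    return var_p / p ** 2      # first-order delta method for log(p)


# Example: 5% of 20,000 Facebook users in a subgroup are migrants of a given origin.
v = fb_log_sampling_variance(0.05, 20000)
```

For a fixed subgroup size, the variance on the log scale grows as the proportion shrinks, so small migrant groups carry relatively more sampling uncertainty.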

For the Facebook data there are two additional error terms. \(\sigma _{FB}^2\) refers to the error associated with our bias-adjustment model (Eq. 1) and is estimated within this model. This captures the fact that our adjustment model is imperfect and that extra variation remains. Additionally, we allow for a non-sampling error with \(\sigma _{ns}^2\), which aims at capturing additional uncertainty like variation in the way potential reach is estimated across waves.

For a given population size, the sampling error is going to be of similar size for ACS and Facebook data. As such, the error term associated with the Facebook data, which is the sum of three terms, will always be bigger than for ACS. In practice, this means that estimates from the model will follow (i.e. give more weight to) the ACS data.
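The variance structure of the data model, and why it favors the ACS, can be illustrated with a small calculation. The variance values are made up, and the inverse-variance weight is a stylized device that applies only if the two observations measured the same latent quantity; in the full model the sources inform different years and are linked through the time series structure.

```python
def observation_variance(source, sampling_var, fb_model_var=0.0, non_sampling_var=0.0):
    """Total variance of a log observed proportion, by data source."""
    if source == "ACS":
        return sampling_var
    if source == "Facebook":
        return sampling_var + fb_model_var + non_sampling_var
    raise ValueError(f"unknown source: {source}")


# Hypothetical variances on the log scale.
var_acs = observation_variance("ACS", 0.001)
var_fb = observation_variance("Facebook", 0.001, fb_model_var=0.01, non_sampling_var=0.004)

# Stylized inverse-variance weight on the ACS observation.
weight_acs = (1.0 / var_acs) / (1.0 / var_acs + 1.0 / var_fb)
```

With these illustrative numbers the ACS observation would receive about 94% of the weight, reflecting the two extra error terms attached to the Facebook data.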

Summary of Full Model

The full model is summarized below. Equation 10 is the data model. Equations 11–15 relate to the ACS time series model. Equation 18 relates to the Facebook regression model. Equations 16 and 17 allow the observation of the proportion of interest to come from a different source (Facebook or ACS), which has a different associated variance. Note that \(\rho _{x,t,s}\) is estimated on a yearly basis, but it is assumed that \(j\) waves of Facebook data are collected within any 1 year.

$$\begin{aligned} \log p_{x,t,s}\sim & {} N(\log \rho _{x,t,s}, \sigma ^2) \end{aligned}$$
(10)
$$\begin{aligned} \log \rho _{x,t,s}= & {} \beta _{t,s,1} Z_{x,1} + \beta _{t,s,2} Z_{x,2} + \varepsilon _{x,t,s} \end{aligned}$$
(11)
$$\begin{aligned} \beta _{t,s,1}\sim & {} N(\beta _{t-1,s,1}, \sigma _{\beta _1}^2) \end{aligned}$$
(12)
$$\begin{aligned} \beta _{t,s,2}\sim & {} N(\Phi _{t}, \sigma _{\beta }^2) \end{aligned}$$
(13)
$$\begin{aligned} \Phi _{t}\sim & {} N(\Phi _{t-1}, \sigma _{\Phi }^2) \end{aligned}$$
(14)
$$\begin{aligned} \varepsilon _{x,t,s}\sim & {} N(\rho _{x,s}\varepsilon _{x,t-1,s}, \sigma _{\varepsilon }^2) \end{aligned}$$
(15)
$$\begin{aligned} p_{x,t,s}= & {} {\left\{ \begin{array}{ll} p_{x,t,s}^{ACS}, &{} \text {if } 2001\le t \le 2016\\ p_{x,t,s,j}^*, &{} \text {if } t\ge 2017 \end{array}\right. } \end{aligned}$$
(16)
$$\begin{aligned} \sigma ^2= & {} {\left\{ \begin{array}{ll} \sigma _s^2,&{} \text {if ACS}\\ \sigma _s^2 + \sigma _{FB}^2 + \sigma _{ns}^2, &{} \text {if Facebook} \\ \end{array}\right. } \end{aligned}$$
(17)
$$\begin{aligned} p_{x,t,s,j}^*\sim & {} N ({\alpha _0} + {\alpha _1} \cdot p_{x,t,s,j}^{\text { Facebook}} + X{\Gamma }, \sigma ^2_{FB}) \end{aligned}$$
(18)

Priors

Weakly informative priors were placed on the coefficients in the Facebook bias-adjustment model, as well as the principal component coefficients in the initial periods:

$$\begin{aligned} \alpha _0\sim & {} N(0, 100)\\ \alpha _1\sim & {} N(0, 100)\\ \Gamma _0\sim & {} N(0, 100)\\ \beta _{1,s,1}\sim & {} N(0, 100)\\ \Phi _{1}\sim & {} N(0, 100) \end{aligned}$$

In addition, we put weakly informative half-Normal priors on the two standard deviation terms to be estimated:

$$\begin{aligned} \sigma _{FB}\sim & {} N_+(0, 1)\\ \sigma _{ns}\sim & {} N_+(0, 1). \end{aligned}$$

Computation

The model was fitted in a Bayesian framework using the statistical software R. Samples were taken from the posterior distributions of the parameters via a Markov Chain Monte Carlo (MCMC) algorithm. This was performed using JAGS software (Plummer et al. 2003). Standard diagnostic checks using trace plots and the Gelman and Rubin diagnostic were used to check convergence (Gelman et al. 2013).

Best estimates of all parameters of interest were taken to be the median of the relevant posterior samples. The 95% Bayesian credible intervals were calculated by finding the 2.5% and 97.5% quantiles of the posterior samples.
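The posterior summaries described here can be computed directly from the sampled values. In this sketch the draws are simulated from a Normal distribution as a stand-in for MCMC output, and the function name is hypothetical.

```python
import numpy as np


def summarize(samples):
    """Posterior median and 95% credible interval from a 1-D array of draws."""
    lower, median, upper = np.quantile(samples, [0.025, 0.5, 0.975])
    return {"median": float(median), "lower": float(lower), "upper": float(upper)}


rng = np.random.default_rng(2)
draws = rng.normal(-3.0, 0.1, size=4000)  # stand-in for posterior draws of log rho

summary_log = summarize(draws)            # summary on the log scale
summary_prop = summarize(np.exp(draws))   # summary on the proportion scale
```

Summarizing the exponentiated draws gives intervals for the proportion itself, which need not be symmetric around the median.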

All code and data are available on GitHub: https://github.com/MJAlexander/fb-migration-bayes.

Results

We illustrate the model on male migrants from three different countries: Mexico, India, and Germany. We chose these three migrant groups as they represent a range of different levels and trends over time, as illustrated by the trends in the ACS data shown in Fig. 3a and b.

Firstly, Mexican migrants make up a relatively large share of the overall population, but the proportion has generally declined since around 2007. The age distribution at the national level peaks in the 40–44 year old age group. Secondly, Indian migrants make up a moderate proportion of the total population, but this share is increasing over time. The age distribution peaks at younger ages (30–34), compared to Mexicans. Finally, German migrants make up a low and declining share of the population. In contrast to the other migrant groups, the age distribution of German migrants at the national level is relatively flat, increasing slightly across age.

Fig. 3 German, Indian and Mexican migrants

Bias Adjustment of Facebook Data

We firstly illustrate the results of the bias-adjustment step of the Facebook data. Figure 4 shows, for each US state and five-year age group where data are available, the proportion of migrants in each age group for the ACS data in 2016 (black dots), the un-adjusted Facebook data (gray dots), and the estimated bias-adjusted Facebook data (line and associated shaded area) for Mexican migrants. Similar plots for migrants from India and Germany are shown in "Appendix A". The interpretation is that if the bias-adjustment step is working reasonably well, the estimate line would be close to the black dots. In general, this appears to be the case. In the case of Mexico, bias-adjusted estimates have a root mean squared error (RMSE) of 0.01, when compared to the ACS data; this is reduced by half from the RMSE of the raw Facebook data.

For all three migrant groups, the raw Facebook data are generally lower than the ACS data, but the bias-adjustment model adjusts these values upwards. In general, across the three migrant groups and across states, the shapes of the age distributions in the Facebook and ACS data are similar, with more substantial under-representation in Facebook in the older age groups. These systematic differences mean that the model works well to adjust the raw Facebook data based on age and state effects.

Fig. 4 Bias adjustment of Facebook data for Mexican migrants

Nowcasts by Age Group and State

Now we move on to short-term projections by age and state. Figure 5 shows the estimated age distribution in 2008 (dark gray) and projected distribution in 2018 (light gray) for Mexican migrants. Similar plots for India and Germany can be found in "Appendix B".

For Mexico (Fig. 5), the relatively high proportions in the border states and on the West coast are apparent, with the highest proportions in California, Texas, Nevada and Arizona. Additionally, the age distribution of Mexican migrants is generally aging over time (shifting to the right), which is consistent with relatively constant stocks.

Fig. 5 Estimated and projected age distributions of Mexican migrants by state, 2008 and 2018

Projected Time Series

Figures 6, 7 and 8 zoom in on two states for each migrant group and show how the Facebook data are used to project forward the time series to the most recent 2 years (2017 and 2018). The full estimated and projected time series from 2001 to 2018 is shown. In the figures, each facet is a 5-year age group. The black crosses represent the ACS data; these data are broadly available from 2001 to 2016, although some observations are missing (if sample sizes in the ACS were too small to capture information about migrants in that particular state and age group). The medium gray dots represent the (adjusted) Facebook observations, which are available in years 2017 and 2018. The light gray line and associated shaded area are the model estimates and 95% uncertainty intervals.

Mexican males in California (Fig. 6a) represent by far the highest proportions of any of the migrant origin/state combinations considered. The proportion is as high as 0.25 in some age groups, for example 25–29 year olds in 2001 and 40–44 year olds in 2018. As a consequence, the sampling error around the ACS data for this migrant group is relatively small and the model estimates closely follow these data. For the most recent two years, where only Facebook data are available, note that the model estimates do not follow the data as closely and the uncertainty around the model estimates increases. This reflects the fact that there are more sources of error associated with the Facebook data.

In Georgia (Fig. 6b) the levels of Mexican migrants are around half as high as in California. Due to smaller sample sizes in Georgia, the standard errors around the ACS data are much larger, and as such the model estimates do not follow the data as closely. However, the trends for Mexican migrants in California and Georgia are broadly the same: decreases in the younger age groups, and increases in the older age groups, representing an aging stock of migrants.

For Indian males in California and Georgia (Fig. 7), the proportions are much lower than for the Mexican migrant population, peaking at around 3–4% of the population in the 30–39 year old age groups. The proportions are increasing over time, however, particularly in the 25–44 age bracket. Finally, for German male migrants in California and Georgia (Fig. 8), we see low and constant migrant proportions. The uncertainty around the ACS data is already relatively high, and so there is not so much of an increase in uncertainty in the final two years.

Fig. 6 Mexican male migrants by age group, California and Georgia, 2001–2018

Fig. 7 Indian male migrants by age group, California and Georgia, 2001–2018

Fig. 8 German male migrants by age group, California and Georgia, 2001–2018

Validation

We evaluated the performance of the Bayesian model compared to other reasonable forecasting alternatives. To do this, we ran the model on data from 2001 to 2016, and forecast migration stocks in 2017. We then compared these forecasts to the actual ACS data in 2017. We compared the accuracy of the Bayesian model forecast to forecasts produced by three other models:

  1. Three-year moving average of the ACS data. This is one of the simplest options available and does not require the Facebook data or any statistical modeling.

  2. Facebook data only. Estimates are based just on the available Facebook data in 2017, after it has been adjusted for biases.

  3. ACS time series model. Here, we ran the Bayesian hierarchical time series model described in the "Model" section above, but just using data from the ACS (no Facebook).
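The simplest of these alternatives, the three-year moving average, can be sketched in a few lines of Python. This is an illustrative reading rather than the authors' implementation: the function name, the choice to average only the years that are observed, and the example values are all assumptions (the ACS proportions shown are made up).

```python
def moving_average_forecast(series, target_year, window=3):
    """Forecast `target_year` as the mean of the preceding `window` years.

    `series` maps year -> proportion; missing years (absent or None) are
    skipped, which is an assumption of this sketch.
    """
    values = [series.get(y) for y in range(target_year - window, target_year)]
    observed = [v for v in values if v is not None]
    if not observed:
        return None  # no data in the window, so no forecast is produced
    return sum(observed) / len(observed)


# Made-up ACS proportions for a single state and age group.
acs = {2013: 0.110, 2014: 0.105, 2015: 0.102, 2016: 0.099}
forecast_2017 = moving_average_forecast(acs, 2017)
print(forecast_2017)
```

Here the 2017 forecast is the average of the 2014, 2015 and 2016 values.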

In order to assess model performance, we compare the root mean squared error (RMSE):

$$\begin{aligned} \mathrm{{RMSE}} = \sqrt{\frac{\sum _g \left( {\hat{p}}_{g, 2017} - p^{ACS}_{g, 2017}\right) ^2}{N}} \end{aligned}$$
(19)

where \({\hat{p}}_{g, 2017}\) is the estimated proportion of migrants in a particular group g, \(p^{ACS}_{g, 2017}\) is the equivalent proportion from the ACS, and N is the number of groups over which the sum is taken. Here, g can refer to any combination of age group, state and migrant origin.
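Equation 19 can be computed directly from two sets of proportions indexed by group. The Python sketch below is illustrative only: the function name `rmse`, the use of (state, age group) tuples as keys, and the numbers are assumptions, and only groups present in both inputs are compared.

```python
import math


def rmse(estimates, observed):
    """Root mean squared error over the groups present in both inputs."""
    groups = estimates.keys() & observed.keys()
    if not groups:
        raise ValueError("no groups in common")
    squared_errors = [(estimates[g] - observed[g]) ** 2 for g in groups]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up 2017 proportions for two (state, age group) combinations.
model_2017 = {("California", "30-34"): 0.20, ("Georgia", "30-34"): 0.10}
acs_2017 = {("California", "30-34"): 0.22, ("Georgia", "30-34"): 0.09}

error = rmse(model_2017, acs_2017)
print(round(error, 4))
```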

Table 1 shows the overall RMSE for the four models for Mexican, Indian and German migrants. The main result is that in each of the three migrant groups, the Bayesian model presented (which combines the ACS and bias-adjusted Facebook data and thus is referred to as the ‘combined model’), produces the lowest RMSE and thus the most accurate forecasts. The overall results also illustrate that the Bayesian hierarchical time series model produces substantially more accurate forecasts compared to a simple moving average or the bias-adjusted Facebook data alone, producing RMSEs that are up to an order of magnitude smaller. This gain in accuracy is much larger than the gain moving from ACS-only to the combined model, although there is still a gain in each case.

Table 1 Overall RMSE by model and migrant origin

Figure 9 illustrates the RMSE by age group and model type for each of the three migrant groups. Similar plots by state can be found in "Appendix C". For Mexico, there is generally an incremental decline in the RMSE moving from the Facebook-only model, to moving average, to ACS, to the combined model. For all but the Facebook-only model, the RMSE is highest in the 30–34 year old age group, which is also where the proportion of migrants is highest (see Fig. 6). For India (Fig. 9b), the RMSE is particularly high from the Facebook-only model. For Germany, the gain in accuracy in moving to the hierarchical time series set-up is most noticeable (that is, the improvement of the ‘ACS’ model over the Facebook-only or moving average models). This is most likely related to the fact that the proportions of German migrants are in general a lot lower than for Mexico or India, and so there are noticeable gains in pooling information across state, age and time. It should be noted that the combined model does not always have a lower RMSE for every age and state combination, but as shown in Table 1, the combined model performs better overall for every migrant group.

Fig. 9 RMSE by age group and model

To summarize, the validation exercise comparing the 1-year-out predictions from a range of models to the migrant proportions reported in the ACS illustrates both (i) the strength of the proposed Bayesian hierarchical time series model as a general framework, and (ii) the additional information obtained from including up-to-date Facebook data compared to just historical ACS data alone.

Discussion

As the size and frequency of migration movements continue to increase worldwide, new sources of data are being considered in order to better understand both historical and future migration trends. There is a growing body of work considering the feasibility of using social media data to achieve such goals, from platforms such as Facebook, Twitter and LinkedIn. While the granularity of social media use varies widely (from individual geo-tagged tweets, to aggregated advertising demographic data, as in this paper), the common challenges of using such data remain: firstly, to adequately adjust for known biases in the social media data, primarily as a consequence of the non-representativeness of the population of social media users; and secondly, to meaningfully combine information from social media data with information from more traditional data sources, such as surveys or censuses.

In this paper we presented a statistical framework to achieve these goals in the context of producing short-term projections of migrant stocks in the United States. The model includes a bias-adjustment process of the Facebook data, and a ‘principal components time series’ model, which allows for the projection of trends in stocks into the future, considering both Facebook and ACS data. The model allows for different types of uncertainty around the different data sources, and shares information on migration trends over time and pools across geographic space. Illustrative results were presented for three separate migrant groups: Mexicans, Indians, and Germans. The results of the validation exercise, comparing projections with 2017 ACS data, suggest that the proposed model improves prediction of short-term trends when compared to viable alternatives.

The validation exercise illustrated the substantial gain in accuracy achieved when moving to the Bayesian hierarchical time series model, regardless of whether or not the Facebook data were included. While the benefits of including the Facebook data in this particular case were relatively marginal, more generally the Facebook data have the advantage of being up-to-date and essentially available in real time. Thus, in a situation of a ‘shock’, such as a natural disaster or other event, the collection of Facebook data allows for the production of a timely estimate of the effects of that shock on migration. The combination of these data with past trends allows for the identification of surprising increases or decreases that are out of the expected bounds based on historical patterns.
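One simple way to operationalize the idea of values falling outside expected bounds is to compare each new observation with an uncertainty interval projected from historical data. The Python sketch below is a hypothetical illustration, not a procedure described in this paper: the function `flag_surprises`, the group labels and all numbers are assumptions.

```python
def flag_surprises(intervals, observations):
    """Return groups whose new observation falls outside its expected interval.

    `intervals` maps group -> (lower, upper) bounds projected from past trends;
    `observations` maps group -> newly observed proportion.
    """
    flagged = {}
    for group, value in observations.items():
        lower, upper = intervals[group]
        if value < lower:
            flagged[group] = "below"
        elif value > upper:
            flagged[group] = "above"
    return flagged


# Made-up projected intervals and new observations for two groups.
expected = {"CA 30-34": (0.18, 0.22), "GA 30-34": (0.08, 0.12)}
new_data = {"CA 30-34": 0.25, "GA 30-34": 0.10}

surprises = flag_surprises(expected, new_data)
print(surprises)
```

In practice any such comparison would also need to account for the error in the new observations themselves, which this sketch ignores.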

There are several limitations of the proposed model, which naturally lead into avenues for future work. Firstly, the bias-adjustment model assumes that the systematic bias in the Facebook data (by age and state) is constant over time. In reality, it is reasonable to believe that the biases in the Facebook data are changing over time, as the composition of the underlying Facebook population changes. The relationship between the age/location composition of the Facebook population and that of the actual population (as measured by the ACS) could be investigated in future work. Secondly, the bias-adjustment model also assumes that the non-sampling error is constant over Facebook’s ‘waves’ of data collection; that is, sources of error that include changes in how the reach population is calculated, or other computational reasons, are assumed to be constant. In practice, and in other work using these data (Alexander et al. 2019), we have observed that this is probably not the case, and this needs to be further investigated to better understand non-migration-related fluctuations over time. While we only consider two data sources in this paper, the general statistical framework could easily be extended to include information from other sources. Future work could also include taking advantage of the rich demographic and socioeconomic data available through the Facebook Advertising Platform, including information on education and occupation.

While this work focused on a model for the estimation of migrant stocks, the philosophy of combining social media data with more traditional data sources in one statistical framework—allowing for different sources of uncertainty—can be readily extended to model other demographic indicators. Indeed, the underlying time series model is itself an extension of principal component techniques that were previously used in demography to study mortality and fertility. Our framework shows the strength of combining more traditional demographic modeling techniques, survey data, and novel social media data to gain insights into underlying population processes.

In conclusion, we developed a statistical method for combining social media data with survey data to produce nowcasts of migrant stocks in the United States. We illustrated the method on three migrant groups in the United States with differing patterns. The statistical framework presented could be extended to include other data sources or nowcast other migration indicators in different contexts.