1 Introduction

Probability sampling is regarded as the gold standard in survey statistics for finite population inference. Fundamentally, probability samples are selected under known sampling designs and, therefore, are representative of the target population. Because the selection probability is known, the subsequent inference from a probability sample is often design-based and respects the way in which the data were collected; see Särndal et al. (2003), Cochran (1977), and Fuller (2009) for textbook discussions. Kalton (2019) provided a comprehensive overview of survey sampling research over the last 60 years.

However, many practical challenges arise in collecting and analyzing probability sample data (Baker et al. 2013; Keiding and Louis 2016). Large-scale survey programs continually face heightened demands coupled with reduced resources. Demands include requests for estimates for domains with small sample sizes and desires for more timely estimates. Simultaneously, program budget cuts force reductions in sample sizes, and decreasing response rates make non-response bias an important concern.

Data integration is an emerging area of research that provides a timely solution to the above challenges. The goal is multi-fold: (1) minimize the cost associated with surveys, (2) minimize the respondent burden, and (3) maximize the statistical information or, equivalently, the efficiency of survey estimation. Narrowly speaking, survey integration means combining separate probability samples into one survey instrument (Bycroft 2010). Broadly speaking, one can consider combining probability samples with non-probability samples. Recently, in survey statistics, non-probability data have become increasingly available for research purposes and provide unprecedented opportunities for new scientific discovery; however, they also present additional challenges, such as heterogeneity, selection bias, and high dimensionality. The past years have seen immense progress in theories, methods, and algorithms for surmounting important challenges arising from non-probability data analysis. This article provides a systematic review of data integration for combining probability samples, probability and non-probability samples, and probability and big data samples.

Section 2 establishes notation and reviews data integration methods for combining multiple probability samples. Existing methods for probability data integration can be categorized into two types depending on the level of information to be combined: a macro approach combining summary statistics from multiple surveys and a micro approach creating synthetic imputations.

Section 3 describes the motivation, challenges, and methods for integrating probability samples and emerging non-probability samples. We also draw connections between survey data integration and methods for combining randomized clinical trials and real-world data in biostatistics. We then discuss a wide range of integration methods, including calibration weighting, inverse probability weighting, mass imputation, and doubly robust methods.

We then consider data integration methods for combining probability and big non-probability samples. Depending on their roles in statistical inference, there are two types of big data: one with a large sample size (large n) and the other with rich covariates (large p). In the first type, the non-probability sample can be large in sample size, and how to leverage the rich information in the big data to improve finite population inference is an important research problem. In the second type, there are a large number of variables. There is a large literature on variable selection methods for prediction, but little work on variable selection for data integration that can successfully recognize the strengths and the limitations of each data source and utilize all information captured for finite population inference. Section 4 presents robust data integration and variable selection methods in this context.

Finally, Sect. 5 describes directions for future research on data integration, including sensitivity analysis to assess the robustness of study conclusions to unverifiable assumptions, hierarchical modeling, and some cautionary remarks.

2 Combining probability samples

2.1 Multiple probability samples and missingness patterns

Combining two or more independent survey probability samples is a problem frequently encountered in the practice of survey sampling. For simplicity of exposition, let \({\mathcal {U}}=\{1,\ldots ,N\}\) be the index set of N units for the finite population, with N being the known population size. Let \((x_{i}^{{\text{T}}} ,y_{i} )^{{\text{T}}}\) be the realized value of a vector of random variables \((X^{{\text{T}}} ,Y)^{{\text{T}}}\) for unit i, where X consists of auxiliary variables and Y is the study variable of interest. The parameter of interest is the finite population mean of Y, i.e., \(\mu _{y}=N^{-1}\sum _{i=1}^{N}Y_{i}\) throughout the article. Let \(I_{i}\) be the sample indicator, such that \(I_{i}=1\) indicates the selection of unit i into the sample and \(I_{i}=0\) otherwise. The probability \(\pi _{i}=P(I_{i}=1\mid i\in {\mathcal {U}})\) is called the first-order inclusion probability and is known by the sampling design. The design weight is \(d_{i}=\pi _{i}^{-1}\). The joint probability \(\pi _{ij}=P(I_{i}I_{j}=1\mid i,j\in {\mathcal {U}})\) is called the second-order inclusion probability and is often used for variance estimation of the design-weighted estimator. In particular, \(\pi _{ii}=\pi _{i}\) for all i. The sample size is \(n=\sum _{i=1}^{N}I_{i}.\)

The main advantage of probability sampling is to ensure design-based inference. For example, the Horvitz–Thompson (HT) estimator of the population mean of y, denoted by \(\mu _{y}\), is \({\widehat{\mu }}_{\mathrm {HT}}=N^{-1}\sum _{i:I_{i}=1}\pi _{i}^{-1}y_{i}\), and the design-variance estimator is:

$$\begin{aligned} {\widehat{V}}_{\mathrm {HT}}=nN^{-2}\sum _{i:I_{i}=1}\sum _{j:I_{j}=1}\frac{(\pi _{ij}-\pi _{i}\pi _{j})}{\pi _{ij}}\frac{y_{i}}{\pi _{i}}\frac{y_{j}}{\pi _{j}}. \end{aligned}$$
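For concreteness, the following minimal Python sketch evaluates the HT estimator and the variance estimator above; the array names (y, pi, pi_joint) and the toy simple-random-sampling setup are illustrative assumptions, not part of the text.

```python
import numpy as np

def ht_mean(y, pi, N):
    """Horvitz-Thompson estimator of the population mean."""
    return np.sum(y / pi) / N

def ht_var(y, pi, pi_joint, N):
    """Design-based variance estimator above (with the n / N^2 scaling of the display)."""
    n = len(y)
    z = y / pi                                     # pi-expanded values y_i / pi_i
    delta = (pi_joint - np.outer(pi, pi)) / pi_joint
    return n / N**2 * np.sum(delta * np.outer(z, z))

# toy check under simple random sampling without replacement
N, n = 1000, 100
rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, size=n)
pi = np.full(n, n / N)                             # first-order inclusion probabilities
pi_joint = np.full((n, n), n * (n - 1) / (N * (N - 1)))
np.fill_diagonal(pi_joint, n / N)                  # pi_ii = pi_i
print(ht_mean(y, pi, N))
print(ht_var(y, pi, pi_joint, N))                  # ~ (1 - n/N) * s_y^2, i.e. n times var of the mean
```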

We consider multiple sources of probability data. For multiple datasets, we use the subscript letter to indicate the respective sample; for example, we use \(d_{A,i}\) as the design weight of unit i in sample A.

Depending on the available information from multiple data sources, each sample has planned missingness by design. As illustrated in Table 1, the combined sample exhibits different missingness patterns: monotone and non-monotone. For monotone missingness, our framework covers two common types of studies. First, we have a large main dataset and then collect more information on important variables for a subset of units, e.g., using a two-phase sampling design (Neyman 1938; Cochran 1977; Wang et al. 2009). Consider the U.S. Census of Population and Housing as an example. The short form is administered to a \(100\%\) sample, for which basic demographic information is obtained. The long form is administered to roughly a \(16\%\) sample, for which social and economic information is obtained in addition to the demographic information. Deming and Stephan (1940) treated this setup as a classical two-phase sampling problem and used calibration weighting on the demographic variables to match the known population counts from the short form.

Second, we have a smaller and carefully designed validation dataset with rich covariates, and then link it to a larger main dataset with fewer covariates. The setup of two independent samples with common items is often called non-nested two-phase sampling. Consider the US Consumer Expenditure Survey as an example. Two independent samples were selected from the same finite population: a diary survey sample, referred to as sample A, and a face-to-face survey sample, referred to as sample B. In sample A, both the auxiliary information X and the outcome Y are observed, whereas in sample B only the common auxiliary information X is observed. Zieschang (1990) considered using sample weighting to estimate detailed expenditure and income items by combining sample A and sample B. Another example is the Canadian Survey of Employment, Payrolls, and Hours considered by Hidiroglou (2001). Sample A is a small sample from the Statistics Canada Business Register, in which the study variables Y, the number of hours worked by employees and summarized earnings, were observed. Sample B is a large sample drawn from Canadian Customs and Revenue Agency administrative data, in which the auxiliary variables X were observed.

Finally, we will consider combining two independent surveys with non-monotone missingness patterns. The statistical matching technique will be introduced in Sect. 2.3 as a general statistical tool for this setup.

Table 1 Missingness patterns in the combined samples: “\(\checkmark\)” means “is measured”

2.2 Two approaches for probability data integration

We classify probability data integration methods based on the level of information to be combined: a macro approach and a micro approach. In the macro approach, we obtain summary information, such as point and variance estimates, from multiple data sources and combine them to obtain a more efficient estimator of the parameter of interest, such as a population mean or total. In the micro approach, we create a single synthetic dataset that contains all available information from all data sources. The synthetic data can be used to estimate various types of parameters.

2.2.1 Macro approach: generalized least-squares (GLS) estimation

Renssen and Nieuwenbroek (1997), Hidiroglou (2001), Merkouris (2004), Wu (2004), Ybarra and Lohr (2008), and Merkouris (2010) considered the problem of combining data from two independent probability samples to estimate totals at the population and domain levels. Merkouris (2004) and Merkouris (2010) provided a rigorous treatment of the survey integration through the generalized method of moments.

We focus on the monotone missingness pattern; the same discussion applies to the other patterns. From each probability sample, we obtain different estimators for the means of the common items. The GLS approach combines these estimates into an optimal estimator. Let \({\widehat{\mu }}_{x,A}\) and \({\widehat{\mu }}_{x,B}\) be unbiased estimators of \(\mu _{x}\) from sample A and sample B, respectively. Let \({\widehat{\mu }}_{B}\) be an unbiased estimator of \(\mu _{y}\) from sample B.

To combine the multiple estimates, we can build a linear model of three estimates with two parameters as follows:

$$\begin{aligned} \left( \begin{array}{l} {\widehat{\mu }}_{x,A}\\ {\widehat{\mu }}_{x,B}\\ {\widehat{\mu }}_{B} \end{array}\right) =\left( \begin{array}{ll} 1 &{}\quad 0\\ 1 &{}\quad 0\\ 0 &{}\quad 1 \end{array}\right) \left( \begin{array}{l} \mu _{x}\\ \mu _{y} \end{array}\right) +\left( \begin{array}{l} e_{1}\\ e_{2}\\ e_{3} \end{array}\right) , \end{aligned}$$
(1)

where \((e_{1} ,e_{2} ,e_{3} )^{{\text{T}}}\) has mean \((0,0,0)^{{\text{T}}}\), variance–covariance:

$$\begin{aligned} V=\left( \begin{array}{lll} \mathrm {var}({\widehat{\mu }}_{x,A}) &{} \mathrm {cov}({\widehat{\mu }}_{x,A},{\widehat{\mu }}_{x,B}) &{} \mathrm {cov}({\widehat{\mu }}_{x,A},{\widehat{\mu }}_{B})\\ \mathrm {cov}({\widehat{\mu }}_{x,A},{\widehat{\mu }}_{x,B}) &{} \mathrm {var}({\widehat{\mu }}_{x,B}) &{} \mathrm {cov}({\widehat{\mu }}_{x,B},{\widehat{\mu }}_{B})\\ \mathrm {cov}({\widehat{\mu }}_{x,A},{\widehat{\mu }}_{B}) &{} \mathrm {cov}({\widehat{\mu }}_{x,B},{\widehat{\mu }}_{B}) &{} \mathrm {var}({\widehat{\mu }}_{B}) \end{array}\right) , \end{aligned}$$

and \(\mathrm {var}(\cdot )\) and \(\mathrm {cov}(\cdot )\) are the variance and covariance induced by the sampling probability. If the two samples are independently obtained from the same population, we have \(\mathrm {cov}({\widehat{\mu }}_{x,A},{\widehat{\mu }}_{x,B})=0\) and \(\mathrm {cov}({\widehat{\mu }}_{x,A},{\widehat{\mu }}_{B})=0\).

Based on model (1), treat \(({\widehat{\mu }}_{x,A},{\widehat{\mu }}_{x,B},{\widehat{\mu }}_{B})\) as observations and define a sum of squared error term:

$$\begin{aligned} Q(\mu _{x},\mu _{y})=\left( \begin{array}{l} {\widehat{\mu }}_{x,A}-\mu _{x}\\ {\widehat{\mu }}_{x,B}-\mu _{x}\\ {\widehat{\mu }}_{B}-\mu _{y} \end{array}\right)^{{\text{T}}}V^{-1}\left( \begin{array}{l} {\widehat{\mu }}_{x,A}-\mu _{x}\\ {\widehat{\mu }}_{x,B}-\mu _{x}\\ {\widehat{\mu }}_{B}-\mu _{y} \end{array}\right) . \end{aligned}$$

The optimal estimator of \((\mu _{x},\mu _{y})\) that minimizes \(Q(\mu _{x},\mu _{y})\) is:

$$\begin{aligned} {\widehat{\mu }}_{x}^{*}=\alpha ^{*}{\widehat{\mu }}_{x,A}+(1-\alpha ^{*}){\widehat{\mu }}_{x,B} \end{aligned}$$
(2)

and

$$\begin{aligned} {\widehat{\mu }}_{\mathrm {GLS}}={\widehat{\mu }}_{B}+\left( \begin{array}{l} {\widehat{\mathrm {cov}}}({\widehat{\mu }}_{x,A},{\widehat{\mu }}_{B})\\ {\widehat{\mathrm {cov}}}({\widehat{\mu }}_{x,B},{\widehat{\mu }}_{B}) \end{array}\right)^{{\text{T}}}\left( \begin{array}{ll} {\widehat{\mathrm {var}}}({\widehat{\mu }}_{x,A}) &{} {\widehat{\mathrm {cov}}}({\widehat{\mu }}_{x,A},{\widehat{\mu }}_{x,B})\\ {\widehat{\mathrm {cov}}}({\widehat{\mu }}_{x,A},{\widehat{\mu }}_{x,B}) &{} {\widehat{\mathrm {var}}}({\widehat{\mu }}_{x,B}) \end{array}\right) ^{-1}\left( \begin{array}{l} {\widehat{\mu }}_{x}^{*}-{\widehat{\mu }}_{x,A}\\ {\widehat{\mu }}_{x}^{*}-{\widehat{\mu }}_{x,B} \end{array}\right) , \end{aligned}$$
(3)

where

$$\begin{aligned} \alpha ^{*}=\frac{{\widehat{\mathrm {var}}}({\widehat{\mu }}_{x,B})-{\widehat{\mathrm {cov}}}({\widehat{\mu }}_{x,A},{\widehat{\mu }}_{x,B})}{{\widehat{\mathrm {var}}}({\widehat{\mu }}_{x,A})+{\widehat{\mathrm {var}}}({\widehat{\mu }}_{x,B})-2{\widehat{\mathrm {cov}}}({\widehat{\mu }}_{x,A},{\widehat{\mu }}_{x,B})}. \end{aligned}$$

To see the efficiency gain of \({\widehat{\mu }}_{\mathrm {GLS}}\) over \({\widehat{\mu }}_{B}\), using (2) and (3), we can express:

$$\begin{aligned} {\widehat{\mu }}_{\mathrm {GLS}}={\widehat{\mu }}_{B}-{\widehat{\mathrm {cov}}}({\widehat{\mu }}_{B},{\widehat{\mu }}_{x,B}-{\widehat{\mu }}_{x,A})\left\{ {\widehat{\mathrm {var}}}({\widehat{\mu }}_{x,B}-{\widehat{\mu }}_{x,A})\right\} ^{-1}({\widehat{\mu }}_{x,B}-{\widehat{\mu }}_{x,A}). \end{aligned}$$

The variance of \({\widehat{\mu }}_{\mathrm {GLS}}\) is:

$$\begin{aligned} \mathrm {var}({\widehat{\mu }}_{B})-\mathrm {cov}({\widehat{\mu }}_{B},{\widehat{\mu }}_{x,B}-{\widehat{\mu }}_{x,A})\left\{ \mathrm {var}({\widehat{\mu }}_{x,B}-{\widehat{\mu }}_{x,A})\right\} ^{-1}\mathrm {cov}({\widehat{\mu }}_{B},{\widehat{\mu }}_{x,B}-{\widehat{\mu }}_{x,A}), \end{aligned}$$

which is not larger than \(\mathrm {var}({\widehat{\mu }}_{B})\). The GLS estimator for non-monotone missingness can be constructed similarly. See Fuller and Breidt (1999) for an application to the National Resources Inventory.
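The GLS combination in (2) and (3) reduces to simple matrix algebra once the point estimates and their estimated variances and covariances are available. A minimal Python sketch for a scalar auxiliary variable, with hypothetical inputs, is given below; for two independent samples, the covariances involving sample A and sample B are set to zero.

```python
import numpy as np

def gls_combine(mu_xA, mu_xB, mu_B,
                var_xA, var_xB, cov_xA_xB=0.0, cov_xA_B=0.0, cov_xB_B=0.0):
    """GLS estimator (3) of mu_y for a scalar auxiliary variable x.
    For two independent samples, cov_xA_xB = cov_xA_B = 0."""
    # optimal combination (2) of the two estimators of mu_x
    alpha = (var_xB - cov_xA_xB) / (var_xA + var_xB - 2.0 * cov_xA_xB)
    mu_x_star = alpha * mu_xA + (1.0 - alpha) * mu_xB
    # GLS adjustment (3)
    c = np.array([cov_xA_B, cov_xB_B])
    V = np.array([[var_xA, cov_xA_xB],
                  [cov_xA_xB, var_xB]])
    d = np.array([mu_x_star - mu_xA, mu_x_star - mu_xB])
    return mu_B + c @ np.linalg.solve(V, d)
```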

2.2.2 Micro approach: mass imputation

Mass imputation (also called synthetic data imputation) is a technique of creating imputed values for items not observed in the current survey by incorporating information from other surveys. Breidt et al. (1996) discussed mass imputation for two-phase sampling. Rivers (2007) proposed a mass imputation approach using nearest-neighbor imputation, but the theory is not fully developed. Schenker and Raghunathan (2007) reported several applications of synthetic data imputation, using a model-based method to estimate totals and other parameters associated with variables not observed in a larger survey but observed in a much smaller survey. Legg and Fuller (2009) and Kim and Rao (2012) developed synthetic imputation approaches to combining two surveys. Chipperfield et al. (2012) discussed composite estimation when one of the surveys is mass imputed. Bethlehem (2016) discussed practical issues in sample matching for mass imputation.

The primary goal is to create a single synthetic dataset of proxy values \({\widehat{y}}_{i}\) for the unobserved \(y_{i}\) in sample B and then use the proxy data together with the associated design weights of sample B to produce projection estimators of the population mean \(\mu _{y}\). This is particularly useful when sample B is a large-scale survey and item Y is very expensive to measure. The proxy values \({\widehat{y}}_{i}\) are generated by first fitting a working model relating Y to X, \(E(Y\mid X)=m(X;\beta _{0})\), based on the data \(\{(x_{i},y_{i}):i\in A\}\) from sample A. Then, the synthetic values of Y can be created by \({\widehat{y}}_{i}=m(x_{i};{\widehat{\beta }})\) for \(i\in B\). Thus, sample A is used as a training sample for predicting Y in sample B. The mass imputation estimator of \(\mu _{y}\) is \({\widehat{\mu }}_{\mathrm {I}}=N^{-1}\sum _{i\in B}d_{B,i}{\widehat{y}}_{i}\). Kim and Rao (2012) showed that \({\widehat{\mu }}_{\mathrm {I}}\) is asymptotically design-unbiased if \({\widehat{\beta }}\) satisfies:

$$\begin{aligned} \sum _{i\in A}d_{A,i}\{y_{i}-m(x_{i};{\widehat{\beta }})\}=0. \end{aligned}$$
(4)

With (4):

$$\begin{aligned} {\widehat{\mu }}_{\mathrm {I}}=\, & {} N^{-1}\sum _{i\in B}d_{B,i}{\widehat{y}}_{i}+N^{-1}\sum _{i\in A}d_{A,i}(y_{i}-{\widehat{y}}_{i})\\=\, & {} N^{-1}\sum _{i\in B}d_{B,i}m(x_{i};\beta _{0})+N^{-1}\sum _{i\in A}d_{A,i}\{y_{i}-m(x_{i};\beta _{0})\}={\widehat{P}}_{B}+{\widehat{Q}}_{A}, \end{aligned}$$

and

$$\begin{aligned} \mathrm {var}({\widehat{\mu }}_{\mathrm {I}})=\mathrm {var}({\widehat{P}}_{B})+\mathrm {var}({\widehat{Q}}_{A}). \end{aligned}$$

The asymptotic unbiasedness holds regardless of whether the regression model is correctly specified or not. However, a good regression model will reduce the variance of \({\widehat{\mu }}_{\mathrm {I}}\). For variance estimation, either linearization or replication-based variance estimation (Kim and Rao 2012) can be used.
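As a concrete illustration, the sketch below assumes a linear working model with an intercept, so that the design-weighted least-squares fit on sample A automatically satisfies constraint (4); the input arrays are hypothetical.

```python
import numpy as np

def mass_impute_mean(xA, yA, dA, xB, dB, N):
    """Kim-Rao mass imputation estimator of mu_y.
    Sample A observes (x, y) with design weights dA (the training sample);
    sample B observes x only with design weights dB.  A linear working model
    with an intercept is fitted by design-weighted least squares, so the
    residual constraint (4) holds by the weighted normal equations."""
    XA = np.column_stack([np.ones(len(yA)), xA])
    beta = np.linalg.solve(XA.T @ (XA * dA[:, None]), XA.T @ (dA * yA))
    XB = np.column_stack([np.ones(xB.shape[0]), xB])
    y_hat = XB @ beta                      # synthetic values for sample B
    return np.sum(dB * y_hat) / N
```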

2.3 Mass imputation with non-monotone missingness

For non-monotone missingness, the mass imputation method of Kim and Rao (2012) is not directly applicable, as the samples with partial observations may contain additional information for parameter estimation. Often, one can consider a joint model of all variables and use the EM algorithm to estimate the model parameters. The joint model then yields the conditional distribution of the missing variables given the observed values, which is used for imputation.

For illustration, consider the non-monotone missingness I structure in Table 1. The goal is to develop mass imputation for both \(Y_{2}\) in sample B and \(Y_{1}\) in sample C. It is tempting to specify the conditional distribution of \(Y_{2}\) given \((X,Y_{1})\) to impute \(Y_{2}\) in sample B and the conditional distribution of \(Y_{1}\) given \((X,Y_{2})\) to impute \(Y_{1}\) in sample C. However, this approach may result in model incompatibility. That is, there may not exist a joint model of \((Y_{1},Y_{2})\) given X that leads to the corresponding conditional distributions. To avoid model incompatibility, we use a joint model for \((Y_{1},Y_{2})\) given X for prediction through specifying the sequential conditional distribution:

$$\begin{aligned} f(Y_{1},Y_{2}\mid X;\theta )=f_{1}(Y_{1}\mid X;\theta _{1})f_{2}(Y_{2}\mid X,Y_{1};\theta _{2}), \end{aligned}$$
(5)

where \(\theta =(\theta _{1}^{{\text{T}}},\theta _{2}^{{\text{T}}})^{{\text{T}}}\), \(\theta _{1}\), and \(\theta _{2}\) are unknown parameters.

For parameter estimation, it suffices to use observations in sample A; however, this approach ignores the partial information in sample B and sample C and, therefore, is not efficient. Let the joint set of sampling indexes be \(S=A\cup B\cup C\). Assuming no overlap between the samples, we define:

$$\begin{aligned} \pi _{S,i}=P(i\in S\mid i\in {\mathcal {U}})=\left\{ \begin{array}{ll} \pi _{A,i} &{}\quad \text{ if } i\in A\\ \pi _{B,i} &{}\quad \text{ if } i\in B\\ \pi _{C,i} &{}\quad \text{ if } i\in C, \end{array}\right. \end{aligned}$$

and let \(d_{i}\) be the design weight for unit \(i\in S\) without specifying which sample it belongs to. That is, \(d_{i}=d_{A,i}\) if \(i\in A\). To incorporate all available information, the EM algorithm can be used as follows.

E-step:

Let \(\theta ^{(t)}\) be the parameter estimate at iteration t. Compute the conditional expectation of the pseudo-log-likelihood functions:

$$\begin{aligned} Q_{1}(\theta _{1}\mid \theta ^{(t)})= & {} \sum _{i\in S}d_{i}E\left\{ \log f_{1}(y_{1i}\mid x_{i};\theta _{1})\mid x_{i},y_{i,\mathrm {obs}};\theta ^{(t)}\right\} \\ Q_{2}(\theta _{2}\mid \theta ^{(t)})= & {} \sum _{i\in S}d_{i}E\left\{ \log f_{2}(y_{2i}\mid x_{i},y_{1i};\theta _{2})\mid x_{i},y_{i,\mathrm {obs}};\theta ^{(t)}\right\} , \end{aligned}$$

where \(y_{i,\mathrm {obs}}\) is the observed part of \((y_{1i},y_{2i})\).

M-step:

Update the parameter \(\theta\) by maximizing \(Q_{1}(\theta _{1}\mid \theta ^{(t)})\) and \(Q_{2}(\theta _{2}\mid \theta ^{(t)})\) with respect to \(\theta _{1}\) and \(\theta _{2}\).

The E-step and M-step can be iteratively computed until convergence, leading to the pseudo maximum likelihood estimator \({\widehat{\theta }}\).
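The E- and M-steps above can be implemented with a Monte Carlo approximation to the E-step. The sketch below is a minimal illustration that assumes a single covariate, Gaussian working models for \(f_{1}\) and \(f_{2}\) (so that the conditional distributions needed in the E-step, including \(f(Y_{1}\mid X,Y_{2})\), are Gaussian), and hypothetical input arrays; the number of imputations and iterations are arbitrary choices, and this is a simplified illustration of the idea rather than the exact algorithm of any specific paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def wls(X, y, w):
    """Weighted least squares: returns (coefficients, weighted residual variance)."""
    Xw = X * w[:, None]
    coef = np.linalg.solve(X.T @ Xw, Xw.T @ y)
    resid = y - X @ coef
    return coef, np.sum(w * resid**2) / np.sum(w)

def mc_em(xA, y1A, y2A, dA, xB, y1B, dB, xC, y2C, dC, M=50, n_iter=30):
    """Monte Carlo EM sketch for the sequential model (5) with Gaussian working
    models  Y1 | X ~ N(a0 + a1*X, s1)  and  Y2 | X, Y1 ~ N(b0 + b1*X + b2*Y1, s2),
    where s1 and s2 denote variances.  Sample A observes (X, Y1, Y2),
    sample B observes (X, Y1), and sample C observes (X, Y2)."""
    # initialize theta from the fully observed sample A
    a, s1 = wls(np.column_stack([np.ones(len(xA)), xA]), y1A, dA)
    b, s2 = wls(np.column_stack([np.ones(len(xA)), xA, y1A]), y2A, dA)
    for _ in range(n_iter):
        # E-step (Monte Carlo): M imputations per unit with a missing item
        # sample B: draw Y2 | X, Y1 from f2
        mB = b[0] + b[1] * xB + b[2] * y1B
        y2B = mB[:, None] + np.sqrt(s2) * rng.standard_normal((len(xB), M))
        # sample C: draw Y1 | X, Y2, which is Gaussian under the Gaussian working models
        prec = 1.0 / s1 + b[2] ** 2 / s2
        mC = ((a[0] + a[1] * xC) / s1
              + b[2] * (y2C - b[0] - b[1] * xC) / s2) / prec
        y1C = mC[:, None] + np.sqrt(1.0 / prec) * rng.standard_normal((len(xC), M))
        # M-step: design-weighted fits; each imputed row carries weight d_i / M
        x1 = np.concatenate([xA, xB, np.repeat(xC, M)])
        y1s = np.concatenate([y1A, y1B, y1C.ravel()])
        w1 = np.concatenate([dA, dB, np.repeat(dC, M) / M])
        a, s1 = wls(np.column_stack([np.ones(len(x1)), x1]), y1s, w1)
        x2 = np.concatenate([xA, np.repeat(xB, M), np.repeat(xC, M)])
        y1r = np.concatenate([y1A, np.repeat(y1B, M), y1C.ravel()])
        y2s = np.concatenate([y2A, y2B.ravel(), np.repeat(y2C, M)])
        w2 = np.concatenate([dA, np.repeat(dB, M) / M, np.repeat(dC, M) / M])
        b, s2 = wls(np.column_stack([np.ones(len(x2)), x2, y1r]), y2s, w2)
    return a, s1, b, s2
```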

Given \({\widehat{\theta }}\), mass imputation can be done for both \(Y_{2}\) in sample B and \(Y_{1}\) in sample C. The imputation model for \(Y_{2}\) in sample B is \(f_{2}(Y_{2}\mid X,Y_{1};{\widehat{\theta }}_{2})\). Also, the imputation model for \(Y_{1}\) in sample C is:

$$\begin{aligned} f(Y_{1}\mid X,Y_{2};{\widehat{\theta }})=\frac{f_{1}(Y_{1}\mid X;{\widehat{\theta }}_{1})f_{2}(Y_{2}\mid X,Y_{1};{\widehat{\theta }}_{2})}{\int f_{1}(Y_{1}\mid X;{\widehat{\theta }}_{1})f_{2}(Y_{2}\mid X,Y_{1};{\widehat{\theta }}_{2})\mathrm {d}Y_{1}}. \end{aligned}$$
(6)

To generate imputed values from (6), one may use Markov Chain Monte Carlo methods or the parametric fractional imputation of Kim (2011).

We now consider the non-monotone missingness II structure in Table 1. Sample A and sample B are probability samples selected from the same finite population. In sample A, we observe \((X,Y_{1})\), and in sample B, we observe \((X,Y_{2})\). The question of interest is the associational relationship between \(Y_{1}\) and \((X,Y_{2})\). If \((X,Y_{1},Y_{2})\) were jointly observed, one could fit a simple regression model of \(Y_{1}\) on \((X,Y_{2}).\) However, in the available data, \(Y_{1}\) and \(Y_{2}\) are never observed simultaneously.

This problem fits into the statistical matching framework (D’Orazio et al. 2006). In statistical matching, the goal is to create \(Y_{1}\) for each unit in sample B by finding a “statistical twin” from sample A. Typically, one assumes the conditional independence assumption that \(Y_{1}\) and \(Y_{2}\) are conditionally independent given X, or equivalently:

$$\begin{aligned} f(Y_{1}\mid X,Y_{2})=f(Y_{1}\mid X). \end{aligned}$$
(7)

Then, the “statistical twin” is solely determined by “how close” they are in terms of X’s. However, in a regression model of \(Y_{1}\) on \((X,Y_{2})\), (7) sets the regression coefficient associated with \(Y_{2}\) to be zero a priori, which is contrary to the study question of interest.

For joint modeling of \((X,Y_{1},Y_{2})\) without assuming (7), identification is an important issue. Consider the following joint model of \((Y_{1},Y_{2})\) given X:

$$\begin{aligned} Y_{1}= & {} \alpha _{0}+\alpha _{1}X+e_{1},\end{aligned}$$
(8)
$$\begin{aligned} Y_{2}= & {} \beta _{0}+\beta _{1}X+\beta _{2}Y_{1}+e_{2}, \end{aligned}$$
(9)

where \(\mathrm {cov}(e_{1},e_{2})=0\). Because \((X,Y_{1})\) is observed in sample A, \((\alpha _{0},\alpha _{1})\) is identifiable. Because \((X,Y_{2})\) is observed in sample B, \(f(Y_{2}\mid X)\) is identifiable.

Coupling (8) and (9) leads to:

$$\begin{aligned} Y_{2}=(\beta _{0}+\alpha _{0}\beta _{2})+(\beta _{1}+\alpha _{1}\beta _{2})X+\beta _{2}e_{1}+e_{2}. \end{aligned}$$

Thus, only \(\beta _{0}+\alpha _{0}\beta _{2}\) and \(\beta _{1}+\alpha _{1}\beta _{2}\) are identifiable, and \((\beta _{0},\beta _{1},\beta _{2})\) is not.

In general, non-linear relationships can help achieve identification. For example, suppose that the relationship between X and \(Y_{1}\) in (8) is instead quadratic:

$$\begin{aligned} Y_{1}= & {} \alpha _{0}+\alpha _{1}X+\alpha _{2}X^{2}+e_{1}. \end{aligned}$$
(10)

Again, \((\alpha _{0},\alpha _{1},\alpha _{2})\) is identifiable from sample A. Coupling (9) and (10) leads to:

$$\begin{aligned} Y_{2}=(\beta _{0}+\alpha _{0}\beta _{2})+(\beta _{1}+\alpha _{1}\beta _{2})X+(\alpha _{2}\beta _{2})X^{2}+\beta _{2}e_{1}+e_{2}. \end{aligned}$$

Thus, \(\beta _{0}+\alpha _{0}\beta _{2}\), \(\beta _{1}+\alpha _{1}\beta _{2}\) and \(\alpha _{2}\beta _{2}\) are identifiable from sample B. As long as \(\alpha _{2}\ne 0\), \((\beta _{0},\beta _{1},\beta _{2})\) is then identifiable. For an identifiable model, parameter estimation can be implemented using either the EM algorithm or GLS.
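The identification argument can be illustrated with a small simulation. The sketch below generates a population from (9) and (10) with standard normal errors, estimates \((\alpha _{0},\alpha _{1},\alpha _{2})\) from sample A and the reduced-form coefficients from sample B, and then inverts the relations \(\gamma _{2}=\alpha _{2}\beta _{2}\), \(\gamma _{1}=\beta _{1}+\alpha _{1}\beta _{2}\), \(\gamma _{0}=\beta _{0}+\alpha _{0}\beta _{2}\); the population size, coefficients, and sample sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
x = rng.normal(size=N)
alpha = np.array([1.0, 0.5, 0.3])     # (alpha0, alpha1, alpha2), alpha2 != 0
beta = np.array([2.0, -1.0, 0.8])     # (beta0, beta1, beta2) to be recovered
y1 = alpha[0] + alpha[1] * x + alpha[2] * x**2 + rng.normal(size=N)   # model (10)
y2 = beta[0] + beta[1] * x + beta[2] * y1 + rng.normal(size=N)        # model (9)

A = rng.choice(N, 20_000, replace=False)    # sample A observes (x, y1)
B = rng.choice(N, 20_000, replace=False)    # sample B observes (x, y2)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

# (alpha0, alpha1, alpha2) estimated from sample A under (10)
a_hat = ols(np.column_stack([np.ones(len(A)), x[A], x[A]**2]), y1[A])
# reduced-form coefficients (gamma0, gamma1, gamma2) estimated from sample B
g_hat = ols(np.column_stack([np.ones(len(B)), x[B], x[B]**2]), y2[B])
# invert gamma2 = alpha2*beta2, gamma1 = beta1 + alpha1*beta2, gamma0 = beta0 + alpha0*beta2
b2 = g_hat[2] / a_hat[2]
b1 = g_hat[1] - a_hat[1] * b2
b0 = g_hat[0] - a_hat[0] * b2
print(b0, b1, b2)   # approximately recovers (2.0, -1.0, 0.8)
```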

Other assumptions can be invoked to achieve model identification. Kim et al. (2016) used an instrumental variable assumption for model identification and developed fractional imputation methods for statistical matching. Park et al. (2016) presented an application of the statistical matching technique using fractional imputation in the context of handling mixed-mode surveys. Park et al. (2017) applied the method to combine two surveys with measurement errors.

3 Combining probability and non-probability samples

3.1 Combining a probability sample with a non-probability sample

Statistical analysis of non-probability survey samples faces many challenges, as documented by Baker et al. (2013). Non-probability samples have unknown selection/inclusion mechanisms, are typically biased, and do not represent the target population. A popular framework for dealing with biased non-probability samples is to assume that auxiliary variable information on the same population is available from an existing probability survey sample. This framework was first used by Rivers (2007) and was followed by a number of other authors, including Vavreck and Rivers (2008), Lee and Valliant (2009), Valliant and Dever (2011), Elliott and Valliant (2017), and Chen et al. (2018), among others. Combining the up-to-date information from a non-probability sample and auxiliary information from a probability sample can be viewed as data integration, which is an emerging area of research in survey sampling (Lohr and Raghunathan 2017).

Data integration for finite population inference is similar to the problem of combining randomized experiments and non-randomized real-world evidence studies for causal inference about treatment effects (Keiding and Louis 2016). In a randomized clinical trial, the treatment assignment mechanism is known and, therefore, treatment effect evaluation based on the trial is unconfounded. However, due to restrictive inclusion and exclusion criteria, the trial sample may be narrowly defined and cannot represent the real-world patient population. On the other hand, because of its data collection mechanism, a real-world evidence study is often representative of the target population. Combining trial and real-world evidence studies can achieve more robust and efficient inference about treatment effects for a target patient population. Table 2 draws a parallel comparison of data sources between data integration in survey sampling and that in treatment effect evaluation.

Table 2 Data integration in survey sampling and biostatistics

Survey statisticians and biostatisticians have provided different methods for combining information from multiple data sources. Lohr and Raghunathan (2017) and Rao (2020) provided comprehensive reviews of statistical methods for finite population inference. In biostatistics, meta-analysis has been a long-standing method for synthesizing evidence from multiple trials and observational studies. Meta-analysis combines aggregate information to accommodate heterogeneity in treatment effects estimated from trial and observational data; see Verde and Ohmann (2015) for an overview of different modeling techniques in meta-analysis. Existing methods for data integration of a probability sample and a non-probability sample can be categorized into three types as follows. The first type is the so-called propensity score adjustment (Rosenbaum and Rubin 1983). In this approach, the probability of a unit being selected into the non-probability sample, which is referred to as the propensity or sampling score, is modeled and estimated for all units in the non-probability sample. The subsequent adjustments, such as propensity score weighting or stratification, can then be used to adjust for selection biases; see, e.g., Lee and Valliant (2009), Elliott and Valliant (2017) and Chen et al. (2018). Stuart et al. (2011, 2015) and Buchanan et al. (2018) used propensity score weighting to generalize results from randomized trials to a target population. O’Muircheartaigh and Hedges (2014) proposed propensity score stratification for analyzing a non-randomized social experiment. One notable disadvantage of the propensity score methods is that they rely on an explicit propensity score model and are biased and highly variable if the model is mis-specified (Kang and Schafer 2007). The second type uses calibration weighting (Deville and Särndal 1992; Kott 2006). This technique calibrates auxiliary information in the non-probability sample with that in the probability sample, so that after calibration, the weighted distribution of the non-probability sample is similar to that of the target population. The third type is mass imputation, which imputes the missing values for all units in the probability sample. In the usual imputation for missing data analysis, the respondents in the sample constitute a training dataset for developing an imputation model. In mass imputation, an independent non-probability sample is used as a training dataset, and imputation is applied to all units in the probability sample; see, e.g., Breidt et al. (1996), Rivers (2007), Kim and Rao (2012), Chipperfield et al. (2012), Bethlehem (2016) and Yang and Kim (2018).

3.2 Setup and assumptions

Non-probability samples have become increasingly popular in survey statistics, but they may suffer from selection bias that limits the generalizability of results to the target population. We consider integrating a non-probability sample with a carefully designed probability sample that provides representative covariate information for the target population.

Let \(X\in {\mathbb {R}}^{p}\) be a vector of auxiliary variables (including an intercept) that are available from two data sources, and let \(Y\in {\mathbb {R}}\) be the study variable of interest. We consider combining a probability sample with X, referred to as sample A, and a non-probability sample with (X, Y), referred to as sample B, to estimate \(\mu _{y}\), the population mean of Y. We focus on the case where the study variable Y is observed in sample B only, while the auxiliary variables are commonly observed in both data sources. Although the big data source has a large sample size, the sampling mechanism is often unknown, and we cannot compute the first-order inclusion probability for Horvitz–Thompson estimation. Naive estimators that do not adjust for the sampling process are subject to selection biases, as illustrated in Table 3. On the other hand, although the probability sample with design weights represents the finite population, the study variable is not observed in it. The complementary features of probability samples and non-probability samples raise the question of whether it is possible to develop data integration methods that leverage the advantages of both sources.

Table 3 Illustration of the total error from the simple mean estimator of \({\bar{Y}}_{N}\) based on probability simple random sample and big non-probability sample

Unlike the setting in Sect. 2, the sampling mechanism of the non-probability sample B is unknown and, therefore, \(\mu _{y}\) is not identifiable in general without further assumptions.

Two datasets were considered from the 2005 Pew Research Centre (PRC) and the 2005 Behavioral Risk Factor Surveillance System (BRFSS). The goal of the PRC study was to evaluate the relationship between individuals and their communities (Chen et al. 2018; Kim et al. 2018). The 2005 PRC data are non-probability sample data provided by eight different vendors, consisting of \(n_{B}=9301\) subjects. Yang et al. (2019) focused on two study variables, a continuous \(Y_{1}\) (days had at least one drink last month) and a binary \(Y_{2}\) (an indicator of having voted in local elections). The 2005 BRFSS sample is a probability sample, which consists of \(n_{A}\) = 441,456 subjects with survey weights. This dataset does not have measurements on the study variables of interest; however, it contains a rich set of covariates in common with the PRC dataset. The covariate distributions of the PRC sample and the BRFSS sample differ considerably, e.g., in age, education (high school or less), financial status (no money to see doctors, own house), retirement rate, and health (smoking). Therefore, the PRC dataset is not representative of the target population, and naive analyses of the study variables are subject to selection biases.

Let \(f(Y\mid X)\) be the conditional distribution of Y given X in the superpopulation model \(\zeta\) that generates the finite population. We make the following assumption.

Assumption 1

(i) The sampling indicator \(I_{B}\) of sample B and the study variable Y are independent given X; i.e., \(P(I_{B}=1\mid X,Y)=P(I_{B}=1\mid X)\), referred to as the sampling score \(\pi _{B}(X)\); and (ii) \(\pi _{B}(X)>0\) for all X.

Assumptions 1 (i) and (ii) constitute the strong ignorability condition (Rosenbaum and Rubin 1983). This assumption holds if the set of covariates contains all predictors of the outcome that affect the possibility of being selected into sample B. This setup has previously been used by several authors; see, e.g., Rivers (2007) and Vavreck and Rivers (2008). Assumption 1 (i) states the ignorability of the selection mechanism for sample B conditional on the covariates. Under Assumption 1 (i), \(E(Y\mid X)=E(Y\mid X,I_{B}=1)\), denoted by m(X), can be estimated based on sample B. Assumption 1 (ii) implies that the support of X in sample B is the same as that in the finite population. Assumption 1 (ii) does not hold if certain units would never be included in the non-probability sample. The plausibility of this assumption can be easily checked by comparing the marginal distributions of the auxiliary variables in sample B with those in sample A.

Under the sampling ignorability assumption, there are two main approaches: (1) the weighting approach, which constructs weights for sample B to improve its representativeness; and (2) the imputation approach, which creates mass imputations for sample A using the observations in sample B. There is considerable interest in bridging the findings from a randomized clinical trial to the target population. This problem has been termed generalizability (Cole and Stuart 2010; Stuart et al. 2011; Hernan and VanderWeele 2011; Tipton 2013; O’Muircheartaigh and Hedges 2014; Stuart et al. 2015; Keiding and Louis 2016; Buchanan et al. 2018), external validity (Rothwell 2005), or transportability (Pearl and Bareinboim 2011; Rudolph and van der Laan 2017) in the statistics literature, and has connections to the covariate shift problem in machine learning (Sugiyama and Kawanabe 2012).

3.3 Propensity score weighting

Under Assumption 1 (i) and (ii), we can build a model for \(\pi _{B}(X)=P(I_{B}=1\mid X)\) and use it to adjust for the selection bias in sample B. In practice, the propensity score function \(\pi _{B}(X)\) is unknown and needs to be estimated from the data. Let \(\pi _{B}(X;\alpha )\) be the posited model for \(\pi _{B}(X)\), where \(\alpha\) is the unknown parameter. Several authors have proposed different estimation strategies. For example, \({\widehat{\alpha }}\) can be obtained by a weighted regression of \(I_{B,i}\) on \(x_{i}\) combining sample A and sample B (\(I_{B,i}=0\) for \(i\in A\) and \(I_{B,i}=1\) for \(i\in B\)), weighted by the design weights from sample A, which is valid if the size of sample B is relatively small (Valliant and Dever 2011). Chen et al. (2018) proposed estimating \(\alpha\) by solving:

$$\begin{aligned} {\widehat{S}}_{1}(\alpha )=\sum _{i\in B}x_{i}-\sum _{i\in A}d_{A,i}\pi _{B}(x_{i};\alpha )x_{i}=0, \end{aligned}$$
(11)

which is a sample version of the population estimating equation \(S(\alpha )=\sum _{i\in U}\left\{ I_{B,i}-\pi (x_{i};\alpha )\right\} x_{i}=0.\) Instead of using (11), one can also use:

$$\begin{aligned} {{\widehat{S}}_{2}(\alpha )}=\sum _{i\in B}\frac{1}{\pi _{B}(x_{i};\alpha )}x_{i}-\sum _{i\in A}d_{A,i}x_{i}=0, \end{aligned}$$

which is closely related to the calibration weighting approach for nonresponse adjustment.

Given \({\widehat{\alpha }}\), the inverse probability of sampling weighting estimator of \(\mu _{y}\) is:

$$\begin{aligned} {\widehat{\mu }}_{\mathrm {IPW}}={\widehat{\mu }}_{\mathrm {IPW}}({\widehat{\alpha }})=N^{-1}\sum _{i=1}^{N}\frac{I_{B,i}}{\pi _{B}(x_{i};{\widehat{\alpha }})}y_{i}. \end{aligned}$$
(12)

Variance estimation of \({\widehat{\mu }}_{\mathrm {IPW}}\) can be obtained by the standard M-estimation theory.
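As an illustration, the sketch below assumes a logistic sampling-score model \(\pi _{B}(x;\alpha )=\{1+\exp (-\alpha ^{{\text{T}}}x)\}^{-1}\), solves (11) numerically with a general-purpose root finder, and evaluates (12); the function and array names are hypothetical, and the covariate matrices are assumed to include an intercept column.

```python
import numpy as np
from scipy.optimize import root

def expit(t):
    return 1.0 / (1.0 + np.exp(-t))

def ipw_estimator(xA, dA, xB, yB, N):
    """IPW estimator (12) with a logistic sampling-score model
    pi_B(x; alpha) = expit(alpha'x); alpha solves the estimating equation (11).
    xA (probability sample) and xB (non-probability sample) include an
    intercept column; dA holds the design weights of sample A."""
    def S1(alpha):
        # sum over B of x, minus the design-weighted sum over A of pi_B(x; alpha) * x
        return xB.sum(axis=0) - (dA * expit(xA @ alpha)) @ xA
    alpha_hat = root(S1, np.zeros(xA.shape[1])).x
    w = 1.0 / expit(xB @ alpha_hat)        # inverse estimated sampling scores
    return np.sum(w * yB) / N
```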

One of the notable disadvantages of the propensity score methods is that they rely on an explicit propensity score model and are biased if the model is mis-specified (Kang and Schafer 2007). Moreover, if the estimated propensity score is close to zero, \({\widehat{\mu }}_{\mathrm {IPW}}\) will be highly unstable.

3.4 Calibration weighting

The second weighting strategy is calibration weighting, also known as benchmarking weighting (Deville and Särndal 1992; Kott 2006). This technique can be used to calibrate auxiliary information in the non-probability sample with that in the probability sample, so that, after calibration, the non-probability sample is similar to the target population.

Instead of estimating the propensity score model and inverting the propensity score to correct for the selection bias of the non-probability sample, the calibration strategy estimates the weights directly. Toward this end, we assign a weight \(\omega _{B,i}\) to each unit i in sample B, so that:

$$\begin{aligned} \sum _{i\in B}\omega _{B,i}x_{i}=\sum _{i\in A}d_{A,i}x_{i}. \end{aligned}$$
(13)

where \(\sum _{i\in A}d_{A,i}x_{i}\) is a design-weighted estimate of the population total of X from the probability sample. Constraint (13) is referred to as the covariate balancing constraint (Imai and Ratkovic 2014), and the weights \({\mathcal {Q}}_{B}=\{\omega _{B,i}:i\in B\}\) are the calibration weights. The balancing constraint calibrates the covariate distribution of the non-probability sample to the target population in terms of X. Instead of calibrating on each component of X, one can consider model-based calibration (McConville et al. 2017; Chen et al. 2018, 2019). In this approach, one posits a parametric model for \(E(Y\mid X)=m(X;\beta )\) and estimates the unknown parameter \(\beta\) based on sample B. The model-based calibration specifies the constraint for \({\mathcal {Q}}_{B}\) as:

$$\begin{aligned} \sum _{i\in B}\omega _{B,i}m(x_{i};{\widehat{\beta }})=\sum _{i\in A}d_{A,i}m(x_{i};{\widehat{\beta }}). \end{aligned}$$
(14)

We estimate \({\mathcal {Q}}_{B}\) by solving the following optimization problem:

$$\begin{aligned} \underset{{\mathcal {Q}}_{B}}{\text {min}}\left\{ L({\mathcal {Q}}_{B})=\sum _{i\in B}\omega _{B,i}\log \omega _{B,i}\right\} , \end{aligned}$$
(15)

subject to \(\omega _{B,i}\ge 0,\;\) for all \(i\in B\); \(\sum _{i\in B}\omega _{B,i}=N\), and the balancing constraint (13) or (14).

The objective function in (15) is the negative entropy of the calibration weights; thus, minimizing this criterion ensures that the empirical distribution of the calibration weights is not too far from uniform, which minimizes the variability due to heterogeneous weights. Other objective functions, such as \(L({\mathcal {Q}}_{B})=\sum _{i\in B}\omega _{B,i}^{2}\), can also be considered. This optimization problem can be solved by convex optimization with Lagrange multipliers. Introducing the Lagrange multiplier \(\lambda\) for constraint (13), the objective function becomes:

$$\begin{aligned} L(\lambda ,{\mathcal {Q}}_{B})=\sum _{i\in B}\omega _{B,i}\log \omega _{B,i}-\lambda^{{\text{T}}}\left\{ \sum _{i\in B}\omega _{B,i}x_{i}-\sum _{i\in A}d_{A,i}x_{i}\right\} . \end{aligned}$$
(16)

Thus, by minimizing (16), the estimated weights are:

$$\begin{aligned} \omega _{B,i}=\omega _{B}(x_{i};{\widehat{\lambda }})=\frac{N\exp \left( {\widehat{\lambda }}^{{\text{T}}}x_{i}\right) }{\sum _{i\in B}\exp \left( {\widehat{\lambda }}^{{\text{T}}}x_{i}\right) }, \end{aligned}$$

and \({\widehat{\lambda }}\) solves the equation:

$$\begin{aligned} U(\lambda )=\sum _{i\in B}\exp \left( \lambda ^{{\text{T}}}x_{i}\right) \left\{ x_{i}-{ \frac{1}{N} }\sum _{i\in A}d_{A,i}x_{i}\right\} =0, \end{aligned}$$
(17)

which is the dual problem to the optimization problem (15).

The calibration weighting estimator is:

$$\begin{aligned} {\widehat{\mu }}_{\mathrm {cal}}=\frac{1}{N}\sum _{i=1}^{N}\omega _{B,i}I_{B,i}y_{i}. \end{aligned}$$
(18)

Variance estimation of \({\widehat{\mu }}_{\mathrm {cal}}\) can be obtained by the standard M-estimation theory by treating \(\lambda\) as the nuisance parameter and (17) as the corresponding estimating equation.
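A minimal sketch of the entropy calibration weighting estimator is given below: it solves the dual equation (17) numerically, forms the weights obtained by minimizing (16), and returns estimator (18). The input arrays are hypothetical; xA and xB here hold the non-constant covariates, and the constraint \(\sum _{i\in B}\omega _{B,i}=N\) is enforced through the normalization of the weights. In practice, one may centre \(\lambda ^{{\text{T}}}x\) before exponentiating for numerical stability.

```python
import numpy as np
from scipy.optimize import root

def calibration_estimator(xA, dA, xB, yB, N):
    """Entropy calibration weighting: solves the dual equation (17) for lambda,
    forms the weights obtained from (16), and returns the estimator (18).
    xA and xB hold the (non-constant) covariates; the constraint
    sum_i omega_{B,i} = N is enforced by the normalization of the weights."""
    target = (dA @ xA) / N                 # N^{-1} sum_{i in A} d_i x_i
    def U(lam):
        g = np.exp(xB @ lam)               # exp(lambda' x_i), i in B
        return g @ (xB - target)           # equation (17)
    lam_hat = root(U, np.zeros(xB.shape[1])).x
    g = np.exp(xB @ lam_hat)
    w = N * g / g.sum()                    # calibration weights omega_{B,i}
    return np.sum(w * yB) / N              # estimator (18)
```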

The justification for \({\widehat{\mu }}_{\mathrm {cal}}\) subject to constraint (13) relies on the linearity of the outcome model, i.e., \(m(X)=X^{{\text{T}}}\beta ^{*}\) for some \(\beta ^{*}\), or the linearity of the inverse probability of sampling weight, i.e., \(\{\pi _{B}(X)\}^{-1}=X^{{\text{T}}}\alpha ^{*}\) for some \(\alpha ^{*}\) (Fuller 2009; Theorem 5.1). The linearity conditions are unlikely to hold for non-continuous variables. In these cases, \({\widehat{\mu }}_{\mathrm {cal}}\) may be biased. The justification for \({\widehat{\mu }}_{\mathrm {cal}}\) subject to constraint (14) relies on a correct specification of \(m(X;\beta )\) in the data integration problem.

Chan et al. (2016) generalized this idea further to develop a general calibration weighting method that satisfies the covariate balancing property with increasing dimensions of the control variables m(x). Zhao (2019) developed a unified approach to covariate balancing propensity score methods using tailored loss functions. Regularization techniques that add penalty terms to the loss function can be naturally incorporated into this framework, and machine learning methods, such as boosting, can be used. The covariate balancing condition, or calibration condition, in (13) can be relaxed. Zubizarreta (2015) relaxed the exact balancing constraints to some tolerance level. Wong et al. (2019) used the theory of reproducing kernel Hilbert spaces to develop a uniform approximate balance for covariate functions.

3.5 Mass imputation approach

The third type is mass imputation, where imputed values are created for all units in the probability sample. In the usual imputation for missing data analysis, the respondents in the sample provide a training dataset for developing an imputation model. In mass imputation, an independent big data sample is used as a training dataset, and imputation is applied to all units in the probability sample. While the mass imputation idea for incorporating information from big data is very natural, the literature on mass imputation itself is relatively sparse.

In a parametric approach, let \(m(X;\beta )\) be the posited model for m(X), where \(\beta \in {\mathbb {R}}^{p}\) is the unknown parameter. Under Assumption 1, \({\widehat{\beta }}\) can be obtained by fitting the model to sample B. We assume that \({\widehat{\beta }}\) is the unique solution to:

$$\begin{aligned} {\widehat{U}}(\beta )=\sum _{i\in B}\left\{ y_{i}-m(x_{i};\beta )\right\} h(x_{i};\beta )=0 \end{aligned}$$

for some p-dimensional vector \(h(x_{i};\beta )\). Thus, we use the observations in sample B to obtain \({\widehat{\beta }}\) and use it to construct \({\widehat{y}}_{i}=m(x_{i};{\widehat{\beta }})\) for all \(i\in A\).

Under some regularity conditions, the mass imputation estimator

$$\begin{aligned} {\widehat{\mu }}_{\mathrm {I}}={\widehat{\mu }}_{\mathrm {I}}({\widehat{\beta }})=N^{-1}\sum _{i\in A}d_{A,i}m(x_{i};{\widehat{\beta }}) \end{aligned}$$

satisfies: \({\widehat{\mu }}_{\mathrm {I}}={\widehat{\mu }}_{\mathrm {I}}(\beta _{0})+o_{P}(n_{B}^{-1/2})\), where

$$\begin{aligned} {\widehat{\mu }}_{\mathrm {I}}(\beta )= & {} N^{-1}\sum _{i\in A}d_{A,i}m(x_{i};\beta )+n_{B}^{-1}\sum _{i\in B}\left\{ y_{i}-m(x_{i};\beta )\right\} h(x_{i};\beta )^{{\text{T}}}c^{*},\\ c^{*}= & {} \left\{ n_{B}^{-1}\sum _{i\in B}{\dot{m}}(x_{i};\beta _{0})h^{{\text{T}}}(x_{i};\beta _{0})\right\} ^{-1}\left\{ N^{-1}\sum _{i=1}^{N}{\dot{m}}(x_{i};\beta _{0})\right\} , \end{aligned}$$

where \(\beta _{0}\) is the true value of \(\beta\) and \({\dot{m}}(x;\beta )=\partial m(x;\beta )/\partial \beta\).

Also:

$$\begin{aligned} E\{{\widehat{\mu }}_{\mathrm {I}}(\beta _{0})-\mu _{y}\}=0, \end{aligned}$$

and

$$\begin{aligned} \mathrm {var}\left\{ {\widehat{\mu }}_{\mathrm {I}}(\beta _{0})-\mu _{y}\right\}&=\mathrm {var}\left\{ N^{-1}\sum _{i\in A}d_{A,i}m(x_{i};\beta _{0})-N^{-1}\sum _{i\in U}m(x_{i};\beta _{0})\right\} \\&\quad +E\left[ n_{B}^{-2}\sum _{i\in B}E\left( e_{i}^{2}\mid x_{i}\right) \left\{ h(x_{i};\beta _{0})^{{\text{T}}}c^{*}\right\} ^{2}\right] , \end{aligned}$$

where \(e_{i}=y_{i}-m(x_{i};\beta _{0})\). The justification for \({\widehat{\mu }}_{\mathrm {I}}\) relies on a correct specification of \(m(X;\beta )\) and the consistency of \({\widehat{\beta }}\). If \(m(X;\beta )\) is mis-specified or \({\widehat{\beta }}\) is inconsistent, \({\widehat{\mu }}_{\mathrm {I}}\) can be biased. For variance estimation, either the linearization method or the bootstrap method can be used. See Kim et al. (2018) for more details.
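As a concrete sketch, assume a linear working model \(m(x;\beta )=x^{{\text{T}}}\beta\) with \(h(x;\beta )=x\), so that \({\widehat{U}}(\beta )=0\) is simply the least-squares normal equation on sample B; the input arrays are hypothetical.

```python
import numpy as np

def mass_impute_nonprob(xB, yB, xA, dA, N):
    """Mass imputation with the non-probability sample B as the training set:
    fit m(x; beta) by (unweighted) least squares on sample B, which solves
    U_hat(beta) = 0 with h(x; beta) = x, then project the fitted values onto
    the probability sample A using its design weights."""
    XB = np.column_stack([np.ones(len(yB)), xB])
    beta = np.linalg.lstsq(XB, yB, rcond=None)[0]
    XA = np.column_stack([np.ones(xA.shape[0]), xA])
    return np.sum(dA * (XA @ beta)) / N
```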

3.6 Doubly robust estimation

To improve the robustness against model mis-specification, one can combine the weighting and imputation approaches (Kim and Wang 2018). The doubly robust estimator employs both the propensity score model and the outcome model and is given by:

$$\begin{aligned} {\widehat{\mu }}_{\mathrm {dr}}={\widehat{\mu }}_{\mathrm {dr}}({\widehat{\alpha }},{\widehat{\beta }})=N^{-1}\sum _{i=1}^{N}\left[ \frac{I_{B,i}}{\pi _{B}(x_{i};{\widehat{\alpha }})}\{y_{i}-m(x_{i};{\widehat{\beta }})\}+I_{A,i}d_{A,i}m(x_{i};{\widehat{\beta }})\right] . \end{aligned}$$
(19)

The estimator \({\widehat{\mu }}_{\mathrm {dr}}\) is doubly robust in the sense that it is consistent if either the propensity score model or the outcome model is correctly specified, not necessarily both. Moreover, it is locally efficient if both models are correctly specified (Bang and Robins 2005; Cao et al. 2009). Let \({\widehat{\mu }}_{\mathrm {HT}}=N^{-1}\sum _{i\in A}d_{A,i}y_{i}\) be the Horvitz–Thompson estimator that could be used if \(y_{i}\) were observed in sample A. We express \({\widehat{\mu }}_{\mathrm {dr}}-{\widehat{\mu }}_{\mathrm {HT}}=N^{-1}\big[-\sum _{i\in A}d_{A,i}{\widehat{e}}_{i}+\sum _{i\in B}\{\pi _{B}(x_{i};{\widehat{\alpha }})\}^{-1}{\widehat{e}}_{i}\big]\), where \({\widehat{e}}_{i}=y_{i}-{\widehat{y}}_{i}\). To show the double robustness of \({\widehat{\mu }}_{\mathrm {dr}}\), we consider two scenarios. In the first scenario, if \(\pi _{B}(X;\alpha )\) is correctly specified, then:

$$\begin{aligned} E\left( {\widehat{\mu }}_{\mathrm {dr}}-{\widehat{\mu }}_{\mathrm {HT}}\mid {\mathcal {F}}_{N}\right) \cong N^{-1}\left( -\sum _{i\in A}d_{A,i}{\widehat{e}}_{i}+\sum _{i\in U}{\widehat{e}}_{i}\right) , \end{aligned}$$

which is design-unbiased for zero. In the second scenario, if \(m(X;\beta )\) is correctly specified, then \(E({\widehat{e}}_{i})\cong 0\). In both cases, \({\widehat{\mu }}_{\mathrm {dr}}-{\widehat{\mu }}_{\mathrm {HT}}\) has expectation approximately zero and, therefore, \({\widehat{\mu }}_{\mathrm {dr}}\) is approximately unbiased for \(\mu _{y}\).
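Given fitted sampling scores and outcome model predictions (obtained, e.g., as in Sects. 3.3 and 3.5), the doubly robust estimator (19) is simple to compute; a minimal sketch with hypothetical inputs follows.

```python
import numpy as np

def dr_estimator(dA, mA_hat, yB, piB_hat, mB_hat, N):
    """Doubly robust estimator (19).
    dA:      design weights of sample A
    mA_hat:  m(x; beta_hat) evaluated on sample A
    yB:      study variable observed on sample B
    piB_hat: estimated sampling scores pi_B(x; alpha_hat) on sample B
    mB_hat:  m(x; beta_hat) evaluated on sample B"""
    bias_correction = np.sum((yB - mB_hat) / piB_hat)   # inverse-weighted residuals over B
    projection = np.sum(dA * mA_hat)                    # design-weighted predictions over A
    return (bias_correction + projection) / N
```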

If either \(\pi _{B}(X^{{\text{T}}}\alpha )\) or \(m(X^{{\text{T}}}\beta )\) is correctly specified:

$$\begin{aligned} n^{1/2}\left\{ {\widehat{\mu }}_{\mathrm {dr}}({\widehat{\alpha }},{\widehat{\beta }})-\mu \right\} \rightarrow {\mathcal {N}}\left( 0,V\right) , \end{aligned}$$

as \(n\rightarrow \infty\), where \(V=\lim _{n\rightarrow \infty }(V_{1}+V_{2}):\)

$$\begin{aligned} V_{1}= & {} E\left\{ \frac{n}{N^{2}}\sum _{i=1}^{N}\sum _{j=1}^{N}(\pi _{A,ij}-\pi _{A,i}\pi _{A,j})\frac{m(x_{i}^{{\text{T}}}\beta ^{*})}{\pi _{A,i}}\frac{m(x_{j}^{{\text{T}}}\beta ^{*})}{\pi _{A,j}}\right\} ,\\ V_{2}= & {} \frac{n}{N^{2}}\sum _{i=1}^{N}E\left[ \left\{ \frac{I_{B,i}}{\pi _{B,i}(x_{i}^{{\text{T}}}\alpha ^{*})}-1\right\} ^{2}\left\{ y_{i}-m(x_{i}^{{\text{T}}}\beta ^{*})\right\} ^{2}\right] . \end{aligned}$$

To estimate \(V_{1}\), we can use the design-based variance estimator applied to \(m(X_{i}^{{\text{T}}}{\widehat{\beta }})\) as:

$$\begin{aligned} {\widehat{V}}_{1}={\frac{n}{N^{2}}}\sum _{i\in A}\sum _{j\in A}\frac{(\pi _{A,ij}-\pi _{A,i}\pi _{A,j})}{\pi _{A,ij}}\frac{m(X_{i}^{{\text{T}}}{\widehat{\beta }})}{\pi _{A,i}}\frac{m(X_{j}^{{\text{T}}}{\widehat{\beta }})}{\pi _{A,j}}. \end{aligned}$$
(20)

To estimate \(V_{2},\) we further express \(V_{2}\) as:

$$\begin{aligned} V_{2}={\frac{n}{N^{2}}}\sum _{i=1}^{N}E\left[ \left\{ \frac{I_{B,i}}{\pi _{B,i}(X_{i}^{{\text{T}}}\alpha ^{*})^{2}}-\frac{2I_{B,i}}{\pi _{B,i}(X_{i}^{{\text{T}}}\alpha ^{*})}\right\} \left\{ Y_{i}-m(X_{i}^{{\text{T}}}\beta ^{*})\right\} ^{2}+\left\{ Y_{i}-m(X_{i}^{{\text{T}}}\beta ^{*})\right\} ^{2}\right] . \end{aligned}$$
(21)

Let \(\sigma ^{2}(X_{i}^{{\text{T}}}\beta ^{*})=E\left[ \left\{ Y_{i}-m(X_{i}^{{\text{T}}}\beta ^{*})\right\} ^{2}\mid X_{i}\right]\), and let \({\widehat{\sigma }}^{2}(X_{i})\) be a consistent estimator of \(\sigma ^{2}(X_{i}^{{\text{T}}}\beta ^{*})\). We can then estimate \(V_{2}\) by:

$$\begin{aligned} {\widehat{V}}_{2}=\frac{n}{N^{2}}\sum _{i=1}^{N}\left[ \left\{ \frac{I_{B,i}}{\pi _{B}(X_{i}^{{\text{T}}}{\widehat{\alpha }})^{2}}-\frac{2I_{B,i}}{\pi _{B}(X_{i}^{{\text{T}}}{\widehat{\alpha }})}\right\} \left\{ Y_{i}-m(X_{i}^{{\text{T}}}{\widehat{\beta }})\right\} ^{2}+I_{A,i}d_{A,i}{\widehat{\sigma }}^{2}(X_{i})\right] . \end{aligned}$$
(22)

By the law of large numbers, \({\widehat{V}}_{2}\) is consistent for \(V_{2}\) regardless of whether the propensity score model \(\pi _{B}(X_{i}^{{\text{T}}}\alpha )\) or the outcome model \(m(X_{i}^{{\text{T}}}\beta )\) is mis-specified, and therefore, it is doubly robust.

4 Combining probability and big data

4.1 Big data sample

To meet the new challenges in probability sampling, statistical offices face increasing pressure to utilize convenient but often uncontrolled big data sources, such as satellite information (McRoberts et al. 2010), mobile sensor data (Palmer et al. 2013), and web survey panels (Tourangeau et al. 2013). Couper (2013), Citro (2014), Tam and Clarke (2015), and Pfeffermann et al. (2015) articulated the promise of harnessing big data for official and survey statistics, but also raised many issues regarding big data sources. While such data sources provide timely data for a large number of variables and population elements, they are non-probability samples and often fail to represent the target population of interest because of inherent selection biases. Tam and Kim (2018) also covered some ethical challenges of big data for official statisticians and discussed some preliminary methods for correcting selection bias in big data.

Combining information from several sources to improve estimates of population parameters is an important practical problem in survey sampling. In the past decade, more and more auxiliary information has become available, including large administrative record datasets and remote-sensing data derived from satellite images. How to combine such information with survey data to provide better estimates of population parameters is a new challenge that survey statisticians face today. Tam and Clarke (2015) presented an overview of some initiatives of big data applications in the official statistics of the Australian Bureau of Statistics. Such big data are becoming increasingly popular and come from a variety of sources, such as remote-sensing data and administrative records such as tax data.

Suppose that there are two data sources, one from a probability sample, referred to as sample A, and the other from a big data source, referred to as sample B. Table 4 illustrates the observed data structure.

Table 4 Data structure for data integration with big data

4.2 Scenario 1: leverage auxiliary information in big data to improve efficiency

In Scenario 1, the probability sample contains Y observations. Therefore, \(\mu _{y}\) is identifiable and can be estimated by the commonly used estimator based solely on sample A, denoted by \({\widehat{\mu }}_{A}\). We can leverage the X information in the big data sample to improve the sample A estimator. We consider the case where, in addition, membership in the big data sample can be determined for every unit in the probability sample. The key insight is that the subsample of units in sample A belonging to the big data sample constitutes a second-phase sample from the big data sample, which acts as a new population. We calibrate the information in the second-phase sample to match that of the new acting population. The calibration process in turn improves the accuracy of the resulting estimator without requiring any model assumptions. Let \(h=(I_{B},1-I_{B},I_{B}X)\).

Following Yang and Ding (2018), we can consider a class of estimators satisfying:

$$\begin{aligned} n_{A}^{1/2}\left( \begin{array}{l} {\widehat{\mu }}_{A}-\mu _{y}\\ {\widehat{h}}_{A}-{\widehat{h}}_{B} \end{array}\right) \rightarrow {\mathcal {N}}\left\{ 0,\left( \begin{array}{ll} V_{yy,A} &{} \Gamma ^{{\text{T}}}\\ \Gamma &{} V \end{array}\right) \right\} , \end{aligned}$$
(23)

in distribution, as \(n_{A}\rightarrow \infty\), where \({\widehat{h}}_{A}=N^{-1}\sum _{i\in A}d_{A,i}h_{i}\) and \({\widehat{h}}_{B}=N^{-1}\sum _{i\in B}h_{i}\). Heuristically, if (23) holds exactly rather than asymptotically, by the multivariate normal theory, we have the following conditional distribution:

$$\begin{aligned} n_{A}^{1/2}({\widehat{\mu }}_{A}-\mu _{y})\bigg | n_{A}^{1/2}({\widehat{h}}_{A}-{\widehat{h}}_{B})\sim {\mathcal {N}}\left\{ n_{A}^{1/2}\Gamma ^{{\text{T}}}V^{-1}({\widehat{h}}_{A}-{\widehat{h}}_{B}),V_{yy,A}-\Gamma ^{{\text{T}}}V^{-1}\Gamma \right\} . \end{aligned}$$

Let \({\widehat{V}}_{yy,A},\) \({\widehat{\Gamma }}\) and \({\widehat{V}}\) be consistent estimators for \(V_{yy,A},\) \(\Gamma\), and V. We set \(n_{A}^{1/2}({\widehat{\mu }}_{A}-\mu _{y})\) to equal its estimated conditional mean \(n_{A}^{1/2}{\widehat{\Gamma }}^{{\text{T}}}{\widehat{V}}^{-1}({\widehat{h}}_{A}-{\widehat{h}}_{B})\), leading to an estimating equation for \(\mu _{y}\):

$$\begin{aligned} n_{A}^{1/2}({\widehat{\mu }}_{A}-\mu _{y})=n_{A}^{1/2}{\widehat{\Gamma }}^{{\text{T}}}{\widehat{V}}^{-1}({\widehat{h}}_{A}-{\widehat{h}}_{B}). \end{aligned}$$

Solving this equation for \(\mu _{y}\), we obtain the estimator:

$$\begin{aligned} {\widehat{\mu }}={\widehat{\mu }}_{A}-{\widehat{\Gamma }}^{{\text{T}}}{\widehat{V}}^{-1}({\widehat{h}}_{A}-{\widehat{h}}_{B}). \end{aligned}$$
(24)

Under certain regularity conditions, if (23) holds, then \({\widehat{\mu }}\) is consistent for \(\mu _{y}\), and:

$$\begin{aligned} n_{A}^{1/2}({\widehat{\mu }}-\mu _{y})\rightarrow {\mathcal {N}}(0,V_{yy,A}-\Gamma ^{{\text{T}}}V^{-1}\Gamma ), \end{aligned}$$
(25)

in distribution, as \(n_{A}\rightarrow \infty\). Given a nonzero \(\Gamma\), the asymptotic variance, \(V_{yy,A}-\Gamma ^{{\text{T}}}V^{-1}\Gamma ,\) is smaller than the asymptotic variance of \({\widehat{\mu }}_{A}\), \(V_{yy,A}\).

The asymptotic variance of \({\widehat{\mu }}\) can be estimated by:

$$\begin{aligned} {\widehat{V}}=\left( {\widehat{V}}_{yy,A}-{\widehat{\Gamma }}^{{\text{T}}}{\widehat{V}}^{-1}{\widehat{\Gamma }}\right) /n_{A}. \end{aligned}$$
(26)
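A minimal sketch of the estimator (24) and the variance estimator (26), assuming the required point estimates and (co)variance estimates in (23) have already been computed, is given below; the argument names are hypothetical.

```python
import numpy as np

def improved_estimator(mu_A, h_A, h_B, V_yyA, Gamma, V, n_A):
    """Efficient estimator (24) and its variance estimator (26).
    Gamma: estimated covariance of mu_A with (h_A - h_B), shape (q,)
    V:     estimated variance matrix of (h_A - h_B), shape (q, q)
    All quantities are on the n_A-scaled (asymptotic) scale used in (23)."""
    adjustment = Gamma @ np.linalg.solve(V, h_A - h_B)
    mu_hat = mu_A - adjustment                                   # estimator (24)
    var_hat = (V_yyA - Gamma @ np.linalg.solve(V, Gamma)) / n_A  # estimator (26)
    return mu_hat, var_hat
```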

Kim and Tam (2018) also explored similar ideas. They developed a calibration weighting method to incorporate the big data auxiliary information and applied the method to official statistics at the Australian Bureau of Statistics. In their application, the big data source is the Australian Agricultural Census, with an 85% response rate, and the probability sample is the Rural Environment and Agricultural Commodities Survey; the measurement from the Census data serves as the auxiliary variable for calibration.

4.3 Scenario 2: leverage probability sampling designs to correct for selection bias

In Scenario 2, the setup is similar to that in Sect. 3. Depending on their roles in statistical inference, there are two types of big data: one with a large sample size (large n) and the other with rich covariates (large p). We review methods for each type in turn.

4.3.1 Robust mass imputation estimation

In the first type, the non-probability sample can be large in sample size. How to leverage the rich information in the big data to improve finite population inference is an important research question. We review robust mass imputation methods.

When the sample size of the big data is large, mass imputation is particularly attractive. In mass imputation, we train a predictive model on the big data and impute the missing \(y_{i}\) in sample A. Instead of a parametric approach, we can also consider non-parametric approaches. To find suitable imputed values, we consider nearest-neighbor imputation; that is, we find the closest matching unit in sample B based on the X values and use the corresponding Y value from that unit as the imputed value.

Using sample B (the big data) as the training data, we find the nearest neighbor of each unit \(i\in A\) using a distance measure \(d(x_{i},x_{j})\). Let i(1) denote the index of its nearest neighbor, which satisfies:

$$\begin{aligned} d(x_{i(1)},x_{i})\le d(x_{j},x_{i}),\forall j\in B. \end{aligned}$$

The nearest-neighbor imputation estimator of \(\mu\) is:

$$\begin{aligned} {\widehat{\mu }}_{\mathrm {nni}}=N^{-1}\sum _{i\in A}d_{A,i}y_{i(1)}. \end{aligned}$$
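
A minimal sketch of \({\widehat{\mu }}_{\mathrm {nni}}\) is given below, using a brute-force Euclidean distance search for clarity; for a very large sample B one would typically use a tree-based or approximate nearest-neighbor index. The function and argument names are illustrative.

```python
import numpy as np

def nn_imputation_estimator(x_A, w_A, x_B, y_B, N):
    """Nearest-neighbor imputation estimator of mu: a sketch.

    x_A : covariates for sample A (n_A x p)
    w_A : design weights d_{A,i} for sample A
    x_B : covariates for the big data sample B (n_B x p)
    y_B : outcomes observed in the big data sample B
    N   : finite population size
    """
    x_A = np.asarray(x_A, dtype=float).reshape(len(w_A), -1)
    x_B = np.asarray(x_B, dtype=float).reshape(len(y_B), -1)
    y_B = np.asarray(y_B, dtype=float)

    # For each unit i in A, find its nearest neighbor i(1) in B under
    # Euclidean distance (brute force for clarity)
    dists = np.linalg.norm(x_A[:, None, :] - x_B[None, :, :], axis=2)
    y_imp = y_B[dists.argmin(axis=1)]            # imputed values y_{i(1)}

    return np.sum(np.asarray(w_A) * y_imp) / N   # estimator mu_nni
```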

Yang and Kim (2018) showed that, under some regularity conditions, \({\widehat{\mu }}_{\mathrm {nni}}\) has the same asymptotic distribution as \({\widehat{\mu }}_{\mathrm {HT}}=N^{-1}\sum _{i\in A}d_{A,i}y_{i}\). Therefore, the asymptotic variance of \({\widehat{\mu }}_{\mathrm {nni}}\) is the same as that of \({\widehat{\mu }}_{\mathrm {HT}}\). This implies that the standard point estimator can be applied to the imputed data \(\{(x_{i},y_{i(1)}):i\in A\}\) as if the \(y_{i(1)}\)s were observed values. Let \(\pi _{A,ij}\) be the joint inclusion probability for units i and j. They showed that the direct variance estimator based on the imputed data:

$$\begin{aligned} {\widehat{V}}_{\mathrm {nni}}=\frac{n_{A}}{N^{2}}\sum _{i\in A}\sum _{j\in A}\frac{(\pi _{A,ij}-\pi _{A,i}\pi _{A,j})}{\pi _{A,ij}}\frac{y_{i(1)}}{\pi _{A,i}}\frac{y_{j(1)}}{\pi _{A,j}} \end{aligned}$$

is consistent for \(V_{\mathrm {nni}}\), the asymptotic variance of \({\widehat{\mu }}_{\mathrm {nni}}\).
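
The variance estimator \({\widehat{V}}_{\mathrm {nni}}\) translates directly into code once the first- and second-order inclusion probabilities for sample A are available; a sketch with hypothetical inputs follows.

```python
import numpy as np

def nn_variance_estimator(y_imp, pi_A, pi_A_joint, N):
    """Variance estimator V_nni based on the imputed data: a sketch.

    y_imp      : imputed values y_{i(1)} for units in sample A
    pi_A       : first-order inclusion probabilities pi_{A,i}
    pi_A_joint : n_A x n_A matrix of joint inclusion probabilities pi_{A,ij},
                 with pi_{A,ii} = pi_{A,i}
    N          : finite population size
    """
    y_imp, pi_A = np.asarray(y_imp, dtype=float), np.asarray(pi_A, dtype=float)
    pi_A_joint = np.asarray(pi_A_joint, dtype=float)
    n_A = len(y_imp)

    check = y_imp / pi_A                                        # y_{i(1)} / pi_{A,i}
    delta = (pi_A_joint - np.outer(pi_A, pi_A)) / pi_A_joint    # design covariances
    return (n_A / N ** 2) * check @ delta @ check
```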

Yang and Kim (2018) also considered two strategies for improving the nearest-neighbor imputation estimator: one using K-nearest-neighbor imputation (Mack and Rosenblatt 1979) and the other using generalized additive models (Wood 2006). In K-nearest-neighbor imputation, instead of using a single nearest neighbor, they identify multiple nearest neighbors in the big data sample and use the average of their responses as the imputed value. This method is popular in the international forest inventory community for combining ground-based observations with images from remote sensors (McRoberts et al. 2010). In the second strategy, they investigated modern prediction techniques for mass imputation with flexible models. They used generalized additive models (Wood 2006) to learn the relationship between the outcome and the covariates from the big data and to create predictions for the probability sample. We note that this strategy extends to a wider class of semi- and non-parametric estimators such as single index models, Lasso estimators (Belloni et al. 2015), and machine learning methods such as random forests (Breiman 2001).
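
As an illustration of the first strategy, a K-nearest-neighbor variant of the earlier sketch averages the K closest big data responses; the choice K = 5 below is arbitrary, and the code is a sketch rather than the authors' implementation.

```python
import numpy as np

def knn_imputation_estimator(x_A, w_A, x_B, y_B, N, K=5):
    """K-nearest-neighbor mass imputation estimator: an illustrative sketch."""
    x_A = np.asarray(x_A, dtype=float).reshape(len(w_A), -1)
    x_B = np.asarray(x_B, dtype=float).reshape(len(y_B), -1)
    y_B = np.asarray(y_B, dtype=float)

    # K closest big data units for each unit in A (brute-force Euclidean search)
    dists = np.linalg.norm(x_A[:, None, :] - x_B[None, :, :], axis=2)
    knn_index = np.argsort(dists, axis=1)[:, :K]
    y_imp = y_B[knn_index].mean(axis=1)      # average of the K responses

    return np.sum(np.asarray(w_A) * y_imp) / N
```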

4.3.2 Variable selection in the presence of a large number of covariates

In the second type, there are a large number of variables. There is a large literature on variable selection methods for prediction, but little work on variable selection for data integration that recognizes the strengths and limitations of each data source and utilizes all of the information captured for finite population inference.

In practice, subject matter experts recommend a rich set of potentially useful variables but typically do not identify which variables to adjust for. In the presence of a large number of auxiliary variables, variable selection is important, because existing methods may become unstable or even infeasible, and irrelevant auxiliary variables can introduce large variability into the estimation. Gao and Carroll (2017) proposed a pseudo-likelihood approach for combining multiple non-survey data sources with high dimensionality; this approach requires all likelihoods to be correctly specified and is therefore sensitive to model mis-specification. Chen et al. (2018) proposed a model-based calibration approach using the Lasso; this approach relies on a correctly specified outcome model.

Yang et al. (2019) proposed a doubly robust variable selection and estimation strategy. In the first step, it selects a set of variables that are important predictors of either the sampling score or the outcome model using penalized estimating equations. In the second step, it re-estimates the nuisance parameters \((\alpha ,\beta )\) based on the union of the covariates selected in the first step and uses the doubly robust estimator of \(\mu\), \({\widehat{\mu }}_{\mathrm {dr}}({\widehat{\alpha }},{\widehat{\beta }})\) in (19), where the estimating functions are:

$$\begin{aligned} J(\alpha ,\beta )=\left( \begin{array}{c} J_{1}(\alpha ,\beta )\\ J_{2}(\alpha ,\beta ) \end{array}\right) =\left( \begin{array}{c} N^{-1}\sum _{i=1}^{N}I_{B,i}\left\{ \frac{1}{\pi _{B}(x_{i}^{{\text{T}}}\alpha )}-1\right\} \{y_{i}-m(x_{i}^{{\text{T}}}\beta )\}x_{i}\\ N^{-1}\sum _{i=1}^{N}\left\{ \frac{I_{B,i}}{\pi _{B}(x_{i}^{{\text{T}}}\alpha )}-d_{A,i}I_{A,i}\right\} \partial m(x_{i}^{{\text{T}}}\beta )/\partial \beta \end{array}\right) . \end{aligned}$$
(27)

Importantly, the two-step estimator allows mis-specification of either the sampling score or the outcome model. In the existing high-dimensional causal inference literature, doubly robust estimators have been shown to be robust to variable selection errors under penalization (Farrell 2015) or to approximation errors from machine learning (Chernozhukov et al. 2018); however, these results require both nuisance models to be correctly specified. Using (27) relaxes this requirement by allowing one of the nuisance models to be mis-specified. It also enables a simple and consistent variance estimator, (20)\(+\)(22), allowing for doubly robust inference.
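
To illustrate the second step, the sketch below codes the estimating functions in (27) as sums over the observed samples, assuming a logistic sampling score \(\pi _{B}(x^{{\text{T}}}\alpha )\) and a linear outcome model \(m(x^{{\text{T}}}\beta )\) (so that \(\partial m/\partial \beta =x\)). The penalized first-step selection, the choice of solver, and the final plug-in into \({\widehat{\mu }}_{\mathrm {dr}}\) are omitted, so this should be read as a schematic rather than a reproduction of Yang et al. (2019).

```python
import numpy as np
from scipy.optimize import fsolve

def dr_estimating_equations(theta, x_B, y_B, x_A, w_A, N):
    """Estimating functions J(alpha, beta) in (27), written as sums over the
    observed samples (sketch: logistic sampling score, linear outcome model).

    theta    : concatenation of (alpha, beta), each of length p
    x_B, y_B : covariates (n_B x p) and outcomes for the big data sample B
    x_A, w_A : covariates (n_A x p) and design weights d_{A,i} for sample A
    N        : finite population size
    """
    p = x_B.shape[1]
    alpha, beta = theta[:p], theta[p:]

    pi_B = 1.0 / (1.0 + np.exp(-x_B @ alpha))   # sampling score pi_B(x^T alpha)
    resid = y_B - x_B @ beta                    # y_i - m(x_i^T beta) on sample B

    # J1: only units with I_{B,i} = 1 contribute
    J1 = (x_B * ((1.0 / pi_B - 1.0) * resid)[:, None]).sum(axis=0) / N

    # J2: IPW total over B minus design-weighted total over A of dm/dbeta = x
    J2 = ((x_B / pi_B[:, None]).sum(axis=0)
          - (w_A[:, None] * x_A).sum(axis=0)) / N
    return np.concatenate([J1, J2])

# Hypothetical usage with the covariates selected in the first step:
# theta_hat = fsolve(dr_estimating_equations, x0=np.zeros(2 * x_B.shape[1]),
#                    args=(x_B, y_B, x_A, w_A, N))
# alpha_hat, beta_hat = np.split(theta_hat, 2)
```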

5 Concluding remarks

Data integration is an emerging area of research with many potential research topics. We have reviewed statistical techniques and applications for data integration in the survey sampling context. Probability sampling remains the gold standard for obtaining a representative sample, but the measurement of the study variable may be obtained from an independent non-probability sample or big data source. In this case, assumptions about the sampling model or the outcome model are required. Most data integration methods rely on the unverifiable assumption that the sampling mechanism for the non-probability sample (or big data) is non-informative (corresponding to missingness at random in the missing data literature).

If the sampling mechanism is informative, imputation techniques can be developed under strong model assumptions for the sampling mechanism (e.g., Riddles et al. 2016; Morikawa and Kim 2018). As in the non-informative sampling case, the informative sampling assumption is unverifiable. In such settings, sensitivity analysis is recommended to assess the robustness of the study conclusions to unverifiable assumptions. This recommendation echoes Recommendation 15 of the National Research Council (NRC) report “The Prevention and Treatment of Missing Data in Clinical Trials” (National Research Council 2010). Chapter 5 of the NRC report describes “global” sensitivity analysis procedures that rigorously evaluate the robustness of study findings to untestable assumptions about how missingness might be related to the unobserved outcome.

When the training dataset has a hierarchical structure, multi-level or hierarchical models can be used for mass imputation. This is closely related to unit-level small area estimation in survey sampling (Rao and Molina 2015). Small area estimation is particularly promising when data integration uses big data: when the big data serve as the training sample for prediction, a multi-level model can reflect the possible correlation structure among observations. The parameter estimates for the multi-level model computed from the big data can then be used to predict the unobserved study variables in the survey sample, provided the same multi-level model holds for both data sources. Further research in this direction, including mean-squared error estimation for the resulting small area estimators, is a topic for future work.

Finally, the uncertainty due to errors in record linkage and statistical matching is also an important problem. A matched sample obtained via record linkage techniques (Fellegi and Sunter 1969) is subject to linkage errors. Zhang and Chambers (2019) cover several research topics in the statistical analysis of combined or fused data.