1 Introduction

Time constraints and budgetary restrictions are two major drivers of the ongoing rethinking of the classic way of collecting data as currently practised by both Official Statistics (OS) and researchers from different disciplines. Data collected through, e.g., national censuses are being progressively set aside, while new scenarios for data integration emerge, offering new solutions for OS, researchers, policymakers, and the general public.

The massive generation of Big Data experienced in the last two decades fosters the idea that information can be collected through countless approaches. Mobile devices, apps, social media, and the Internet of Things combine to suggest that there is no need to plan data collection anymore; rather, we only need modelling solutions to exploit the already available information (Iaccarino 2019). Nevertheless, raw data pose open challenges in terms of quality, constraints due to privacy and security, problems of data ownership, as well as organizational, technological, and governance issues.

A gradual shift from data collection to data integration is already underway, primarily in OS, which is developing strategies to keep the available data sources recursively up to date, for example by integrating both primary and secondary data and by combining administrative registers, web data, project surveys, satellite, and geo-data. Data integration thus represents the future of data production and sharing processes. The existing approaches combine (1) traditional sources and administrative data, (2) traditional sources and Big Data, and (3) micro- and macro-level data (UNECE 2017). These strategies share a common focus: to improve the ability to meet and properly answer users’ needs by assembling valuable information from multiple sources across a very broad research spectrum (Pentland 2019).

The integration of information originally collected in two (or more) data sources can be performed with different methods. Relevant ones are record linkage (RL), multiple imputation (MI), and statistical matching (SM).

RL comprises exact and probabilistic approaches (Christen 2012). When two or more data sources referring to the same population must be integrated, exact RL allows us to merge them on the basis of a common identifier for the units occurring in both data sets. If a record in one data set has exactly the same value of the common identifier as a record in the other data set, exact RL merges the two records. This is the simplest case for integrating, e.g., administrative sources. However, when (1) the sets of units collected by two or more data sources are (at least partially) overlapping, (2) no unique identifiers exist or can be used, and (3) the variables that the data sources have in common can serve as ‘pseudo-identifiers’ but are misreported or change over time (Fellegi and Sunter 1969), probabilistic RL plays the leading role in the integration process. Probabilistic RL thus detects the records in different data sets that refer to the same unit when exact identifiers are unavailable or cannot be used.
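To make the exact RL case concrete, the following minimal sketch merges two toy sources on a hypothetical common identifier ("person_id"); the data and column names are illustrative assumptions, not taken from any of the cited studies.

```python
# Minimal sketch of exact record linkage: an inner join on a common identifier.
import pandas as pd

source_1 = pd.DataFrame({"person_id": [1, 2, 3], "income": [30_000, 45_000, 52_000]})
source_2 = pd.DataFrame({"person_id": [2, 3, 4], "expenditure": [28_000, 39_000, 41_000]})

# Records are merged only when the identifier takes exactly the same value in both sources.
linked = source_1.merge(source_2, on="person_id", how="inner")
print(linked)   # units 2 and 3 are linked; units 1 and 4 remain unmatched
```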

MI handles missing values in variables. This is done, at the individual level, through a two-step approach. First, a small number of completed data sets are created by filling in the missing values with draws from an imputation model. Second, estimates are computed in each completed data set and are finally combined (Rubin 1987). MI can be used when a partially observed data set must be ‘filled’, for each record, with an estimated substitute of the variable’s value that is randomly generated from the unknown conditional distribution of the missing variable given the observed ones, using draws from an imputation model (Murray 2018). MI usually completes the records’ missing entries by exploiting only one data set (Denk and Hackl 2003) and, “roughly speaking, the missing data are imputed more than once (...) being these imputations based on some distributional assumptions” (Rässler 2002, p. 5).
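As an illustration of the two-step logic just described, the sketch below imputes a single partially observed variable M = 5 times from a simple normal imputation model and then combines the estimates with Rubin’s rules; the data, the imputation model, and the value of M are illustrative assumptions.

```python
# Minimal sketch of multiple imputation for one variable with missing entries.
import numpy as np

rng = np.random.default_rng(0)
y = np.array([2.1, np.nan, 3.4, 2.8, np.nan, 3.0, 2.5])
observed = y[~np.isnan(y)]

M = 5
estimates, variances = [], []
for _ in range(M):
    # Step 1: fill the missing entries with random draws from an imputation
    # model fitted on the observed values (here, a normal model).
    draws = rng.normal(observed.mean(), observed.std(ddof=1), size=np.isnan(y).sum())
    completed = y.copy()
    completed[np.isnan(y)] = draws
    # Step 2: compute the estimate of interest (here, the mean) in each completed data set.
    estimates.append(completed.mean())
    variances.append(completed.var(ddof=1) / len(completed))

# Combine the M estimates with Rubin's rules.
q_bar = np.mean(estimates)              # pooled point estimate
w_bar = np.mean(variances)              # within-imputation variance
b = np.var(estimates, ddof=1)           # between-imputation variance
total_var = w_bar + (1 + 1 / M) * b
print(q_bar, total_var)
```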

The present paper focuses on statistical matching, and the methodology is described in detail in the following sections. To briefly frame SM in general terms, it allows us to integrate the information contained in two or more data sources when the operational context is characterized by the fact that (1) the different data sources collect information on (i) a set of common variables (\(\varvec{\textrm{x}}\)) and (ii) two sets of variables that are disjointly observed (\(\varvec{\textrm{y}}\) and \(\varvec{\textrm{z}}\)), and (2) the units observed in the data sets are (potentially) disjoint sets of units (D’Orazio et al. 2006b). If RL deals with ‘the same’ units, SM deals with units that are as ‘similar’ as possible (Judson 2005). The main difference between SM and RL/MI lies in the final integration goal. RL evaluates the coverage overlap between the data sets or the presence of duplicated records; it is applicable to add/remove records, potentially augmenting the data in one source. The integration focus of SM also differs slightly from that of MI, which instead goes beyond the conventional two-database situation (Judson 2005). Neither RL nor MI deals with the potentially widest goal of SM: building a synthetic (complete) data set from two (or more) data sources. SM creates a data set that is called ‘synthetic’ because it does not come from the direct observation/collection of information; in other words, it is artificial. On the other hand, it is ‘complete’ in the sense that it ends up aggregating all the variables collected in either one or the other data source. Furthermore, neither RL nor MI considers the amount of uncertainty behind the integration results, as SM, in contrast, allows us to do.Footnote 1 Moreover, SM serves the purpose(s) of data fusion more flexibly (Rässler 2002), being particularly useful when the missing data structure is such that there is the need either to acquire knowledge on the joint distribution function \(f(\varvec{\textrm{x}}, \varvec{\textrm{y}}, \varvec{\textrm{z}})\) or to transfer the missing variable(s) from one source to the other, exploiting only the knowledge on \(\varvec{\textrm{x}}\). In such a context, the random variables (r.v.s) \(\varvec{\textrm{y}}\) are observed only in one data set, while the r.v.s \(\varvec{\textrm{z}}\) are observed only in another. The random variables \(\varvec{\textrm{x}}\) are observed in all the data sets at our disposal and, hereinafter, we call them ‘matching’ variables.

Nowadays, however, fruitful combinations of SM and RL/MI are emerging to deal with the challenges of Big Data integration, a field where non-probability samples must be considered (to date, Bethlehem 2016 identifies the main practical issues arising when matching different samples via mass imputation, while Rao 2021 offers an exhaustive review of probability sampling methods, also focusing on models that yield valid inference from non-probability samples).

Two textbooks offer a cohesive treatment of statistical matching: Rässler (2002) and D’Orazio et al. (2006b). The contribution of the present paper to the literature on SM and, more generally, to data integration is twofold. First, it reviews the latest SM developments and discusses the main findings on the identification, quantification, and treatment of the uncertainty behind data integration. Second, the paper considers these topics by covering SM developments from the earliest works up to the latest published articles. Both the methodological peculiarities and the strengths and weaknesses of SM are discussed, the implications of the sampling frameworks in integrating data are investigated, and existing real-data applications are not disregarded. The paper aims to provide casual as well as specialist readers with a useful map for a complete understanding of the method and its several shades of application.

To simplify the reader’s journey through statistical matching, and to offer a clearer view of the complex development and achievements of the method within a complete and shared theoretical framework, Table 1 shows the crucial contributions constituting the backbone of the SM state of the art. These works, like all others cited, are listed in the References, but a schematic reading of the SM literature is proposed by grouping the most relevant contributions by decade, from the ’70s up to the present, and by the macro-area subjects involved in the SM developments. Table 1 is meant to offer a clearer understanding of the SM evolution and to help readers focus efficiently on the main aspects characterising the method.

Table 1 Statistical matching: from the origins to the consecration, by schematically reading the state-of-the-art

The paper is structured as follows. Section 2 briefly describes the method, highlighting its key features and presenting the two main SM goals (micro and macro). Section 3 reviews the ‘merging’ approach of the origins, discussing the need for formal cohesion that the preliminary SM proposals left aside. Section 4 investigates the non-parametric and Bayesian approaches, discussing the solutions offered to the problems of matching noise quantification and uncertainty definition/estimation. The uncertainty in SM is then analysed according to the most recent proposals in Sect. 5, with a specific focus on both non-representative samples and the problems related to Big Data. Section 6 provides the concluding remarks and some considerations about further SM developments. Appendix A offers an overview of the most notable SM applications and software solutions.

2 Method in brief

For the sake of simplicity, let’s consider two data sets: A and B. Let A be the ‘recipient’ data set, while B is the ‘donor’ data set. The number of observations in the two data sets is \(n_\textrm{A}\) and \(n_\textrm{B}\), respectively. Let \(\varvec{\textrm{x}}\) be the random variables observed both in A and in B; \(\varvec{\textrm{y}}\) are the r.v.s observed only in A, while \(\varvec{\textrm{z}}\) are the r.v.s observed only in B. These r.v.s refer to the i-th and j-th observations collected in A and in B, with \(i = 1, \ldots , n_\textrm{A}\) and \(j = 1, \ldots , n_\textrm{B}\). Therefore, the observed r.v.s are

  • \(\varvec{\textrm{x}}\) = \(\{X_1, \ldots , X_l, \ldots , X_L\}\), collected both in A and in B (where \(X_{l}^{\textrm{A}}\) is a vector of dimension \(n_\textrm{A}\), while \(X_{l}^{\textrm{B}}\) is a vector of dimension \(n_\textrm{B}\)).

  • \(\underset{n_{\textrm{A}}\times M}{\varvec{\textrm{y}}} = \{Y_{1}^\textrm{A}, \ldots , Y_{m}^\textrm{A}, \ldots , Y_{M}^\textrm{A}\}\), collected only in A (where \(Y_{m}^{\textrm{A}}\) is a vector of dimension \(n_\textrm{A}\)).

  • \(\underset{n_{\textrm{B}}\times P}{\varvec{\textrm{z}}} = \{Z_{1}^\textrm{B}, \ldots , Z_{p}^\textrm{B}, \ldots , Z_{P}^\textrm{B}\}\), collected only in B (where \(Z_{p}^\textrm{B}\) is a vector of dimension \(n_\textrm{B}\)).

Therefore, the data sets at hand are A = \(\bigg \{\) \(\underset{n_\textrm{A}\times L}{\varvec{\textrm{x}}^\textrm{A}}\), \(\underset{n_\textrm{A}\times M}{\varvec{\textrm{y}}^\textrm{A}}\) \(\bigg \}\) and B = \(\bigg \{\) \(\underset{n_\textrm{B}\times L}{\varvec{\textrm{x}}^\textrm{B}}\), \(\underset{n_\textrm{B}\times P}{\varvec{\textrm{z}}^\textrm{B}}\) \(\bigg \}\). The whole set of information that we have at hand is depicted in Fig. 1.

Fig. 1 Data at hand in a statistical matching problem with two data sets (A and B)
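For concreteness, the following sketch builds two toy data frames with the structure of Fig. 1; the variable names (age and sex as the matching variables \(\varvec{\textrm{x}}\), income as y, expenditure as z) are illustrative assumptions.

```python
# Minimal sketch of the data structure at hand in a two-data-set SM problem.
import pandas as pd

# Recipient data set A: matching variables x plus y, observed on n_A units.
A = pd.DataFrame({
    "age": [34, 51, 28, 45],
    "sex": ["F", "M", "F", "M"],
    "income": [28_000, 41_000, 23_000, 39_000],   # y, observed only in A
})

# Donor data set B: the same matching variables x plus z, observed on n_B units.
B = pd.DataFrame({
    "age": [30, 49, 27, 62, 44],
    "sex": ["F", "M", "F", "F", "M"],
    "expenditure": [21_000, 35_000, 19_000, 30_000, 33_000],  # z, observed only in B
})
```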

Statistical matching is applied to aggregate the information collected from different sources, using two approaches: micro and macro (D’Orazio et al. 2006b). For the sake of simplicity, let’s assume that \(l = 1\), \(m = 1\), and \(p = 1\); in other words, let X, Y, and Z be univariate, continuous variables. With \({\mathcal {F}}\) a family of distributions where each \(f(X, Y, Z; \varvec{\theta })\) \(\in \) \({\mathcal {F}}\) is defined by a vector of parameters \(\varvec{\theta } \in \Theta \), macro SM aims at estimating the joint distribution function f(X, Y, Z). Micro SM, on the other hand, aims at generating a synthetic (complete) data set from A and B. Whereas the former purpose should be clear, the latter deserves more explanation. Let \(S_d\) be a generic subset of d variables of interest (with \(d = 1, \ldots , P\)) chosen among the r.v.s \(\varvec{\textrm{z}}\). The goal of micro SM is to impute \(S_d\) from B to A, thus generating the synthetic (complete) data set, named C, such that C = \(\bigg \{\) \(\underset{n_\textrm{A}\times L}{\varvec{\textrm{x}}^\textrm{A}}\), \(\underset{n_\textrm{A}\times M}{\varvec{\textrm{y}}^{\textrm{A}}}\), \(\underset{n_\textrm{A}\times S_d}{\varvec{\textrm{z}}^\textrm{A}}\) \(\bigg \}\).
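A minimal sketch of the micro goal, reusing the toy A and B data frames above: each recipient in A receives the expenditure (z) of its nearest donor in B according to the matching variable age; the choice of distance is made purely for illustration.

```python
# Minimal sketch of micro SM via a nearest-neighbour donor assignment.
import numpy as np

# Pairwise absolute distances between recipients (rows) and donors (columns).
dist = np.abs(A["age"].to_numpy()[:, None] - B["age"].to_numpy()[None, :])
closest_donor = dist.argmin(axis=1)

# Synthetic (complete) data set C = {x^A, y^A, z imputed from B}.
C = A.copy()
C["expenditure"] = B["expenditure"].to_numpy()[closest_donor]
print(C)
```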

In the most general SM framework, let us assume that

  A.1. A and B collect information on two representative samples of the same target population (D’Orazio et al. 2006b).

  A.2. The distinct samples A and B depicted in Fig. 1 can be considered as a unique sample A \(\cup \) B of \(n_\textrm{A}+n_\textrm{B}\) i.i.d. observations from f(X, Y, Z) (D’Orazio et al. 2006b).

  A.3. From the overall sample given by A \(\cup \) B, i.e., the sample of \(n_\textrm{A}+n_\textrm{B}\) units from f(X, Y, Z), a synthetic (complete) data set can be derived where the structure of missing information is missing completely at random (MCAR) or missing at random (MAR) (Rubin 1987; Rässler 2002, 2004).

The key estimation problem related to f(X, Y, Z) has often been approached by resorting to the identifiable model derived from the conditional independence assumption (CIA). Briefly, the whole information set is defined by the A \(\cup \) B sample, and Y and Z are assumed to be independent conditionally on X (with X, Y, and Z normally distributed r.v.s in the earliest formulations). Usually, the CIA has been explicitly or implicitly adopted in SM to decompose the aforementioned estimation challenge into smaller estimation problems through the factorization of the likelihood function (Anderson 1957). At first, the solution was limited to the trivariate normal case, and it was subsequently extended to multivariate distributions. Indeed, Rubin (1974) demonstrated that \(f(\varvec{\textrm{x}}, \varvec{\textrm{y}}, \varvec{\textrm{z}})\) is decomposable such that \(f(\varvec{\textrm{x}}, \varvec{\textrm{y}}, \varvec{\textrm{z}}; \varvec{\theta })\) = \(f({\varvec{x}}; \varvec{\theta }_{\varvec{\textrm{x}}}) \cdot f({\varvec{y}} \vert {\varvec{x}}; \varvec{\theta }_{\varvec{\textrm{y}} \vert \varvec{\textrm{x}}}) \cdot f({\varvec{z}} \vert {\varvec{x}}; \varvec{\theta }_{\varvec{\textrm{z}} \vert \varvec{\textrm{x}}})\). In other words, the CIA allows computing the maximum likelihood estimator (MLE) for \(\varvec{\theta }_{\varvec{\textrm{x}}}\) from the A \(\cup \) B sample, while the MLEs for \(\varvec{\theta }_{\varvec{\textrm{y}} \vert \varvec{\textrm{x}}}\) and \(\varvec{\theta }_{\varvec{\textrm{z}} \vert \varvec{\textrm{x}}}\) are computed from A and B, respectively.
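The sketch below illustrates this factorization in the simplest univariate normal case: \(\varvec{\theta }_{\varvec{\textrm{x}}}\) is estimated on A \(\cup \) B, the regression parameters of y on x on A and of z on x on B, and the unobservable covariance of Y and Z is then implied by the model under the CIA; the simulated data and coefficients are illustrative assumptions.

```python
# Minimal sketch of the CIA-based likelihood factorization for univariate x, y, z:
# theta_x from A union B, theta_{y|x} from A, theta_{z|x} from B.
import numpy as np

rng = np.random.default_rng(1)
x_A = rng.normal(0, 1, 500); y_A = 0.8 * x_A + rng.normal(0, 0.5, 500)   # sample A: (x, y)
x_B = rng.normal(0, 1, 400); z_B = -0.4 * x_B + rng.normal(0, 0.7, 400)  # sample B: (x, z)

# MLE of theta_x from the pooled sample A union B.
x_pool = np.concatenate([x_A, x_B])
mu_x, var_x = x_pool.mean(), x_pool.var()

# MLEs of the conditional (regression) parameters: y|x from A, z|x from B.
beta_y = np.cov(x_A, y_A, bias=True)[0, 1] / x_A.var()
beta_z = np.cov(x_B, z_B, bias=True)[0, 1] / x_B.var()

# Under the CIA, Cov(Y, Z | X) = 0, so the implied unconditional covariance is:
cov_yz_cia = beta_y * beta_z * var_x
print(mu_x, beta_y, beta_z, cov_yz_cia)
```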

3 Where is the cohesion?

3.1 Statistical matching when it was just ‘merging’

The key estimation problem in SM and a few solutions to overcome it had been known since the ’50s. However, only the availability of electronic computers triggered the first merging/matching applications: “the increased interest in social problems at the microeconomic level, as well as the chances offered by the developing technologies, fostered the demand for disaggregated socio-economic and demographic information” (Okner 1972, p. 325). The ‘1966 merge file’ represents an early, rough approach in this direction. It was built from the 1967 Survey of Economic Opportunity and the 1966 Tax File referring to U.S. families (Okner 1972), in response to the lack of a consistent and comprehensive set of household data. The author answered the need for official statistics on the distribution of U.S. personal income and its cross-classification by typical demographic characteristics of the population. He used the matching variables \(\varvec{\textrm{x}}\) (wage, salary income, farm income...)Footnote 2 to set up a system of equivalence classes based on major and minor income sources and consistency scores. These, in turn, were used to assign ‘points’ to the units and thus to match them, obtaining a new data file where the detailed pattern of income is recorded in addition to people’s characteristics.

Sims (1972a, 1972b) criticized such an approach, since conditional independence was only implicitly assumed, while a more explicit theoretical framework was needed. Little was said about the ‘quality’ and validity of the outcome, and very limited consideration was given to the adjustments required to account for the under-reporting or non-reporting possibly present in the original survey file. In this regard, a small improvement was offered by Alter (1974), who proposed evaluating the concordance of the after-matching variables using cross-tabulation, integrating the 1970 Canadian Survey of Consumer Finances with the 1970 Family Expenditure Survey. However, the same author stressed that “the XYZ problem remains unsolved (...) since a joint distribution of X, Y, and Z cannot be inferred from the known distributions of X with Y, and X with Z” (Alter 1974, p. 374).

Other pending challenges not taken into account by these contributions are related to the (implicit) assumption that the vector \(\varvec{\textrm{x}}\) is defined in exactly the same way in A and in B, although the matching variables may be affected by errors of different types, magnitude, and frequency of occurrence.Footnote 3 Moreover, the peculiar, but not uncommon, cases (e.g., in the social sciences) of composite \(\varvec{\textrm{x}}\), or those of composite \(\varvec{\textrm{z}}\) and \(\varvec{\textrm{y}}\), were not considered either.Footnote 4

An important residual issue is related to the rationale applied for the selection of the matching variables. The choice of \(\varvec{\textrm{x}}\) was often data-driven, guided by the explanatory power measured by R\(^2\). Using the coefficient of determination to assess the strength of the relationship between (X, Y) or (X, Z) (and, hence, the validity of the CIA) in an imputation by regression is straightforward. However, the cases in which we are not interested in assigning mean values but, rather, want to reproduce the distributions of values in the original data and transfer complex sets of information present further challenges. This was clear to Ruggles and Ruggles (1974), who made explicit that, for matching purposes, no specific functional relationship must be determined in advance. How, then, to select the matching variables in the most efficient way? The authors proposed to match on the L-dimensional cross-tabulation of all the \(\varvec{\textrm{x}}\) variables between A and B. Matches are then made stochastically among the units that fall in the same cell. The assessment of the quality of the X variable intervals (i.e., the assurance that, within a specific interval of X, the distributions of Y and Z are invariant) is done with the Chi-square test on the Y and/or Z distributions. When significant differences are found, a correlation measure is computed to estimate how much the distributions differ. It is worth noticing that this discussion about the ‘quality’ of \(\varvec{\textrm{x}}\) also embeds considerations on the overall goodness and reliability of the final synthetic (complete) data set.
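A minimal sketch of this cross-tabulation logic, with made-up categories: a donor is drawn at random within the recipient’s cell, and a chi-square test checks whether the distribution of a categorized Z is invariant across the intervals of X; the data and categories are illustrative assumptions, not a reconstruction of the original 1974 procedure.

```python
# Minimal sketch of stochastic matching within cross-tabulation cells plus a
# chi-square check of the homogeneity of z across the x intervals.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(2)
B_sim = pd.DataFrame({
    "age_class": rng.choice(["18-39", "40-64", "65+"], size=300),
    "z_class": rng.choice(["low", "mid", "high"], size=300),
})

# Stochastic matching: one donor is drawn at random from the recipient's cell
# (here, the cell "40-64").
cell_donors = B_sim[B_sim["age_class"] == "40-64"]
drawn_donor = cell_donors.sample(1, random_state=0)

# Quality check of the x intervals: is the distribution of z invariant across cells?
table = pd.crosstab(B_sim["age_class"], B_sim["z_class"])
chi2, p_value, _, _ = chi2_contingency(table)
print(p_value)  # a large p-value suggests homogeneity of z across the x intervals
```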

The aforementioned proposals have a common root: the (implicit) use of a pseudo-distance based on a hierarchically nested set of cross-tabulated cells, built on the variables \(\varvec{\textrm{x}}\) that are in common between A and B. Indeed, by successively partitioning these variables into narrower intervals, it is possible to tag and re-tag the units; then, by sorting the tags, the units can be selected for matching. This pseudo-distance is thus similar to the weights attached to each matching variable X in a multivariate regression analysis that uses \(\varvec{\textrm{x}}\) as regressors and \(\varvec{\textrm{y}}\) and \(\varvec{\textrm{z}}\) as dependent variables.

The works discussed so far proposed the SM approach under a methodological framework different from that of Record Linkage, even if the procedures implemented were often named equivocally, like ‘linkage’, ‘fusion’, ‘concatenation’, etc. (Rässler 2002). The first specific matching proposal presented within a framework characterised like the one described in Sect. 2 appeared only later, in Kadane (1978). Considering a triple of normally distributed variables (X, Y, Z), the author concludes that the assumption of joint normality implies that the regressions between (X, Y) and (X, Z) are all linear, which is unlikely when real-world data are used. The solution proposed as a “way around the problem” (Kadane 1978, p. 424) is thus to adopt the aforementioned assumption only locally, region by region, in the X space and, hence, to resort to separate estimates of the covariances (except for \(\sigma _{YZ}\), which is unobservable). The residual cases in which information on \(\sigma _{YZ}\) can be retrieved are coarse samples that are nonetheless perfectly matched, from which certain elements of \(\sigma _{YZ}\) can be known, or the adoption of the CIA. However, given that \(\sigma _{YZ}\) cannot be consistently estimated from the data at hand, the solution proposed by the author was to use a particular value for \(\sigma _{YZ}\) with the goal of getting results that would yield a certain expected value (e.g., the expected amount of taxes to be raised by a specific tax schedule). Making assumptions about the distribution of \(\sigma _{YZ}\) and, hence, drawing values of the latter from that distribution finally yields results that can be weighted by the probability of the particular value of \(\sigma _{YZ}\), so that \(\sigma _{YZ}\) is effectively sampled. A drawback of this approach is that the more the assumption of normality loses reliability, the more methodological coherence diminishes. In addition, the proposed solution disregards any consideration about the validation of the final integration outcome.

3.2 The ascent of a method: applications from national agencies

The shortage of both theoretical foundations and empirical justification in Statistical Matching was made explicit for the first time by Rodgers (1984), who pointed out that any finding drawn from the matched data sets is questionable insofar as its validity largely depends on the assumptions made, in the first place, about the matched variables. However, since these assumptions cannot be tested, it is compulsory to check the consequences of their possible lack of validity. The author then proposed a first attempt at a cohesive SM notation, introducing the fundamental concepts of distance function and donation classes. A distance function is defined as the absolute difference in the values of X (e.g., age) computed between two observations coming from different data sets. For example, a generic, basic distance is definable by \(\vert x_i - x_j \vert \), for \(i = 1, \ldots , n_{\textrm{A}}\) and \(j = 1, \ldots , n_{\textrm{B}}\).Footnote 5 The donation classes are defined as homogeneous sub-groups of observations that help restrict the candidate donor-recipient pairs (for example, by partitioning the units by gender). Given that the data sets at hand give rise to \({n}_{\textrm{B}}^{{n}_{\textrm{A}}}\) possible combinations of donors and recipients, let \(X^\star \) be a discretized variable whose categories \(X^\star _f\), with \(f = 1, \ldots , F\), identify the donation classes, such that the potential number of donor-recipient combinations can be restricted to \((n_{\textrm{B},X ^\star _f})^{n_{\textrm{A},X ^\star _f}}\).
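The sketch below illustrates the distance function and the donation classes with toy ages and genders: distances |x_i - x_j| are computed only within classes, which shrinks the number of candidate donor-recipient assignments from \({n}_{\textrm{B}}^{{n}_{\textrm{A}}}\) to the class-wise restricted count; the data are illustrative assumptions.

```python
# Minimal sketch of a distance function combined with donation classes.
import numpy as np

age_A = np.array([34, 51, 28, 45]); sex_A = np.array(["F", "M", "F", "M"])           # recipients
age_B = np.array([30, 49, 27, 62, 44]); sex_B = np.array(["F", "M", "F", "F", "M"])  # donors

n_unrestricted = len(age_B) ** len(age_A)       # n_B ** n_A possible assignments
n_restricted = 1
for a, s in zip(age_A, sex_A):
    in_class = sex_B == s                       # donation class defined by gender
    distances = np.abs(age_B[in_class] - a)     # |x_i - x_j| within the class
    n_restricted *= in_class.sum()
    print(s, a, distances)

print(n_unrestricted, n_restricted)             # 625 vs 36 with these toy data
```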

Rodgers (1984) also hinted at a new integration perspective based on the usefulness of SM, raising the topic of the validity of findings resulting from analyses of statistically matched data. Such validity strongly depends on the accuracy of the underlying assumptions about the relationships between the variables. Given that A and B, separately considered, do not contain information about the relationships between variables \(\varvec{\textrm{y}}\) and \(\varvec{\textrm{z}}\), and that SM only reflects the assumptions (implicit or explicit) made during the matching procedure, the matched data set we end up with is “a risky basis for analyses of such relationships” (Rodgers 1984, p. 96). The author considers SM simulations and empirical applications in different scenarios, testing the validity of the CIA and discussing how much confidence can be placed in matching procedures, and when, according to the set of variables at disposal. Namely, the topics investigated are (1) unconstrained versus constrained matching, (2) which matching variables to include in a distance function, and (3) the minimum size of the input data set required to carry out a matching process.

The integrated data set is valuable for analyses involving the relationships among \(\varvec{\textrm{x}}\), \(\varvec{\textrm{y}}\), and \(\varvec{\textrm{z}}\) as long as the assumptions on such relationships made (or implied) by the analyst for the purposes of the integration procedure are robust. For example, suppose that the following linear regression model has to be estimated: \(\varvec{\textrm{z}} = \varvec{\textrm{x}}_{L-1} \cdot \varvec{\beta } + \varvec{\textrm{y}} \cdot \varvec{\lambda } + \varvec{\epsilon }\) (the endogenous outcome variables on the left-hand side are explained by both endogenous and exogenous explanatory variables on the right-hand side, with vectors of non-zero parameters plus stochastic errors).Footnote 6 As Klevmarken (1982) points out, the possibility of estimating the parameters of such a linear model depends on the availability, for each Y included in the model, of at least one of the r.v.s \(\varvec{\textrm{x}}\) used in the argument of the distance function that is excluded from the set of \(\varvec{\textrm{x}}_{L-1}\) variables. Briefly, suppose that \(\varvec{\textrm{x}}\), \(\varvec{\textrm{y}}\), and \(\varvec{\textrm{z}}\) are all included in the system defined by the expression \(\varvec{\textrm{x}} \cdot {\varvec{B}} + \varvec{\textrm{y}} \cdot \varvec{\Lambda } + \varvec{\textrm{z}} \cdot \varvec{\Gamma } = \varvec{\textrm{U}}\), with parameter matrices \({\varvec{B}}\), \(\varvec{\Lambda }\), \(\varvec{\Gamma }\) and a matrix of stochastic disturbances \(\varvec{\textrm{U}}\) such that \(E (\varvec{\textrm{U}}) = {\varvec{0}}\) (\(\varvec{\textrm{y}}\) and \(\varvec{\textrm{z}}\) are endogenous, \(\varvec{\textrm{x}}\) exogenous). This system clearly includes the previous equation, while another component is \(\varvec{\textrm{y}} = \varvec{\textrm{x}} \cdot \varvec{\pi } + \varvec{\textrm{V}}\), a reduced form of the complete system, where \(\varvec{\pi }\) collects the corresponding reduced-form parameters and \(\varvec{\textrm{V}}\) the corresponding disturbances. From the data set (sample) A, it is possible to estimate \(\varvec{\pi }\) and, hence, predict the values \(\hat{\varvec{\textrm{y}}}^{\textrm{B}}\) based on the observed values of the matching variables. In addition, using sample B it is possible to estimate \(\varvec{\textrm{z}}^{\textrm{B}} = \varvec{\textrm{x}}^{\textrm{B}}_{L-1} \cdot \varvec{\beta } + \hat{\varvec{\textrm{y}}}^{\textrm{B}} \cdot \varvec{\lambda } + \varvec{\epsilon }\). By rewriting the latter equation as \(\varvec{\textrm{z}}^{\textrm{B}} = \varvec{\textrm{M}}^{\textrm{B}} \cdot \varvec{\delta } + \varvec{\epsilon }\), where \(\varvec{\textrm{M}}^{\textrm{B}} = (\varvec{\textrm{x}}^{\textrm{B}}_{L-1} \vert \hat{\varvec{\textrm{y}}}^{\textrm{B}})\) and \(\varvec{\delta } = (\varvec{\beta } \vert \varvec{\lambda })\), the parameters in \(\varvec{\delta }\) can be estimated by ordinary least squares, i.e., \(\hat{\varvec{\delta }} = (\varvec{\textrm{M}}' \varvec{\textrm{M}})^{-1} \varvec{\textrm{M}}' \varvec{\textrm{z}}\), provided that the inverse matrix \((\varvec{\textrm{M}}' \varvec{\textrm{M}})^{-1}\) exists. Therefore, the issue is that the rank of \(\varvec{\textrm{M}}' \varvec{\textrm{M}}\) cannot exceed the number of variables \(\varvec{\textrm{x}}\), implying that at least as many matching variables must be omitted from \(\varvec{\textrm{z}} = \varvec{\textrm{x}}_{L-1} \cdot \varvec{\beta } + \varvec{\textrm{y}} \cdot \varvec{\lambda } + \varvec{\epsilon }\) as the number of \(\varvec{\textrm{y}}\) variables included.
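The two-step logic attributed above to Klevmarken (1982) can be sketched as follows, with simulated data and coefficient values that are purely illustrative: \(\varvec{\pi }\) is estimated in A, \(\hat{\varvec{\textrm{y}}}^{\textrm{B}}\) is predicted in B, and \(\varvec{\delta }\) is then estimated by OLS in B.

```python
# Minimal sketch of the two-step estimation: pi from A, y-hat in B, then OLS in B.
import numpy as np

rng = np.random.default_rng(3)
n_A, n_B = 500, 400

# Sample A: matching variables x = (x1, x2) and y (z is not observed).
x_A = rng.normal(size=(n_A, 2))
y_A = x_A @ np.array([1.0, -0.5]) + rng.normal(0, 0.5, n_A)

# Sample B: the same x plus z (y is not observed).
x_B = rng.normal(size=(n_B, 2))
y_B_true = x_B @ np.array([1.0, -0.5]) + rng.normal(0, 0.5, n_B)
z_B = 0.7 * x_B[:, 0] + 1.5 * y_B_true + rng.normal(0, 0.5, n_B)

# Step 1: estimate pi from A (reduced form y = x * pi + V) and predict y in B.
pi_hat, *_ = np.linalg.lstsq(x_A, y_A, rcond=None)
y_B_hat = x_B @ pi_hat

# Step 2: OLS of z on M = (x_{L-1} | y-hat) in B; here x_{L-1} excludes x2,
# which enters the prediction of y but not the structural equation for z.
M = np.column_stack([x_B[:, 0], y_B_hat])
delta_hat = np.linalg.inv(M.T @ M) @ M.T @ z_B
print(delta_hat)   # estimates of (beta, lambda)
```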

Still on the assessment of matching goodness, and limited to categorical matching, Walter (1984) investigated the sampling effort required to obtain matches for all the units in a given sample. Using Markov chains, the author derived the first two moments of the exact distribution of the sample size required to fill the match quotas in all the categories of the chosen matching variable(s). Hence, the dependence of the matching difficulty on sample size, the number of matching categories, and the distributions of category probabilities and quotas is considered. The author approached the problem by assuming that the sample from the first population is given and by sampling the second population repeatedly (until all units of the first sample have been matched). Walter (1984) demonstrates that (1) for a fixed number of matching categories, larger samples are easier to match than small ones, (2) the mean sample size increases with the number of matching categories, and (3) matching gets easier if the category sampling probabilities are proportional to their quotas.
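A Monte Carlo sketch of this quota-matching set-up, approximating by simulation the moments that Walter (1984) derived exactly via Markov chains; the quotas and category probabilities are illustrative assumptions.

```python
# Minimal sketch: units from the second population are sampled until every
# category quota set by the first sample is filled.
import numpy as np

rng = np.random.default_rng(4)
quotas = np.array([30, 20, 10])          # category counts in the first sample
probs = np.array([0.5, 0.3, 0.2])        # category sampling probabilities

def sample_size_to_fill(quotas, probs, rng):
    remaining = quotas.copy()
    n_drawn = 0
    while remaining.sum() > 0:
        cat = rng.choice(len(probs), p=probs)
        if remaining[cat] > 0:
            remaining[cat] -= 1
        n_drawn += 1
    return n_drawn

sizes = [sample_size_to_fill(quotas, probs, rng) for _ in range(2000)]
print(np.mean(sizes), np.std(sizes))     # first two moments, approximated by simulation
```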

The latter situation often occurs when the matching variables are distributed similarly and there are only weak confounders. In addition, substantial oversampling is anticipated whenever the category probabilities and the quotas are far from being proportional. The problem then becomes the optimal degree of similarity required between the matching variables in samples A and B in order to make matching easier without losing precision. In this regard, the case of continuous \(\varvec{\textrm{x}}\) emerges, which had been disregarded so far. In this direction, no further developments were proposed during the ’80s: most contributions focussed on real-data applications, with national departments and federal offices of the U.S. and Canada at the forefront (see, for example, Radner et al. 1980; Rodgers and DeVol 1981; Gavin 1985; Armstrong 1989, and the references therein).

Despite these gaps, the practical contributions of the ’80s still played a fundamental role in advancing the whole SM framework, which began to be thought of, at that time, as a ‘file-merging technique’ distinct from Record Linkage. For example, Barry (1988) points out that RL is an ‘exact matching’ method, stressing that it is built on pseudo-identifiers that allow linking entities from different data sources. In contrast, statistical matching deals with units that are similar but not (necessarily) the ‘same’. The main goal of SM was stated as “integrating data on an individual, from one source, with data on a different observation (from another source) if the two units are identified as the best matching or the most similar units” (Gavin 1985, p. 183). In other words, it became clear that statistical matching contrasts with RL “because the set of units in the two files for statistical matching may be completely disjoint or have only a small unknown overlap” (Singh et al. 1988, p. 672). Such considerations shed light on the usefulness of SM for integrating data, particularly when privacy constraints hold. Indeed, as a consequence of data privacy concerns and the growing debate on this topic, a progressive shortage or lack of information (e.g., due to the removal of units’ unique identifiers) made the linkage of records from different data sources more complex, fostering, in turn, the diffusion of new data integration solutions.

Completing the SM framework, Singh et al. (1988) and Armstrong (1989) analysed alternative approaches to SM, above all through log-linear modelling imputation. The novelty of the proposal lies in the estimation of the conditional distribution to be imputed in the categorical framework represented by \(f(X^\star , Y^\star , Z^\star )\), where the star-variables are categorical and the related joint distribution \(f(\varvec{\cdot })\) is a probability mass function. The idea is to transform the classic SM problem related to f(X, Y, Z) into one that involves categorical variables. After having suitably selected the partitioning of the (X, Y, Z) space into categories for \(f(X^\star , Y^\star , Z^\star )\), first, Z is imputed up to a \(Z^\star \) category by exploiting \(f( Z^\star \vert X^\star )\) within the imputation class \((X^\star , Z^\star )\); second, a value of Z within the \(Z^\star \) category is chosen. The main advantage of such an approach is that CIA violations can be easily controlled in a categorical framework that is ‘approximately the same’ as f(X, Y, Z). Hence, a subset of X variables can be selected as suitable predictors, ending up with optimal imputation classes (according to an instability measure definable on the coarseness of the categorical partitions).

Such a solution is particularly relevant when the integration is oriented towards microsimulation models and there is a need for specific information that has a low probability of occurring. Non-exhaustive examples are high-income observations, frequent response errors, and/or poor information detail. These problems can be addressed by feeding microsimulation modelling with SM imputation such that (1) the computational effort required by data integration can be reduced and (2) the potential drawbacks of non-linear relationships among X, Y, Z (which could bias the results obtained from the analysis of the integrated data set) may be avoided or mitigated by means of careful control of the transformed categorical framework (Cohen 1991).

3.3 Finally, the methodology arrived

“Micro-simulation databases which are frequently used by policy analysts and planners, are created by several datafiles that are combined by Statistical Matching” (Singh et al. 1993, p. 59), a method whose development was drastically sped up by the discussion of the simulation results of three alternative SM ‘techniques’: regression-based, distance-based, and log-linear ones. The empirical evidence offered by Singh et al. (1993) suggested that distance-based SM (i.e., the hot deck techniques discussed in detail in Sect. 4) performs better than regression-based SM. In contrast, log-linear methods should be preferred if auxiliary information is available and, hence, CIA can be relaxed by adopting categorical constraints. Similar conclusions are drawn by Schulte Nordholt (1998), who compared the results from simulations and real-world applications using Dutch data. Significantly, to date, this work can be considered, together with Renssen (1998), the first SM application with data referring to a European country (the Netherlands), beyond the original German and French ‘data fusion’ attempts of the late ’80s/early ’90s (discussed in Appendix A).

At the dawn of the new millennium, considerations like “statistical matching has been widely used by practitioners without always adequate theoretical underpinnings” (Moriarity and Scheuren 2001, p. 407) and “throughout the world, today, we find synonyms used to describe the Statistical Matching process including ‘data fusion’, ‘data merging’ or ‘data matching’, ‘mass imputation’, ‘microsimulation modelling’ and ‘file concatenation’ (...) with a dragged discussion about a suitable and clarifying denotation” (Rässler 2002, p. 2) signalled the need for more cohesion in the formalization of SM. The widespread perception was that SM was largely used, mainly for practical purposes, by OS (Moriarity and Scheuren 2003), as Rässler (2002) also states: “much of the literature describing traditional approaches and techniques are working papers, technical or internal reports” (p. 44).

Two main contributions answered the call for more solid theoretical foundations: Rässler (2002) and D’Orazio et al. (2006b). They provided a cohesive theoretical framework for the SM methodology, discussing the main implications of the CIA and the use of auxiliary information (D’Orazio et al. 2006b), and comparing several alternatives to SM with a specific focus on Bayesian solutions (Rässler 2002). The Bayesian line was further developed by Rässler (2003), who adapted and further implemented the framework of Rubin (1987), employing a non-iterative Bayesian alternative to his regression model.

The key challenge in SM then became finding a reliable alternative to the CIA. By approaching the SM problem as a non-response issue, the core idea embraced the fundamental ‘identification problem’. Even though the missingness mechanism is ignorable, the association of the variables that are not jointly observed is not identifiable and, hence, cannot be estimated by likelihood. Therefore, either there is additional information on f(X, Y, Z), or the researcher must resort to several imputations, possibly based on informative priors. In such a context, Rässler (2004) proposed framing the identification problem according to four levels of validity that SM may achieve, namely: 1st level, preserving the individual values; 2nd level, preserving the joint distributions; 3rd level, preserving the correlation structures; 4th level, preserving the marginal distributions. Usually, the latter level is the one that can be widely controlled in SM. If the conditional association (i.e., that of the variables not jointly observed, given the variables in common between A and B) cannot be estimated from the data at hand, admissible values for the unconditional association of Y and Z can be estimated instead. How? Depending on the explanatory power of the matching variables, a narrower or wider range of admissible values can be estimated (Rässler 2004).

4 The non-parametric and Bayesian approaches

During the last two decades, non-parametric SM gained relevant attention due to the fact that (1) it entirely exploits the ‘live’, observed information (D’Orazio et al. 2006b), (2) it reduces the possible model-misspecification bias deriving from the assumption(s) made on the parameters of the joint family of distributions f(X, Y, Z) (Conti et al. 2017b), and (3) it decreases the computational effort required by parametric SM (D’Orazio 2015). Even though non-parametric techniques require no distributional ‘assumption’, it should be noted that their application still depends on (1) the choice of the distance function, (2) whether (and how) to build donation classes, and (3) the sampling mechanism for the selection of donors.

The methodological advances conveyed by non-parametric SM are related to the concept of ‘matching noise’: its definition, its quantification, and the role it plays in the integration procedure. The starting point is that the joint distribution f(X, Y, Z) obtained after statistical matching may not coincide with the ‘true’ (unobservable) distribution. Hence, “the imputed data set is not a real data set and the statistical conclusions drawn from it are questionable” (Marella et al. 2008, p. 1593). Whenever the two distributions differ, there is matching noise, and researchers must aim at minimizing it.

We have two data sets: \(\hbox {A} = \bigg \{ \underset{n_\textrm{A}\times L}{\varvec{\textrm{x}}^\textrm{A}} \bigg \}\) and \(\hbox {B} = \bigg \{ \underset{n_\textrm{B}\times L}{\varvec{\textrm{x}}^\textrm{B}}\), \(\underset{n_\textrm{B}\times P}{\varvec{\textrm{z}}^\textrm{B}} \bigg \}\). Suppose we want to build the synthetic (complete) data set \(\hbox {C} = \bigg \{ \underset{n_\textrm{A}\times L}{\varvec{\textrm{x}}^\textrm{A}}\), \(\underset{n_\textrm{A}\times S_d}{\varvec{\textrm{z}}^\textrm{A}} \bigg \}\). Marella et al. (2008) pointed out that this generated data set will result from the distribution \(f(\varvec{\textrm{x}}^{\textrm{A}},\varvec{\textrm{z}}^{\textrm{A}}) = \int f({\varvec{x}}^\textrm{B}_{{\tilde{j}}}\vert {\varvec{x}}^\textrm{A}) f({\varvec{z}}^\textrm{A}\vert {\varvec{x}}^\textrm{B}_{{\tilde{j}}}) \textrm{d}{\varvec{x}}^\textrm{B}_{{\tilde{j}}}\), where \({\varvec{x}}^\textrm{B}_{{\tilde{j}}}\) are the r.v.s observed for the donor units matched with the recipient ones (i.e., the \({\tilde{j}}\)-th donor matched with the i-th recipient), while \({\varvec{z}}^\textrm{A}\) are the r.v.s imputed in the recipient data set based on the matched units’ pairs. It follows that the matching noise is a composite of the donor distribution \(f({\varvec{x}}^\textrm{B}_{{\tilde{j}}}\vert {\varvec{x}}^\textrm{A})\) and the values of the imputed variables observed for the matched donor.

Conti et al. (2008) compared the performances (in terms of matching noise minimization) obtained from different non-parametric imputation strategies: hot deck techniques, the k-Nearest Neighbour method (kNN), and local linear regression (which does not require the assumption of linearity for the underlying population regression function). In addition, in previous work the authors considered, specifically, the kNN method (Marella et al. 2008), evaluating the matching noise produced by imputation with both a fixed and a variable number of donors. In the former case, the class of imputation procedures that includes distance-based and random hot deck techniques is defined by assuming that the k donors to a unit \(i \in \) A are given by the k nearest neighbours of \(\varvec{\textrm{x}}_i\) in B (with \(i = 1, \ldots , n_{\textrm{A}}\)). Let d be a (weighted) Euclidean distance such that \(d(\varvec{\textrm{x}}^{\text {A}}_i, \varvec{\textrm{x}}^{\textrm{B}}_j) = \left[ (\varvec{\textrm{x}}^{\text {B}}_j - \varvec{\textrm{x}}^{\text {A}}_i)' \varvec{\textrm{D}} (\varvec{\textrm{x}}^{\text {B}}_j - \varvec{\textrm{x}}^{\text {A}}_i) \right] ^{1/2}\), with \(\varvec{\textrm{D}}\) a positive definite matrix. The k nearest neighbours of \(\varvec{\textrm{x}}^{\textrm{A}}_i\) are the \(k \geqslant 1\) observations in B that are the closest to \(\varvec{\textrm{x}}^{\textrm{A}}_i\) according to d, i.e., the observations \(\varvec{\textrm{x}}^{\textrm{B}}_{j(i)} = (\varvec{\textrm{x}}^{\textrm{B}}_{j_1(i)}, \ldots , \varvec{\textrm{x}}^{\textrm{B}}_{j_k(i)}) \). With a fixed number of donors, it may happen that donors are sparse and, hence, the kNN method treats observations far from \(\varvec{\textrm{x}}^{\textrm{A}}_i\) as equally informative on \(\varvec{\textrm{z}}^{\textrm{A}}_i\). The authors therefore suggest letting the optimal value of k vary with \(\varvec{\textrm{x}}^{\textrm{A}}_i\), allowing a different number of donors to be matched with each \(\varvec{\textrm{x}}^{\textrm{A}}_i\). This is done by fixing a threshold such that the observations whose distance \(d(\varvec{\textrm{x}}^{\text {A}}_i, \varvec{\textrm{x}}^{\textrm{B}}_j)\) is smaller than the threshold are selected as neighbours of \(\varvec{\textrm{x}}^{\textrm{A}}_i\).
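The sketch below contrasts the two donor-selection rules just described, using a weighted Euclidean distance with \(\varvec{\textrm{D}}\) set to the inverse covariance matrix of the donors’ matching variables; the data, the value of k, and the threshold are illustrative assumptions.

```python
# Minimal sketch of kNN donor selection: fixed k versus a distance threshold.
import numpy as np

rng = np.random.default_rng(5)
x_A = rng.normal(size=(5, 2))      # matching variables of the recipients
x_B = rng.normal(size=(50, 2))     # matching variables of the donors
z_B = x_B.sum(axis=1) + rng.normal(0, 0.3, 50)   # z observed only in B

D = np.linalg.inv(np.cov(x_B, rowvar=False))     # positive definite weight matrix
def dist(a, b):                                  # d(x_i^A, x_j^B)
    diff = b - a
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, D, diff))

k, threshold = 3, 0.5
for x_i in x_A:
    d_i = dist(x_i, x_B)
    fixed_donors = np.argsort(d_i)[:k]              # k nearest neighbours
    variable_donors = np.where(d_i < threshold)[0]  # all donors within the threshold
    z_fixed = z_B[fixed_donors].mean()
    z_variable = z_B[variable_donors].mean() if len(variable_donors) else z_fixed
    print(len(variable_donors), round(float(z_fixed), 2), round(float(z_variable), 2))
```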

The simulation study results of Marella et al. (2008) hint at using (large) donor data sets with a variable number of donors k, possibly adjusting the mean imputation with residuals. In Conti et al. (2008), to evaluate the closeness between the data-generating model and the imputation-generating model, the authors propose a simulation study elaborating a Kolmogorov–Smirnov distance-based measure of divergence. Results show that kNN performs worst when the number of donors k is fixed, underestimating variability since \(f({\varvec{z}}^\textrm{A}\vert {\varvec{x}}^\textrm{B}_{{\tilde{j}}})\) is concentrated on the expectation of \(\varvec{\textrm{z}}\vert \varvec{\textrm{x}}\). Moreover, the authors suggest preferring local linear regression estimators when a complex functional relationship holds between the variables.

As with non-parametric SM, the approach proposed by Rässler (2002) represents a turning point among the suitable ‘alternatives’ to traditional SM. Indeed, the author innovated the SM framework by embedding it in Multiple Imputation, proposing the transfer of information through Bayesian inference, while the results are validated in a frequentist way. Rässler’s starting point was the need to provide public use files for end-users by integrating two or more data sources. She stressed that the ‘public use file’ is characterized by the fact that the matched data are passed on to others, usually outside the OS. Therefore, file users/data analysts often differ from the user who performed the imputation. This poses a classical imputation challenge that, the author says, cannot be solved by weighting, calibration, or the EM algorithm (Rässler 2002). Indeed, the SM problem cannot be handled by the observed-data likelihood nor by the EM algorithm without making explicit assumptions about the variables that are never jointly observed. Due to the inestimability of certain parameters (whenever the underlying model cannot be specified by the data at hand), SM poses a problem of identification, i.e., there are several feasible associations potentially describing the joint distribution of the variables not jointly observed.

At its core, the identification problem treated by Rässler (2002) can be framed as follows. Consider a sample of n individuals. A question is asked of them, and \(n_0\) represents the number of people refusing to answer. We are interested in an outcome variable W taking values 0, 1. Let p be the proportion of the \(n_1 = n - n_0\) individuals for whom we observe \(W = 1\). Aiming to estimate \(\textrm{P} (W = 1)\), we often resort simply to p, thereby implicitly assuming that the unobserved outcomes of W have the same distribution as the observed ones. But if we consider p only as a good estimate of \(\textrm{P} (W = 1 \vert R = 1)\), where \(R = 1\) indicates that the outcome variable W has been observed, while \(R = 0\) indicates that a person decided not to answer, we have that \(\textrm{P}(W = 1) = \textrm{P}(W = 1, R = 1) + \textrm{P}(W = 1, R = 0) = \textrm{P}(W = 1 \vert R = 1)\textrm{P}(R = 1) + \textrm{P}(W = 1 \vert R = 0)\textrm{P}(R = 0)\). Of course, \(\textrm{P}(W = 1 \vert R = 0)\) is not known and, by using p as an estimate for \(\textrm{P}(W = 1)\), we are assuming that \(\textrm{P}(W = 1 \vert R = 0)\) is also estimated by p. However, all that is known is that \(\textrm{P}(W = 1 \vert R = 0)\) lies in [0, 1]; thus, the lower and upper bounds of \(\textrm{P} (W = 1)\) can be estimated from \(\textrm{P}(W = 1) \le \textrm{P}(W = 1 \vert R = 1) \textrm{P}(R = 1) + \textrm{P}(R = 0)\) and \(\textrm{P} (W = 1) \ge \textrm{P}(W = 1 \vert R = 1)\textrm{P}(R = 1)\). \(\textrm{P}(R = 1)\) and \(\textrm{P}(R = 0)\) can be estimated by \(n_1/n\) and \(n_0/n\), respectively; thus, the bounds are estimated by \(p \frac{n_1}{n} \le \hat{\textrm{P}}(W = 1) \le p \frac{n_1}{n} + \frac{n_0}{n}\). This concept of ‘identification’ is further developed by Rässler (2002), who uses MI to estimate upper and lower bounds of the unconditional association.
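With made-up counts, the bounds read as follows.

```python
# Minimal numeric illustration of the partial-identification bounds above:
# n = 1000 respondents, n_0 = 200 refusals, p the observed share of W = 1.
n, n_0 = 1000, 200
n_1 = n - n_0
p = 0.6                                  # observed proportion of W = 1 among respondents

lower = p * n_1 / n                      # assumes P(W = 1 | R = 0) = 0
upper = p * n_1 / n + n_0 / n            # assumes P(W = 1 | R = 0) = 1
print(lower, upper)                      # 0.48 <= P(W = 1) <= 0.68
```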

The solution proposed is based on Bayesian inference and the data augmentation algorithm. A probability model for the observed data is specified given the vector of unknown parameters \(\varvec{\theta } \in \Theta \). Then, \(\varvec{\theta }\) is treated as a random variable with a certain prior distribution, and inference about it is summarized by its posterior, given the data at hand. Rässler (2002) proved that the joint distribution likelihood receives contributions from both the observed data and the prior. Moreover, when data are unobserved, the prior (predictive) distribution does not condition on previous observations. The identification problem is then addressed using prior information and MI procedures. CIA is overcome by MI techniques using informative priors. Different prior settings on the conditional associations allow us to show the sensitivity of the unconditional association to the role that the common variables play in determining it.

The embryonic framework proposed by Rässler (2002) was further extended by Rässler (2003) through the Non-Iterative Bayesian Approach to Statistical Matching (NIBAS). No distributional assumptions are made on \(\varvec{\textrm{x}}\); the only requirement is that the matching variables can serve as predictor matrices in a linear regression model. Due to the particular structure of missingness characterising SM, it is possible to define both a data model and a prior distribution and, consequently, derive the observed-data posterior from them. By means of MI procedures, prior information is used to impute missing data and, from the imputed data, lower and upper bounds can be estimated to obtain a range of values for the unconditional association parameters. Such a range serves as a quality measure for SM.

Bayesian solutions to the identifiability problem of SM were fundamental for developing the method because they made explicit that whenever two variables are not (or, better, they are never) jointly observed, the related conditional association parameters cannot be estimated by likelihood inference. In contrast, the nearest neighbour solutions proposed and applied over the years are often undermined by the fact that conditional independence is produced de facto, even if it is not assumed. To overcome this drawback and its consequences, NIBAS assumes (at least) univariate normality for Y and Z, while \(f(\varvec{\textrm{y}},\varvec{\textrm{z}} \vert \varvec{\textrm{x}})\) is assumed to be multivariate normal. Then, NIBAS assumes independence between the regression parameters of the general linear models for data sets A and B and the covariance matrix \(\varvec{\Sigma }_{\varvec{\textrm{y}},\varvec{\textrm{z}}\vert \varvec{\textrm{x}}}\) and, with a suitable non-informative prior, the observed-data posterior distribution and the conditional predictive distributions can be derived. From the latter, random draws for the parameters as well as the imputed \(\varvec{\textrm{y}}\) and \(\varvec{\textrm{z}}\) can be obtained.

Aiming to evaluate the predictive power of the matching variables \(\varvec{\textrm{x}}\), Rässler (2004) demonstrated through simulations that the Bayesian approach offers a relevant advantage: whereas regression imputation yields estimates of the true population correlation that are not unbiased (not even asymptotically), Bayesian SM preserves the prior values of the conditional correlation, outperforming all the other approaches. This suggests that, when auxiliary data are available and prior information must be used, the Bayesian multiple imputation procedure proposed by the author is the best choice at hand.

5 Uncertainty: old issue, new challenges

5.1 A matter of constraints

The first half of the ’00s saw the rise of the ‘third way’ to solve the identification problemFootnote 7 in SM. Usually, this had been addressed by means of specific modelling on (Y, Z) or auxiliary information on f(Y, Z). Different approaches to the problem were named ‘uncertainty analysis’, ‘partial identification’, or ‘lower and upper probabilities study’. All of them contributed to raising awareness that the main goal of SM (at least from a macro point of view) had to be the estimation of the range of potential values for the unidentifiable parameters, consistently with the estimable ones (Di Zio and Vantaggi 2017). In other words, the focus had to be the reduction of the uncertainty about the association parameters of the variables never jointly observed, by means of the common variables.

D’Orazio et al. (2006a) state that, even if there were complete knowledge of the distributions f(X, Y) and f(X, Z), it would not be possible to conclude anything about f(X, Y, Z), merely because the joint distribution can be predicted only if there is a deterministic relationship between the two bivariate distributions. Considering the cross-tabulation approach in relation to the variables X, Y, and Z (a rather common practice of the ’70s), let \((X^\star , Y^\star , Z^\star )\) be a triplet with numbers of categories F, G, H, respectively, such that the table’s cells are definable as \(\iota = \{(f, g, h): f = 1, \ldots , F; g = 1, \ldots , G; h = 1, \ldots , H\}\). Therefore, the joint distribution \(f(X^\star , Y^\star , Z^\star )\) is multinomial, unknown, and defined by \(\theta _{fgh} = \textrm{P}(X = f, Y = g, Z = h)\) for \(f = 1, \ldots , F\), \(g = 1, \ldots , G\), \(h = 1, \ldots , H\). Thus, the true, unknown vector of parameters \(\varvec{\theta }^*_{fgh}\) defines the distribution. D’Orazio et al. (2006a) state that this vector is totally uncertain but, by assuming complete knowledge of the marginal distributions of the pairs \((X^\star , Y^\star )\) and \((X^\star , Z^\star )\), it can be restricted. The parameter \({\theta }^*_{fgh}\) lies in an interval defined by lower and upper limits, such that all its plausible values determine a density function (made up of the frequencies of all \(\varvec{{\theta }}^*_{fgh}\)). By resorting to the MLEs \({\hat{\theta }}_{fg.}\) and \({\hat{\theta }}_{f.h}\), it can be proved that suitable constraints help rule out illogical values from \(\Theta \).

Approaching the problem from the point of view of ecological inference, Conti et al. (2013) estimate the joint distribution of ordered categorical variables \(f(X^\star , Y^\star , Z^\star )\) starting from a contingency table where the population counts provide the marginals. If the row and column counts of the table come from different samples (A and B), the problem is purely how to estimate the joint distribution function, i.e., a macro SM issue. The proposed solution is to estimate a class of possible distributions for \((X^\star , Y^\star , Z^\star )\), identifying a measure of uncertainty for the estimated model. The uncertainty is defined using the upper and lower bounds of the cell counts, with conditional and unconditional measures of uncertainty possibly restrained by means of structural-zero constraints. In SM, these are frequently used constraints on the parameters when the r.v.s of interest are categorical. Such constraints consist of setting \(\theta _{fgh} = 0\) for some (f, g, h) (Agresti 2013). A structural zero occurs when (1) at least one pair of categories in (f, g, h) is not compatible or (2) each pair in (f, g, h) is plausible but the triplet is not compatible. Such a constraint is useful for integration purposes since its main effect is to potentially reduce the likelihood ridge to a unique distribution. How? When the goal is to restrict \(\Theta \) to a subspace \(\Omega \subset \Theta \) (closed and convex), we have to find the set of \(\varvec{{\theta }} \in \Omega \) such that the likelihood function \(L(\varvec{{\theta }} \vert \textrm{A} \cup \textrm{B})\) is maximized. Denoting by \({\mathcal {P}}\) the parameter subsets, when \(\Omega \bigcap {\mathcal {P}}_{ \hat{\varvec{\theta }}_{\varvec{\textrm{x}}} \hat{\varvec{\theta }}_{\varvec{\textrm{y}} \vert \varvec{\textrm{x}}} \hat{\varvec{\theta }}_{\varvec{\textrm{z}} \vert \varvec{\textrm{x}}} } \ne \emptyset \), i.e., the subspace has a non-empty intersection with the unconstrained likelihood ridge, structural zeros can be so informative that, e.g., \(\Omega \bigcap {\mathcal {P}}_{ \hat{\varvec{\theta }}_{\varvec{\textrm{x}}} \hat{\varvec{\theta }}_{\varvec{\textrm{y}} \vert \varvec{\textrm{x}}} \hat{\varvec{\theta }}_{\varvec{\textrm{z}} \vert \varvec{\textrm{x}}} } = \hat{\varvec{\theta }} \). For example, defining \((G - 1)(H - 1)\) independent structural-zero constraints for each \(X = f\), \(f = 1, \ldots , F\), is sufficient for a unique ML estimate. In such a context, the simulation study results of Conti et al. (2013) show that the uncertainty reduction is directly proportional to the reduction of the support of the conditional distribution of \(Y^\star \) and \(Z^\star \) given \(X^\star \). In addition, the uncertainty largely depends on the informativeness of the structural-zero constraints.

The class of possible distributions for \((X^\star , Y^\star , Z^\star )\) (these variables being categorical, though Conti et al. 2017b considered continuous Y and Z and discrete X, as discussed later) is estimable by means of the so-called Fréchet bounds (or ‘uncertainty class’) (D’Orazio et al. 2017). Indeed, the latter allow us to identify the plausible lower and upper bounds for the parameters that must be estimated to define the conditional distributions (\(Z \vert X\)) and (\(Y \vert X\)). Namely, the cell frequencies \(\theta _{yz}\) of the (Y, Z) contingency table, given the estimates \({\hat{\theta }}_{y \vert x}\) from A, \({\hat{\theta }}_{z \vert x}\) from B, and \({\hat{\theta }}_{x}\) from \(A \cup B\), can be obtained by means of the class identified by

$$\begin{aligned} \text {max}\{0; {\hat{\theta }}_{y \vert x} + {\hat{\theta }}_{z \vert x} - 1\} \le {\hat{\theta }}_{yz \vert x} \le \text {min}\{{\hat{\theta }}_{y \vert x}; {\hat{\theta }}_{z \vert x}\}. \end{aligned}$$
(1)

By means of this uncertainty class, we can not only evaluate the uncertainty in SM but also validate the whole integration. Indeed, Rässler (2002) proposed to evaluate the length of such a class for the unidentifiable parameters in the multivariate normal case, so as to define a measure of the reliability of the estimates under the CIA. The author’s results hint at the fact that, when the uncertainty classes are short, the parameter estimates obtained under different models differ only slightly from those obtained under the CIA. In addition, a measure of uncertainty is defined by Rässler (2002) as \(\frac{1}{K} \sum _{k=1}^{K} \left( {\hat{\theta }}^{(\textrm{U})}_k - {\hat{\theta }}^{(\textrm{L})}_k \right) \), where \(\theta _k\), \(k = 1, \ldots , K\), are the unidentifiable parameters in a parametric model for \((X, Y, Z)\), while \({\hat{\theta }}^{(\textrm{U})}_k\) and \({\hat{\theta }}^{(\textrm{L})}_k\) are the estimated upper and lower bounds of the uncertainty class defined on these parameters.
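As a minimal numerical sketch (with hypothetical conditional probabilities, not taken from any of the cited studies), the cell bounds in Eq. 1 and an average-width summary in the spirit of Rässler (2002) can be computed as follows:

```python
import numpy as np

# Illustrative (hypothetical) estimates for a single category X = x:
# conditional probabilities of Y (3 categories) and Z (2 categories) given X = x.
theta_y_given_x = np.array([0.5, 0.3, 0.2])   # sums to 1
theta_z_given_x = np.array([0.6, 0.4])        # sums to 1

# Frechet bounds for each cell theta_{yz|x} (Eq. 1):
# max{0, theta_{y|x} + theta_{z|x} - 1} <= theta_{yz|x} <= min{theta_{y|x}, theta_{z|x}}
lower = np.maximum(0.0, theta_y_given_x[:, None] + theta_z_given_x[None, :] - 1.0)
upper = np.minimum(theta_y_given_x[:, None], theta_z_given_x[None, :])

# Rassler-type summary: average width of the bounds over the unidentifiable cells.
average_width = np.mean(upper - lower)

print("lower bounds:\n", lower)
print("upper bounds:\n", upper)
print("average width of the uncertainty intervals:", round(average_width, 3))
```

The closer the average width is to zero, the less the choice of a particular model within the uncertainty class matters for the final estimates.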

The intuition was further developed by Conti et al. (2017b), who proposed a measure of uncertainty and studied its properties in the non-parametric context. From the parametric point of view, SM uncertainty is quantifiable in terms of the range of the estimates of the unidentifiable parameters. In contrast, non-parametrically speaking, such a measure relates to the ‘intrinsic’ association between the pair of variables \((Y, Z)\). Using the Fréchet bounds as a starting point, a measure of uncertainty is given by a suitable functional that quantifies the length of the uncertainty class. Let the joint distribution function of the three r.v.s \((X, Y, Z)\) be such that \(d F(x, y, z) = dQ(x) \ dS(y, z \vert x)\), where Q(x) is the marginal distribution function of X, while \(S(y, z \vert x)\) is the distribution function of \((Y, Z)\) given X. Here X is a discrete matching variable, while Y and Z are continuous. Conditionally on X, the set of plausible models (in other words, the Fréchet class) of all distribution functions \(S(y, z \vert x)\) compatible with the univariate distribution functions \(G(y \vert x)\) and \(H(z \vert x)\) can be obtained. Let \(\textrm{L}\) and \(\textrm{U}\) be the lower and upper bounds \(\text {L} \left[ G(y \vert x), H(z \vert x) \right] = \text {max} \left[ H(z \vert x) + G(y \vert x) - 1, \ 0 \right] \) and \(\text {U} \left[ G(y \vert x), H(z \vert x) \right] = \text {min} \left[ H(z \vert x), G(y \vert x) \right] \) (analogous to the bounds in Eq. 1). Then, for every \((y, z)\), the inequalities \(\text {L} \left[ G(y \vert x), H(z \vert x) \right] \le S(y, z \vert x) \le \text {U} \left[ G(y \vert x), H(z \vert x) \right] \) hold, where the bounds \(\textrm{L}\) and \(\textrm{U}\) are joint distribution functions with margins \(G(y \vert x)\) and \(H(z \vert x)\), and the Fréchet class of these two distributions is defined as follows

$$\begin{aligned} {\mathcal {S}} = \left\{ S(y, z \vert x): \text {L} \left[ G(y \vert x), H(z \vert x) \right] \le S(y, z \vert x) \le \text {U} \left[ G(y \vert x), H(z \vert x) \right] \right\} . \end{aligned}$$
(2)

Therefore, the set of distribution functions \({\mathcal {S}}\) defines the uncertainty class in the non-parametric SM framework. Taking the expectation with respect to the distribution of X, the unconditional Fréchet class can be defined as

$$\begin{aligned} \begin{aligned} {\mathcal {S}} = \left\{ S(y, z): E \left[ \text {L} \left( G(y \vert x), H(z \vert x) \right) \right] \le S(y, z) \le E \left[ \text {U} \left( G(y \vert x), H(z \vert x) \right) \right] \right\} . \end{aligned} \end{aligned}$$
(3)

Clearly, the uncertainty class in Eq. 3 does not take advantage of the common variables observed between A and B.

Since each category of X is observed in both A and B, the estimator of the Fréchet class can be obtained by rewriting Eq. 1 as follows

$$\begin{aligned} \left\{ \text {max} \left[ {\hat{H}}_{n_{\text {B}}}(z \vert x) + {\hat{G}}_{n_{\text {A}}} (y \vert x) - 1, 0 \right] , \ \text {min} \left[ {\hat{H}}_{n_{\text {B}}}(z \vert x), {\hat{G}}_{n_{\text {A}}} (y \vert x) \right] \right\} . \end{aligned}$$
(4)

In addition, the unconditional Fréchet bounds are estimated by

$$\begin{aligned} \begin{aligned} \bigg \{ \sum _{x} {\hat{p}}(x) \ \text {max} \left[ {\hat{H}}_{n_{\text {B}}}(z \vert x) + {\hat{G}}_{n_{\text {A}}} (y \vert x) - 1, \ 0 \right] , \\ \sum _{x} {\hat{p}}(x) \ \text {min} \left[ {\hat{H}}_{n_{\text {B}}}(z \vert x), {\hat{G}}_{n_{\text {A}}} (y \vert x) \right] \bigg \}, \end{aligned} \end{aligned}$$
(5)

where \({\hat{p}}(x) = \left( \frac{n_{\textrm{A},x} + n_{\textrm{B},x}}{n_{\textrm{A}} + n_{\textrm{B}}} \right) \) is an estimate of \(\textrm{P}(X = x)\).
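A minimal sketch of the plug-in estimators in Eqs. 4 and 5, assuming two simulated samples A (observing X and Y) and B (observing X and Z) with a discrete matching variable X; the data-generating choices are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical samples: A observes (X, Y), B observes (X, Z); X is the matching variable.
n_A, n_B = 400, 500
x_A = rng.integers(0, 3, n_A)
y_A = rng.normal(loc=x_A, scale=1.0)
x_B = rng.integers(0, 3, n_B)
z_B = rng.normal(loc=2 * x_B, scale=1.5)

def ecdf(sample, t):
    """Empirical distribution function of `sample` evaluated at t."""
    return np.mean(sample <= t)

def conditional_bounds(y, z, x):
    """Plug-in estimate of the conditional Frechet bounds (Eq. 4) at (y, z) given X = x."""
    G_hat = ecdf(y_A[x_A == x], y)   # \hat G_{n_A}(y | x)
    H_hat = ecdf(z_B[x_B == x], z)   # \hat H_{n_B}(z | x)
    return max(G_hat + H_hat - 1.0, 0.0), min(G_hat, H_hat)

def unconditional_bounds(y, z):
    """Plug-in estimate of the unconditional Frechet bounds (Eq. 5) at (y, z)."""
    low, up = 0.0, 0.0
    for x in np.union1d(x_A, x_B):
        # \hat p(x) = (n_{A,x} + n_{B,x}) / (n_A + n_B)
        p_x = (np.sum(x_A == x) + np.sum(x_B == x)) / (n_A + n_B)
        l_x, u_x = conditional_bounds(y, z, x)
        low += p_x * l_x
        up += p_x * u_x
    return low, up

print(conditional_bounds(y=1.0, z=2.0, x=1))
print(unconditional_bounds(y=1.0, z=2.0))
```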

Conti et al. (2017b) built a confidence region for the estimator of the Fréchet class in Eq. 4 and, from it, by exploiting the Kolmogorov–Smirnov (KS) statistic, derived confidence bands for \(G(y \vert x)\) and \(H(z \vert x)\), which are given by

$$\begin{aligned} {\mathcal {G}}_{n_{\textrm{A}}, x}&= \left( {\hat{G}}_{n_{\textrm{A}}} (y \vert x) - \frac{k_{\alpha }}{\sqrt{{n_{\textrm{A}, x}}}}, \ {\hat{G}}_{n_{\textrm{A}}} (y \vert x) + \frac{k_{\alpha }}{\sqrt{{n_{\textrm{A}, x}}}}; \ y \in {\mathbb {R}} \right) , \nonumber \\ {\mathcal {H}}_{n_{\textrm{B}}, x}&= \left( {\hat{H}}_{n_{\textrm{B}}} (z \vert x) - \frac{k_{\alpha }}{\sqrt{{n_{\textrm{B}, x}}}}, \ {\hat{H}}_{n_{\textrm{B}}} (z \vert x) + \frac{k_{\alpha }}{\sqrt{{n_{\textrm{B}, x}}}}; \ z \in {\mathbb {R}} \right) , \end{aligned}$$
(6)

respectively, with \(k_{\alpha }\) being the \(1-\alpha \) quantile of the KS distribution. By defining

$$\begin{aligned} \mathcal {\hat{{\underline{S}}}} (y, z \vert x)&= \text {max} \left\{ {\hat{H}}_{n_{\textrm{B}}} (z \vert x) - \frac{k_{\alpha }}{\sqrt{{n_{\textrm{B}, x}}}} + {\hat{G}}_{n_{\textrm{A}}} (y \vert x) - \frac{k_{\alpha }}{\sqrt{{n_{\textrm{A}, x}}}} - 1, \ 0 \right\} , \nonumber \\ \mathcal {\hat{{\overline{S}}}} (y, z \vert x)&= \text {min} \left\{ {\hat{H}}_{n_{\textrm{B}}} (z \vert x) + \frac{k_{\alpha }}{\sqrt{{n_{\textrm{B}, x}}}}, \ {\hat{G}}_{n_{\textrm{A}}} (y \vert x) + \frac{k_{\alpha }}{\sqrt{{n_{\textrm{A}, x}}}} \right\} , \end{aligned}$$
(7)

we end up with a confidence region for the Fréchet class given by \({\mathcal {S}}^x_n = \left\{ S(y, z \vert x): \mathcal {\hat{{\underline{S}}}} (y, z \vert x) \le S (y, z \vert x) \le \mathcal {\hat{{\overline{S}}}} (y, z \vert x) \right\} \). The measure of pointwise uncertainty is given by the length (i.e., \(\text {U} - \text {L}\)) of the interval \(\left\{ \text {L} \left[ G(y \vert x), H(z \vert x) \right] , \ \text {U} \left[ G(y \vert x), H(z \vert x) \right] \right\} \). Also, Conti et al. (2017b) summarized the pointwise measures of uncertainty (one for every triple \((x, y, z)\)) into a single measure of average length. Indeed, they define a weight function on \({\mathbb {R}}^3\), \(T(x, y, z)\), and compute \( \int _{{\mathbb {R}}^3}^{} \left\{ \text {U} \left[ G(y \vert x), H(z \vert x) \right] - \text {L} \left[ G(y \vert x), H(z \vert x) \right] \right\} dT(x, y, z) \). Therefore, taking \( dT(x, y, z) = dQ(x) \ dG(y \vert x) \ dH(z \vert x) \), the overall measure is given by

$$\begin{aligned} \begin{aligned} \int _{{\mathbb {R}}}^{} \Biggl \{ \int _{{\mathbb {R}}^2}^{} \left\{ \text {U} \left[ G(y \vert x), H(z \vert x) \right] - \text {L} \left[ G(y \vert x), H(z \vert x) \right] \right\} dG(y \vert x) \ dH(z \vert x) \Biggr \} dQ(x). \end{aligned} \end{aligned}$$

The main finding related to such an uncertainty measure is that this intrinsic uncertainty (when no external auxiliary information is available) depends neither on the support nor on the marginal distributions of \((Y, Z)\), i.e., on \(G(y \vert x)\) and \(H(z \vert x)\). Indeed, independently of the sample data, Conti et al. (2017b) proved that the maximal uncertainty is 1/6. In contrast, such uncertainty can be reduced, when auxiliary information is available, by imposing logical constraints.
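A short calculation makes the 1/6 result concrete: substituting \(u = G(y \vert x)\) and \(v = H(z \vert x)\) (both ranging over \([0, 1]\) for continuous margins), the average width of the pointwise interval under the weight \(dT(x, y, z) = dQ(x) \ dG(y \vert x) \ dH(z \vert x)\) reduces, for every x, to

$$\begin{aligned} \int _{0}^{1} \! \int _{0}^{1} \left[ \text {min} (u, v) - \text {max} (u + v - 1, \ 0) \right] du \ dv = \frac{1}{3} - \frac{1}{6} = \frac{1}{6}, \end{aligned}$$

which indeed involves neither the margins nor their supports.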

Ruling out logically impossible occurrences in the set of parameters has practical relevance in real data applications. There are at least two typical cases: either the existence of some combination of values is doubtful, or its plausibility is expressed through an inequality. For example, the occurrence of an eight-year-old employee clearly belongs to the first case. For the second case, consider requiring the probability of being a casual worker with a diploma to be higher than the probability of being a manager without a degree. If “logical constraints naturally arise from applications” (Vantaggi 2008, p. 710), constraints must be properly managed. Indeed, by re-adapting the probability theory of de Finetti (1974), Vantaggi (2008) proposed to exploit coherent conditional probability for combining data from different sources without necessarily adopting strong assumptions on the relationships among \((X, Y, Z)\). Furthermore, logical constraints can be incorporated but, when they are not present, the author proves that the conditional assessment can still be coherent even if conditional independence has to be assumed. To date, in the case of the same population and the same sampling scheme, the proposal of Vantaggi (2008) is the only one that exploits coherent conditional probability for integrating data (both the case of two sources and that of multiple sources are considered).

While Vantaggi (2008) proposed a setting for reducing incoherences based on MLEs, further developments towards the reduction of incoherences based on distance minimization are proposed by Brozzi et al. (2012). They suggest specific adjustments which, by targeting weighted sub-domains of the parameter space from which the incoherences must be removed, prove to perform better than the original coherent assessment of Vantaggi (2008).

A peculiar issue is tackled by Di Zio and Vantaggi (2017) in relation to the partial identification problem when the matching variables are misclassified. Disregarding the effect of the sampling variability of the estimates on the identification regions, the authors evaluate different misclassification scenarios. Dealing with categorical variables, they describe the partially identified region (i.e., the class of probabilities which extend the conditional probabilities obtained from the information available in the different sources) by means of lower and upper bounds on the consistent probabilities. When the common variable(s) are misreported in only one of the two data sets to be matched, the authors demonstrate that the potential consistency of the distributions increases thanks to assumptions on the misclassification mechanism. In other words, it is possible to refine the identified region by means of such assumptions about the misclassification of the matching variables.

How much the integration uncertainty affects the quality of the synthetic (complete) data set is investigated by Conti et al. (2016, 2017a), taking into account a stratified sampling design. In the first work, the authors propose a specific measure of the ‘error’ introduced by matching, estimating the distribution function for the variables not jointly observed as well as the corresponding measure of error (upper bounds of which are also introduced). If a class of plausible distributions for \((X, Y, Z)\), conditional or unconditional on the matching variable, can be identified, the size of this class defines the measure of uncertainty. The authors prove that the difference between the admissible distributions in such a class (that is, a constrained Fréchet class) and the chosen matching distribution essentially gives the error of the matching procedure. The latter is estimable using iterative proportional fitting, providing the maximal error that can occur in choosing a distribution from such a class, i.e., in drawing a surrogate of the true but unknown \(f(X, Y, Z)\).
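A minimal sketch of iterative proportional fitting, assuming hypothetical margins for \((Y, Z)\) given a fixed category of the matching variable and a seed table encoding a structural zero; this illustrates the general algorithm, not the exact procedure of Conti et al. (2016):

```python
import numpy as np

def ipf(seed, row_targets, col_targets, n_iter=100, tol=1e-10):
    """Iterative proportional fitting: rescale a seed table so that its row and
    column margins match the targets. Zero cells (e.g. structural zeros) are
    preserved; the seed is assumed to have no all-zero rows or columns."""
    table = seed.astype(float).copy()
    for _ in range(n_iter):
        table *= row_targets[:, None] / table.sum(axis=1, keepdims=True)  # fit rows
        table *= col_targets[None, :] / table.sum(axis=0, keepdims=True)  # fit columns
        if (np.abs(table.sum(axis=1) - row_targets).max() < tol and
                np.abs(table.sum(axis=0) - col_targets).max() < tol):
            break
    return table

# Hypothetical margins for (Y, Z) given a fixed X = x; the seed encodes a
# structural zero in cell (0, 2).
row_targets = np.array([0.5, 0.3, 0.2])
col_targets = np.array([0.4, 0.4, 0.2])
seed = np.array([[1.0, 1.0, 0.0],
                 [1.0, 1.0, 1.0],
                 [1.0, 1.0, 1.0]])
print(ipf(seed, row_targets, col_targets).round(3))
```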

5.2 Sampling frameworks and Big Data

The integration approaches discussed so far mainly considered probability samples. Nevertheless, due to (1) the increasing rates of non-response, (2) the actual costs of data collection and, (3) the potential offered by Big Data, the trade-off between data quality and the resources needed has prompted the investigation of other opportunities, e.g., non-probability samples, which represent, to date, the most profitable solution for data integration (Lohr and Raghunathan 2017). Relevant examples are web surveys, social media data, mobile phone records, and web crawling software data. Rivers (2007) considered web survey data, proposing a nearest neighbour mass imputation approach that trains a predictive model of Y given X on the non-probability sample (e.g., a web panel) and uses it to predict the distribution \(f(Y \vert X)\) for the probability sample, i.e., a conventional random sample from a population frame. This idea aims to tackle the non-response problems in probability-based surveys: individuals selected from the sampling frame (which covers the target population and contains some auxiliary variables) do not have to answer the questionnaire directly. Instead, each selected individual is matched to a panel member (the panel also contains the aforementioned set of auxiliary variables) who mimics him or her and who is then asked to complete the questionnaire. This sort of ‘sample matching’ is further investigated by Bethlehem (2016), who explores the conditions under which it works most efficiently. The author points out that such an imputation approach depends on the capacity of the auxiliary variables to fully explain the participation behaviour: the more this holds, the larger the removal of non-response bias.
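A minimal sketch of nearest-neighbour mass imputation, with simulated data standing in for a probability sample (observing only the auxiliary variables) and a web panel (observing the auxiliary variables and Y); the distance and data-generating choices are illustrative assumptions, not those of Rivers (2007):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: a probability sample with auxiliary variables X only,
# and a non-probability panel with both X and the survey variable Y.
X_prob = rng.normal(size=(200, 3))    # probability sample (recipients)
X_panel = rng.normal(size=(2000, 3))  # web panel (donors)
y_panel = X_panel @ np.array([1.0, -0.5, 2.0]) + rng.normal(size=2000)

def nearest_neighbour_impute(X_rec, X_don, y_don):
    """Donate to each recipient the Y value of its closest donor in the
    space of the common auxiliary variables (Euclidean distance)."""
    y_imputed = np.empty(len(X_rec))
    for i, x in enumerate(X_rec):
        distances = np.linalg.norm(X_don - x, axis=1)
        y_imputed[i] = y_don[np.argmin(distances)]
    return y_imputed

y_hat = nearest_neighbour_impute(X_prob, X_panel, y_panel)
print("imputed mean of Y in the probability sample:", y_hat.mean().round(3))
```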

Alternatively, the ‘propensity to respond’ as a function of the covariates \(\varvec{\textrm{x}}\) is estimated for the non-probability sample and then used to weight the non-probability data. By adapting the approach of Lee (2006), Castro-Martín et al. (2022) estimate the individual propensity to participate in the non-probability sample by considering the hypothetical scenario of how the sample would have looked had it been drawn under a probability sampling design. Selection bias reduction benefits from the training method proposed by the authors, which gives “more importance in the prediction to the individuals who are more likely to appear in the population” (Castro-Martín et al. 2022, p. 17). Residual limitations remain: wider replicability of the results (different data sets, more scenarios, etc.) is still envisaged, additional prediction algorithms could be considered, and the theoretical properties must be further developed.
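A minimal sketch of propensity-score weighting for a non-probability sample, stacking it with a probability reference sample and fitting a logistic model for the participation indicator; the data and the simple inverse-propensity weighting are illustrative assumptions, not the exact training method of Castro-Martín et al. (2022):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Hypothetical covariates: a probability reference sample and a volunteer
# (non-probability) sample whose composition depends on the covariates.
X_prob = rng.normal(size=(500, 2))
X_nonprob = rng.normal(loc=0.5, size=(800, 2))
y_nonprob = X_nonprob @ np.array([1.0, 2.0]) + rng.normal(size=800)

# Stack the two samples and flag membership in the non-probability sample.
X_all = np.vstack([X_prob, X_nonprob])
in_nonprob = np.concatenate([np.zeros(len(X_prob)), np.ones(len(X_nonprob))])

# Estimate the propensity of appearing in the non-probability sample
# as a function of the covariates.
model = LogisticRegression().fit(X_all, in_nonprob)
propensity = model.predict_proba(X_nonprob)[:, 1]

# Inverse-propensity weights for the non-probability units, normalised to sum to 1.
weights = 1.0 / propensity
weights /= weights.sum()

print("unweighted mean of Y:", y_nonprob.mean().round(3))
print("propensity-weighted mean of Y:", np.sum(weights * y_nonprob).round(3))
```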

A relevant challenge is that, when combining Big Data from different sources (e.g., when incorporating large survey data), the promising matching-based imputation essentially relies on the MAR assumption. For example, this happens in Chen et al. (2020a), who propose a weighting adjustment based on parametric model assumptions on the selection mechanism of the non-probability sample, further extended by Chen et al. (2020b) to the non-parametric framework. Kim et al. (2021) go beyond MAR by proposing a sampling mechanism for Big Data that allows us to consider systematic differences among the samples even after adjusting for the covariates. The probability sample is then used to estimate the missing data, correcting for the under-coverage bias of Big Data (treated as an incomplete sampling frame for the finite population).

6 Conclusions (and the world beyond)

Recently, Statistical Matching has been (re)gaining attention within the OS, due to the role played by Big Data but, also, because of the increasing pressure on data providers to produce more detailed and precise information in the shortest possible time (de Waal 2015). While Multiple Imputation can be used when the missing information is partially present in a single data set, and probabilistic Record Linkage deals with the absence or misreporting of unique identifiers for the units observed in different files (which, in turn, must not be affected by incompleteness), statistical matching (which is closely related to these methods) offers the possibility of dealing with variables that are never jointly observed in two or more data sets. This feature is a strength for addressing many practical challenges in a world that is more and more characterised by multiple potential sources and tools for data collection and information sharing.

A relevant proposal aiming to shrink the gap between RL and SM is offered by Gessendorfer et al. (2018), who use SM as a supplement to RL when, for non-consenting individuals observed in, e.g., ad hoc surveys, the information collected on them in administrative data cannot be combined with that of the surveys. In such a peculiar context of missing information, the proposal is to use SM to provide the values of the variables for the individuals who refused to give their consent to the linkage and, hence, to integrate information that could not be integrated otherwise. However, the authors stressed that matching the non-consenting individuals prior to linking the observations can produce conflicting results, hinting at problems potentially worse than simply ignoring the lack of consent.

Considering high-dimensional problems, Ahfock et al. (2016) deal with multivariate \((\varvec{\textrm{x}}, \varvec{\textrm{y}}, \varvec{\textrm{z}})\), aiming to identify the parameters characterizing the joint distribution function \(f(\varvec{\textrm{x}}, \varvec{\textrm{y}}, \varvec{\textrm{z}})\). They propose to draw values from the identified set of parameters, such that the range of sampled values offers a measure of uncertainty for the partially identified parameters (i.e., the ones requiring a joint observation of \(\varvec{\textrm{y}}\) and \(\varvec{\textrm{z}}\)). The proposed solution consists of a Gibbs sampler-based approach for estimating a set of positive-definite completions of a partially specified covariance matrix, and it is a strategy generalizable to real-world data problems involving multivariate normal, skew-normal, and normal mixture models. Comparing the results from both a simulation study and real data with those generated by a Bayesian approach, Ahfock et al. (2016) offer evidence that their frequentist sampling method largely outperforms the Bayesian one in providing correlation estimates in the neighbourhood of the true, observed one. In addition, the method shows flexibility and remarkable computational speed.
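As an illustration of the identification problem in the trivariate normal case (a simplified setting, not the authors' Gibbs sampler), the unidentified correlation \(\rho _{YZ}\) must keep the correlation matrix of \((X, Y, Z)\) positive semi-definite, which yields explicit bounds given the identified \(\rho _{XY}\) and \(\rho _{XZ}\):

```python
import numpy as np

def rho_yz_bounds(rho_xy, rho_xz):
    """Range of the unidentified correlation rho_YZ such that the 3x3
    correlation matrix of (X, Y, Z) remains positive semi-definite."""
    centre = rho_xy * rho_xz
    half_width = np.sqrt((1 - rho_xy**2) * (1 - rho_xz**2))
    return centre - half_width, centre + half_width

# Identified correlations, estimable from A (X, Y) and B (X, Z); values are hypothetical.
rho_xy, rho_xz = 0.7, 0.5

low, high = rho_yz_bounds(rho_xy, rho_xz)
print("admissible range for rho_YZ:", (round(low, 3), round(high, 3)))
print("value implied by the CIA:", round(rho_xy * rho_xz, 3))

# A crude illustration of sampling within the identified set
# (not the authors' Gibbs sampler): draw candidate completions uniformly.
rng = np.random.default_rng(3)
draws = rng.uniform(low, high, size=1000)
print("range of the sampled completions:", (round(draws.min(), 3), round(draws.max(), 3)))
```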

Beyond the role potentially played by Big Data in the integration context, the near future of SM is linked to several theoretical challenges. While parametric SM has been extensively analysed, non-parametric SM leaves some challenges unsolved, related, for example, to the optimality of the distance function to be used with distance-based hot deck techniques, to the ‘size’ of the donation classes, and to how much these elements may affect the variance estimates in the synthetic (complete) data set.

Another pending issue is related to the use of survey weights and of the sampling designs used to build the different data sets at hand. Marella and Pfeffermann (2019) recently proposed a solution for combining information collected by different samples. Under informative sampling designs, the uncertainty of the SM results is compared to the one generated by matching under a ‘blind’ CIA, or, in other words, by ignoring the informative sampling mechanisms. The simulation study shows that, when the sample selection process and its effects are ignored, the predictions on X, as well as those on Y and Z, are negatively impacted. Hence, the synthetic (complete) data set generated differs from the underlying population distribution \(f(X, Y, Z)\), thus producing bias. This can happen even though the estimates obtained by ignoring the sample selection effects may show a smaller variance.

Conti et al. (2019) stress that SM applications are not very common because data are obtained by means of different complex survey designs which, in turn, prevent the straightforward reconciliation of information. However, the authors suggest that, just as ecological inference made it possible to draw conclusions at the individual level starting from aggregated data, SM could likewise be re-directed towards drawing inference by using the matching variables, which can be thought of as a sort of ‘grouping’. Considering that no solution has been shared between these two fields, the authors hint at exploring this possibility. In addition, further developments could target a hidden incompleteness of the measure of uncertainty that the authors proposed. Indeed, while their proposal is very useful for capturing the ‘uncertainty of the data sets’, it somehow falls short of measuring the ‘quality of the matching’. An indicator of such SM quality may be relevant for future research.

The inclusion of the sampling weights in the matching procedure and, more generally, the consideration of the characteristics of the samples to be aggregated is of particular relevance, since only two other works treated such challenges: Rubin (1986) and Renssen (1998). The former proposed to compute new sampling weights for the units of the A \(\cup \) B supersample. This idea found scarce applicability in practice, because the inclusion probabilities of the units in sample A under the sampling design of B are not known. The latter proposed to calibrate the actual weights of the distinct A and B samples to the common information and, hence, to obtain distributions that are compatible with the margins of \((X, Y)\) and \((X, Z)\). However, D’Orazio (2009) demonstrated that the two proposals lead to very similar results.
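A minimal post-stratification sketch, adjusting hypothetical design weights of one sample so that the weighted distribution of the common variable X matches target proportions; this is a simple special case of the calibration idea, not Renssen's (1998) full procedure:

```python
import numpy as np

def poststratify(weights, x, target_props):
    """Adjust design weights so that the weighted distribution of the common
    categorical variable X matches the target proportions (post-stratification,
    a basic special case of calibration). Assumes every target category occurs
    in the sample."""
    w = weights.astype(float).copy()
    total = w.sum()
    for category, target in target_props.items():
        mask = (x == category)
        w[mask] *= (target * total) / w[mask].sum()
    return w

rng = np.random.default_rng(4)

# Hypothetical sample A: design weights and an observed common variable X.
x_A = rng.integers(0, 3, size=300)
w_A = rng.uniform(0.5, 2.0, size=300)

# Target distribution of X (e.g., estimated from the pooled A and B samples).
targets = {0: 0.5, 1: 0.3, 2: 0.2}

w_cal = poststratify(w_A, x_A, targets)
for c in targets:
    print(c, round(w_cal[x_A == c].sum() / w_cal.sum(), 3))
```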

Practically speaking, the main future improvement to take into account is related to auxiliary information. Which kind of auxiliary variables should be used, and how, to obtain sufficiently accurate statistics from the integrated data? Which kind of information from an additional source can be more profitably used in integrating data: knowledge about population totals, or knowledge about the relationship(s) among the variables at hand? Furthermore, would it be beneficial to recursively conduct ad hoc surveys to obtain information on a subset of the variables of interest from different data sources? How, then, would it be possible to assess the quality of the inference drawn from the integrated data, when the latter are not available in complete form in practice? In other words, how can this quality be assessed when we are the users carrying out the integration? These questions come together with the need for additional simulation studies investigating how different parameter specifications and dependence structures affect imputation performance. Moreover, real-data applications should be addressed to set up straightforward data quality criteria for the matched data sets.