Abstract
Undercoverage and nonresponse problems are jointly present in most socioeconomic surveys. The purpose of this paper is to propose an estimation strategy that accounts for both problems by performing a two-step calibration. The first calibration exploits a set of auxiliary variables only available for the units in the sampled population to account for nonresponse. The second calibration exploits a different set of auxiliary variables available for the whole population, to account for undercoverage. The two calibrations are then unified in a double-calibration estimator. Mean and variance of the estimator are derived up to the first order of approximation. Conditions ensuring approximate unbiasedness are derived and discussed. The strategy is empirically checked by a simulation study performed on a set of artificial populations. A case study is derived from the European Union Statistics on Income and Living Conditions survey data. The strategy proposed is flexible and suitable in most situations in which both undercoverage and nonresponse are present.
1 Introduction
Särndal et al. (1992, p. 8) establish four requirements to select a probability sample, setting the perimeter for the definition of a sampling design under the randomization principle. One requirement is that the procedure to select the sample ensure invariably positive probabilities to enter the sample for all units in the population.
This requirement may not be suitable in some situations such as in establishment surveys, such as the Economic Census conducted by the U.S. Census Bureau, in which the population of businesses is characterized by a highly skewed distribution in the survey variables (Glasser 1962). In this case, different approaches are commonly used, essentially based on the partition of population into strata determined by several business characteristics (e.g. size), and some strata are completely censused, some are sampled, and some are neglected, based on the features of units or the ability to contact them (Sigman and Monsour 1995). As happens in establishment surveys conducted by the U.S. Bureau of Economic Analysis, very small establishments are excluded a priori from the population to be sampled, due to the costs in building and updating a sampling frame, against an expected slight gain in efficiency of the estimators (see e.g. Hidiroglou 1986; De Haan et al. 1999; Rivest 2002). These instances are known in the literature as cutoff sampling (Knaub 2008; Benedetti et al. 2010; Haziza et al. 2010a). A similar position can be seen in social surveys on households, such as the Household Finance and Consumption Survey managed by the European Central Bank, characterized by the missed observation of population units considered ineligible for the survey, i.e. dwellings that are vacant, not habitable, with noneligible members, etc., with consequences on the estimation of living conditions and poverty rate (Nicoletti et al. 2011). In this framework it is worth distinguishing between cutoff sampling, alternatively referred to as planned undercoverage, which is often used in socioeconomic surveys and unplanned undercoverage which is typical in social surveys. In the first case, auxiliary information is available for all the units in the noncovered portion of the population, whereas only population totals are available in the second case (see e.g. Lehtonen and Veijanen 2009). 
Owing to the aforementioned undercoverage of the whole population, the unadjusted estimator is biased in these situations. Bias is usually corrected in the literature by means of model-based techniques (see, among others, Kott 2006; Haziza et al. 2010a). Recently, a solution to the undercoverage problem has been proposed by Fattorini et al. (2018) in which the properties of the resulting estimator are evaluated in relation to the sampling design while all the population characteristics are held fixed. In particular, the authors propose adopting a calibration technique in which the weights originally attributed to each sample observation are modified in such a way as to be able to estimate the population totals of a set of auxiliary variables without error. The rationale behind calibration is evident: if the calibrated weights reproduce the population totals of the auxiliary variables without error, they should also be suitable for estimating the total of the survey variable, provided a relationship exists between the survey variable and the auxiliaries. Obviously, calibration is likely to perform well in terms of precision under a strong linear relationship.
Socioeconomic surveys also involve unit nonresponse, the more so the higher the sensitivity of the survey variables (e.g. sexual behavior, drug consumption, etc.). However undesirable, nonresponse is a natural contingency in surveys, so the damage to estimations and inferences needs to be addressed (Groves and Peytcheva 2008). This is crucial in survey sampling theory and is extensively treated in the literature (e.g. Brick and Montaquila 2009). Extensively applied methods include poststratification (Holt and Smith 1979), response homogeneity groups (Särndal et al. 1992), and, more recently, model-based techniques including imputation and nonresponse propensity weighting (Särndal and Lundström 2005; Haziza et al. 2010b). In particular, nonresponse propensity weighting assumes that each unit of the sampled population has a strictly positive probability of responding. A model is then used to estimate the response probabilities of the respondent units in the sample by connecting these probabilities to auxiliary information by means of logistic regression models (Chang and Kott 2008). In addition to this source of uncertainty, the requirement of positive response probability seems too restrictive in socioeconomic surveys, because some units will not respond in any situation (e.g. homeless and geographically mobile individuals and families). Alternatively, Fattorini et al. (2013) attempt a design-based solution in which population values and nonresponse are viewed as fixed characteristics. For this purpose, they once again use the calibration technique, defined in the literature as nonresponse calibration weighting by Haziza et al. (2010b). In this case, weights originally attributed to each respondent unit are modified in such a way as to be able to estimate the population totals of a set of auxiliary variables without error.
In most cases undercoverage and nonresponse problems are jointly present in socioeconomic surveys. Therefore, a general indication in the treatment of both problems concerns the use of any available auxiliary information, even if some is not available to all units of the population. In this paper, we build on the availability of a set of auxiliary variables for the whole population while another set is available only for the sampled portion. In establishment surveys, for example, much financial information may be available only for businesses of adequate size, such as corporations, and may not be for small businesses excluded from the sampling, such as microenterprises. Moreover, owing to recent data collection developments, the additional information may derive from big data, e.g. data from internet and telephone use, social networks, online purchases, etc.
The purpose of this paper is to propose double-calibration estimators. The use of calibration in two or more steps is not new and has already been used, among others, by Folsom and Singh (2000) and Estevao and Särndal (2006). Moreover, it has been routinely adopted by National Statistical Offices for many years. Here we propose an estimation strategy that considers both undercoverage and nonresponse problems, solving them by performing double calibration. The first calibration exploits a set of auxiliary variables available only for the units in the sampled population to account for nonresponse; the second calibration exploits a different set of auxiliary variables available for the whole population, to account for undercoverage. Joining together the two calibrations, we propose a double-calibration estimator that is applicable to all cases in which both undercoverage and nonresponse problems are present.
The paper is structured as follows. In Sect. 2, some preliminaries and notation are given. Section 3 is devoted to the construction of the double-calibration estimator and in Sect. 4 some statistical properties (expectation and variance) are derived. In order to check the efficiency of the strategy, in Sect. 5 Monte Carlo simulation studies are performed to explore several scenarios. In Sect. 6, using data from the European Union Statistics on Income and Living Conditions survey and from Statistics Denmark, a case study to estimate the total income of Danish households in 2013 is presented and discussed. Some concluding remarks are given in Sect. 7.
2 Preliminaries and notation
Denote by \(U=\left\{ u_{1},...,u_{N}\right\}\) a finite population of N units. Let \(y_{j}\), with \(j\in U\), be the value for unit j of the survey variable Y. We aim to estimate the population total \(T_{Y}=\sum _{j\in U}y_{j}\). For the whole population there exists a vector \({\varvec{Z}}\) of M auxiliary variables whose values \(\varvec{{\varvec{z}}}_{j}=\left[ z_{j1},...,z_{jM}\right] ^{t}\) are known for each \(j\in U\), in such a way that the vector of totals \({\varvec{T}}_{Z}=\sum _{j\in U}{\varvec{z}}_{j}\) is also known.
In this setting, for one of the reasons mentioned in the introduction, only a subpopulation \(U_{B}\) of size \(N_{B}<N\) units is sampled using a fixed-size design having first- and second-order inclusion probabilities \(\pi _{j},\pi _{jh}\) for any \(h>j\in U_{B}\). Denote by \(T_{Y(B)}=\sum _{j\in U_{B}}y_{j}\) the unknown total of Y in \(U_{B}\). Moreover, suppose that additional information exists in the subpopulation \(U_{B}\). More precisely, suppose that there exists a vector \({\varvec{X}}\) of K auxiliary variables whose values \({\varvec{x}}_{j}=\left[ x_{j1},...,x_{jK}\right] ^{t}\) are known for each \(j\in U_{B}\) in such a way that the vector of totals \({\varvec{T}}_{X(B)}=\sum _{j\in U_{B}}{\varvec{x}}_{j}\) is also known. In this setting, denote by \({\varvec{T}}_{Z(B)}=\sum _{j\in U_{B}}\varvec{{\varvec{z}}}_{j}\) the known vector of totals of the \({\varvec{z}}_{j}\)s in the subpopulation \(U_{B}\).
A random sample S of \(n<N_{B}\) units is selected from the subpopulation \(U_{B}\) by means of the adopted sampling scheme. As often happens in practice, especially in socioeconomic surveys, the sample may be affected by nonresponses, in such a way that the sample is split into two subsamples, the subsample \(R\subset S\) of the respondent units and the subsample \(S-R\) of the nonrespondent units.
The setting presented above poses two problems: first, a correction for nonresponses is necessary in order to estimate \(T_{Y(B)}\); second, since the sample S is selected from \(U_{B}\) and not from U, any estimator of \(T_{Y(B)}\) is biased for \(T_{Y}\), so a further correction is needed in order to estimate \(T_{Y}\). We propose a calibration in two steps, developed in the following subsections.
3 The double-calibration estimator
3.1 First calibration: from respondent group to sampled subpopulation
The first issue to deal with is the nonresponse problem in a sample. Since S is selected in \(U_{B}\), in the absence of nonresponses, it would be possible to estimate \(T_{Y(B)}\) by means of the well-known Horvitz–Thompson (HT) estimator
\[{\hat{T}}_{Y(B)}=\sum _{j\in S}\frac{y_{j}}{\pi _{j}}\qquad (1)\]
and \({\hat{T}}_{Y(B)}\) would be an unbiased estimator for \(T_{Y(B)}\) if all \(\pi _{j}\) are positive. However, owing to nonresponses, any unadjusted estimator is destined to be a biased estimator of \(T_{Y(B)}\). Following results obtained in Särndal and Lundström (2005), the bias may be reduced by exploiting the \({\varvec{X}}\)-vector of auxiliary information. The resulting estimator is
where \(\hat{{\varvec{b}}}_{R}=\hat{{\varvec{A}}}_{R}^{-1}\hat{{\varvec{a}}}_{R}\) is the least-squares coefficient vector of the regression of Y vs \({\varvec{X}}\), performed on the respondent sample R, i.e. \(\hat{{\varvec{A}}}_{R}=\sum _{j\in R}\frac{{\varvec{x}}_{j}{\varvec{x}}_{j}^{t}}{\pi _{j}}\) and \(\hat{{\varvec{a}}}_{R}=\sum _{j\in R}\frac{y_{j}{\varvec{x}}_{j}}{\pi _{j}}\), and the unit constant is tacitly adopted as the first auxiliary variable in the vector \({\varvec{X}}\).
The properties of \({\hat{T}}_{Y(B)cal}\) are derived in Fattorini et al. (2013). The population is partitioned into respondent and nonrespondent strata and the estimator is approximately unbiased if the relationship between Y and \({\varvec{X}}\) is similar in both strata. Practically speaking, this condition is similar to the one assumed in most model-based nonresponse treatments (for a discussion, see Haziza and Lesage 2016).
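The first calibration can be illustrated numerically. The sketch below is a minimal illustration and not the authors' code: it assumes that the calibration estimator takes the regression form \(\hat{{\varvec{b}}}_{R}^{t}{\varvec{T}}_{X(B)}\), which is the usual expression when the unit constant is the first column of \({\varvec{X}}\). The function and argument names are hypothetical.

```python
import numpy as np


def ht_total(y, pi):
    """Horvitz-Thompson total: sum of y_j / pi_j over the observed units."""
    return float(np.sum(np.asarray(y, float) / np.asarray(pi, float)))


def nonresponse_calibration_total(y_r, x_r, pi_r, t_x_b):
    """Calibration estimate of T_Y(B) using respondents only.

    y_r   : (n_R,) survey values of the respondents
    x_r   : (n_R, K) auxiliary values, first column equal to 1
    pi_r  : (n_R,) first-order inclusion probabilities
    t_x_b : (K,) known totals of X in the sampled subpopulation U_B

    Assumed form (an assumption of this sketch): b_R' T_X(B), with
    b_R = A_R^{-1} a_R the design-weighted least-squares coefficients.
    """
    w = 1.0 / np.asarray(pi_r, float)
    a_mat = (x_r * w[:, None]).T @ x_r      # A_R = sum x x' / pi
    a_vec = x_r.T @ (w * y_r)               # a_R = sum y x / pi
    b_hat = np.linalg.solve(a_mat, a_vec)   # b_R without forming the inverse
    return float(b_hat @ t_x_b), b_hat
```

When Y is an exact linear function of \({\varvec{X}}\), the coefficients are recovered exactly and the estimate equals the true total whatever the respondent set, which is the intuition behind the unbiasedness condition discussed above.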
3.2 Second calibration: from sampled subpopulation to the whole population
Because \({\hat{T}}_{Y(B)cal}\) is, at most, an approximately unbiased estimator of \(T_{Y(B)}\), it is a biased estimator of \(T_{Y}\). Indeed, the sampling scheme adopted to select S generates a sampling design onto \(U_{B}\) but not onto U, and units of \(U-U_{B}\) cannot enter the sample. Therefore, the missed selection of some population units leads to a bias due to population undercoverage and it is necessary to correct the estimator \({\hat{T}}_{Y(B)cal}\).
Fattorini et al. (2018) called these schemes pseudo designs and proposed a design-based calibration estimation based on a single auxiliary variable having a proportional relationship with the survey variable. In order to extend this approach to vectors of auxiliary variables and to more general linear relationships, the population undercoverage is handled by the calibration criterion proposed by Särndal and Lundström (2005). Specifically, if the \(y_{j}\)s were available for each \(j\in S\), the information furnished by the M auxiliary variables \({\varvec{Z}}\), available for all the population units, could be exploited by means of the calibration estimator
where \(\hat{{\varvec{d}}_{B}}=\hat{{\varvec{C}}}_{B}^{-1}\hat{{\varvec{c}}}_{B}\) is the least-squares coefficient vector of the regression of Y vs \({\varvec{Z}}\), performed on the whole sample S, i.e. \(\hat{{\varvec{C}}}_{B}=\sum _{j\in S}\frac{\varvec{{\varvec{z}}}_{j}\varvec{{\varvec{z}}}_{j}^{t}}{\pi _{j}}\) and \(\hat{{\varvec{c}}}_{B}=\sum _{j\in S}\frac{y_{j}\varvec{{\varvec{z}}}_{j}}{\pi _{j}}\).
If we suppose once again that the unit constant is adopted as the first auxiliary variable in the vector \({\varvec{Z}}\), then the calibration estimator (3) could be rewritten as
where \(\hat{{\varvec{T}}}_{Z(B)}=\sum _{j\in S}\frac{\varvec{{\varvec{z}}}_{j}}{\pi _{j}}\) is the HT estimator of the totals of the \({\varvec{z}}_{j}\)s in the sampled subpopulation \(U_{B}\) (see Appendix A.1 for the proof).
However, the estimator \({\hat{T}}_{Y(cal)}\) is only virtual: because the values of the survey variable are known only for the respondent subset R, neither the HT estimator \({\hat{T}}_{Y(B)}\) nor the least-squares coefficient vector \(\hat{{\varvec{d}}_{B}}=\hat{{\varvec{C}}}_{B}^{-1}\hat{{\varvec{c}}}_{B}\) can be computed. Therefore, exploiting Eq. (4), a double calibration estimator can be constructed by using \({\hat{T}}_{Y(B)cal}\) instead of \({\hat{T}}_{Y(B)}\) and \(\hat{{\varvec{d}}_{R}}=\hat{{\varvec{C}}}_{R}^{-1}\hat{{\varvec{c}}}_{R}\) instead of \(\hat{{\varvec{d}}_{B}}\), where \(\hat{{\varvec{C}}}_{R}=\sum _{j\in R}\frac{\varvec{{\varvec{z}}}_{j}\varvec{{\varvec{z}}}_{j}^{t}}{\pi _{j}}\) and \(\hat{{\varvec{c}}}_{R}=\sum _{j\in R}\frac{y_{j}\varvec{{\varvec{z}}}_{j}}{\pi _{j}}\). Practically speaking, the resulting estimator of the whole population total turns out to be
With the double calibration estimator, the information provided by \({\varvec{X}}\) and \({\varvec{Z}}\) is exploited to handle both nonresponses and population undercoverage.
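The two steps can be combined in a few lines. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes that the first calibration has the form \(\hat{{\varvec{b}}}_{R}^{t}{\varvec{T}}_{X(B)}\) and that the undercoverage correction adds \(\hat{{\varvec{d}}}_{R}^{t}({\varvec{T}}_{Z}-\hat{{\varvec{T}}}_{Z(B)})\), as suggested by the substitution described above. All function and argument names are hypothetical.

```python
import numpy as np


def weighted_ls(y, v, pi):
    """Design-weighted least-squares coefficients (sum v v'/pi)^{-1} (sum y v/pi)."""
    w = 1.0 / np.asarray(pi, float)
    return np.linalg.solve((v * w[:, None]).T @ v, v.T @ (w * y))


def double_calibration_total(y_r, x_r, z_r, pi_r, z_s, pi_s, t_x_b, t_z):
    """Double calibration estimate of the whole-population total T_Y.

    y_r, x_r, z_r, pi_r : values for the respondent subsample R
    z_s, pi_s           : Z values and inclusion probabilities for the whole
                          sample S (Z is known for nonrespondents too)
    t_x_b               : known X totals in the sampled subpopulation U_B
    t_z                 : known Z totals in the whole population U
    X and Z are assumed to carry the unit constant as first column.
    """
    b_r = weighted_ls(y_r, x_r, pi_r)          # step 1: nonresponse
    d_r = weighted_ls(y_r, z_r, pi_r)          # step 2: undercoverage
    t_y_b_cal = b_r @ t_x_b                    # calibrated estimate of T_Y(B)
    t_z_b_hat = z_s.T @ (1.0 / np.asarray(pi_s, float))  # HT estimate of T_Z(B)
    return float(t_y_b_cal + d_r @ (t_z - t_z_b_hat))
```

In this form the second term vanishes when the HT estimate of the \({\varvec{Z}}\) totals in \(U_{B}\) already matches the whole-population totals, so the correction is driven entirely by the auxiliary mass that the sampled subpopulation does not cover.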
4 Statistical properties of the double calibration estimator
Denote by \(U_{B(R)}\) the stratum of respondent units in the subpopulation \(U_{B}\) and by \(U_{B(NR)}\) the stratum of nonrespondent units. As suggested by Fattorini et al. (2013), introduce a dummy variable as \(r_{j}=1\) if \(j\in U_{B(R)}\) and \(r_{j}=0\) if \(j\in U_{B(NR)}\). Using the indicators \(r_{j}\), the quantities \(\hat{{\varvec{A}}}_{R}\), \(\hat{{\varvec{a}}}_{R}\), \(\hat{{\varvec{C}}}_{R}\) and \(\hat{{\varvec{c}}}_{R}\) can be rewritten as \(\hat{{\varvec{A}}}_{R}=\sum _{j\in S}\frac{r_{j}{\varvec{x}}_{j}{\varvec{x}}_{j}^{t}}{\pi _{j}}\), \(\hat{{\varvec{a}}}_{R}=\sum _{j\in S}\frac{r_{j}y_{j}\varvec{{\varvec{x}}}_{j}}{\pi _{j}}\), \(\hat{{\varvec{C}}}_{R}=\sum _{j\in S}\frac{r_{j}\varvec{{\varvec{z}}}_{j}\varvec{{\varvec{z}}}_{j}^{t}}{\pi _{j}}\) and \(\hat{{\varvec{c}}}_{R}=\sum _{j\in S}\frac{r_{j}y_{j}\varvec{{\varvec{z}}}_{j}}{\pi _{j}}\). Therefore, the previous matrices and vectors as well as the double calibration estimator \({\hat{T}}_{Y(dcal)}\) depend on the selection of the sole sample S, while nonresponses are accounted for in the \(r_{j}\)s, which are a fixed characteristic of the population.
It is worth noting that in this perspective, \(\hat{{\varvec{A}}}_{R}\), \(\hat{{\varvec{a}}}_{R}\), \(\hat{{\varvec{C}}}_{R}\), \(\hat{{\varvec{c}}}_{R}\) and \(\hat{{\varvec{T}}}_{Z(B)}\) are HT estimators of \({\varvec{A}}_{R}=\sum _{j\in U_{B}}r_{j}{\varvec{x}}_{j}{\varvec{x}}_{j}^{t}=\sum _{j\in U_{B(R)}}{\varvec{x}}_{j}{\varvec{x}}_{j}^{t}\), \({\varvec{a}}_{R}=\sum _{j\in U_{B}}r_{j}y_{j}{\varvec{x}}_{j}=\sum _{j\in U_{B(R)}}y_{j}{\varvec{x}}_{j}\), \({\varvec{C}}_{R}=\sum _{j\in U_{B}}r_{j}{\varvec{z}}_{j}{\varvec{z}}_{j}^{t}=\sum _{j\in U_{B(R)}}{\varvec{z}}_{j}{\varvec{z}}_{j}^{t}\), \({\varvec{c}}_{R}=\sum _{j\in U_{B}}r_{j}y_{j}{\varvec{z}}_{j}=\sum _{j\in U_{B(R)}}y_{j}{\varvec{z}}_{j}\) and of \({\varvec{T}}_{Z(B)}\), respectively. Therefore, because \({\hat{T}}_{Y(dcal)}\) is differentiable with respect to \(\hat{{\varvec{A}}}_{R}\), \(\hat{{\varvec{a}}}_{R}\), \(\hat{{\varvec{C}}}_{R}\), \(\hat{{\varvec{c}}}_{R}\) and \(\hat{{\varvec{T}}}_{Z(B)}\), it can be approximated up to the first term by a Taylor series around the true population counterparts \({\varvec{A}}_{R}\), \({\varvec{a}}_{R}\), \({\varvec{C}}_{R}\), \({\varvec{c}}_{R}\) and \({\varvec{T}}_{Z(B)}\). The equation of the firstorder Taylor series approximation of \({\hat{T}}_{Y(dcal)}\) is derived in Appendix A.2.
4.1 Approximate expectation
From the firstorder Taylor series approximation of \({\hat{T}}_{Y(dcal)}\) it immediately follows that
where \({\varvec{b}}_{R}={\varvec{A}}_{R}^{-1}{\varvec{a}}_{R}\) is the least-squares coefficient vector of the regression of Y vs \({\varvec{X}}\) performed on the respondent stratum \(U_{B(R)}\) and \({\varvec{d}}_{R}={\varvec{C}}_{R}^{-1}{\varvec{c}}_{R}\) is the least-squares coefficient vector of the regression of Y vs \({\varvec{Z}}\) performed in the same stratum. Starting from Eq. (6), some algebra shown in Appendix A.3 proves that the double calibration estimator is unbiased up to the first-order approximation if:

1. the linear relationship between Y and \({\varvec{X}}\) is similar in the respondent and nonrespondent strata of \(U_{B}\), i.e. \({\varvec{b}}_{R}\approx {\varvec{b}}_{NR}\), where \({\varvec{b}}_{NR}\) is the least-squares coefficient vector of the regression of Y vs \({\varvec{X}}\) performed on the nonrespondent stratum \(U_{B(NR)}\);

2. the linear relationship between Y and \({\varvec{Z}}\) is similar in the respondent stratum and in the whole subpopulation \(U_{B}\), i.e. \({\varvec{d}}_{R}\approx {\varvec{d}}_{B}\), where \({\varvec{d}}_{B}\) is the least-squares coefficient vector of the regression of Y vs \({\varvec{Z}}\) performed on the whole subpopulation \(U_{B}\);

3. the linear relationship between Y and \({\varvec{Z}}\) is similar in the two subpopulations \(U_{B}\) and \(U-U_{B}\), i.e. \({\varvec{d}}_{B}\approx {\varvec{d}}_{NB}\), where \({\varvec{d}}_{NB}\) is the least-squares coefficient vector of the regression of Y vs \({\varvec{Z}}\) performed on the whole subpopulation \(U-U_{B}\).
It is worth noting that the approximate expectation in Eq. (6) does not depend on the design (e.g., first- and second-order inclusion probabilities), but only on the population characteristics. Therefore, under conditions 1–3, approximate design-unbiasedness holds irrespective of the sampling design adopted.
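A short sketch may help to see how the three conditions combine. It assumes that the approximate expectation has the form \({\varvec{b}}_{R}^{t}{\varvec{T}}_{X(B)}+{\varvec{d}}_{R}^{t}({\varvec{T}}_{Z}-{\varvec{T}}_{Z(B)})\), i.e. the double calibration estimator with each HT estimator replaced by its population counterpart. The totals \({\varvec{T}}_{X(B(R))}\), \({\varvec{T}}_{X(B(NR))}\), \({\varvec{T}}_{Z(NB)}={\varvec{T}}_{Z}-{\varvec{T}}_{Z(B)}\) and the corresponding totals of Y over \(U_{B(R)}\), \(U_{B(NR)}\) and \(U-U_{B}\) are notation introduced only for this illustration. Because the unit constant belongs to both \({\varvec{X}}\) and \({\varvec{Z}}\), least-squares residuals sum to zero within the set on which each regression is fitted, which gives the exact equalities below; the approximations use conditions 1, 2 and 3 in turn.

\[\begin{aligned} {\varvec{b}}_{R}^{t}{\varvec{T}}_{X(B)}&={\varvec{b}}_{R}^{t}{\varvec{T}}_{X(B(R))}+{\varvec{b}}_{R}^{t}{\varvec{T}}_{X(B(NR))}=T_{Y(B(R))}+{\varvec{b}}_{R}^{t}{\varvec{T}}_{X(B(NR))}\\ &\approx T_{Y(B(R))}+{\varvec{b}}_{NR}^{t}{\varvec{T}}_{X(B(NR))}=T_{Y(B(R))}+T_{Y(B(NR))}=T_{Y(B)},\\ {\varvec{d}}_{R}^{t}\left( {\varvec{T}}_{Z}-{\varvec{T}}_{Z(B)}\right)&={\varvec{d}}_{R}^{t}{\varvec{T}}_{Z(NB)}\approx {\varvec{d}}_{B}^{t}{\varvec{T}}_{Z(NB)}\approx {\varvec{d}}_{NB}^{t}{\varvec{T}}_{Z(NB)}=T_{Y(NB)}. \end{aligned}\]

Adding the two lines gives \(T_{Y(B)}+T_{Y(NB)}=T_{Y}\), so under the assumed form the bias comes only from the gaps between the coefficient vectors compared in conditions 1–3.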
4.2 Approximate variance and variance estimation
From equation (A.3) of Appendix A.2, the first-order Taylor series approximation of \({\hat{T}}_{Y(dcal)}\) is rewritten as a translation of an HT estimator, in the sense that
where
are the influence values (e.g. Davison and Hinkley 1997).
Therefore, the approximate variance of \({\hat{T}}_{Y(dcal)}\) turns out to be (e.g. Särndal et al. 1992, p. 175)
On the basis of Eq. (7), the well-known Sen–Yates–Grundy (SYG) variance estimator is given by
where
are the empirical influence values computed for each sample unit.
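For readers who wish to compute this kind of estimator, the following sketch applies the textbook Sen–Yates–Grundy form, \(\sum _{h>j\in S}\frac{\pi _{j}\pi _{h}-\pi _{jh}}{\pi _{jh}}\left( \frac{u_{j}}{\pi _{j}}-\frac{u_{h}}{\pi _{h}}\right) ^{2}\), to a generic vector of values. It takes the empirical influence values as given, since their expression is not reproduced here; the function names are hypothetical.

```python
import numpy as np


def syg_variance(u, pi, pi2):
    """Sen-Yates-Grundy variance estimate for the HT total of the values u.

    u   : (n,) values attached to the sampled units (here, influence values)
    pi  : (n,) first-order inclusion probabilities
    pi2 : (n, n) second-order inclusion probabilities (diagonal is ignored)
    """
    u, pi, pi2 = (np.asarray(a, float) for a in (u, pi, pi2))
    e = u / pi                                   # expanded values u_j / pi_j
    diff2 = (e[:, None] - e[None, :]) ** 2       # squared pairwise differences
    coef = (np.outer(pi, pi) - pi2) / pi2        # (pi_j pi_h - pi_jh) / pi_jh
    upper = np.triu_indices(len(u), k=1)         # each pair j < h once
    return float(np.sum(coef[upper] * diff2[upper]))


def srswor_probs(n, N):
    """Inclusion probabilities of simple random sampling without replacement."""
    pi = np.full(n, n / N)
    pi2 = np.full((n, n), n * (n - 1) / (N * (N - 1)))
    np.fill_diagonal(pi2, n / N)
    return pi, pi2
```

Under SRSWOR of n units out of N, this general expression collapses to \(N(N-n)s_{u}^{2}/n\), where \(s_{u}^{2}\) is the sample variance of the values, which provides a convenient numerical check of an implementation.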
5 Simulation study
Simulations were used to check the performance of the proposed estimator. We considered a population U of \(N=10000\) units and a subpopulation \(U_{B}\subset U\) of \(N_{B}=7500\) units. We assumed that the values \(z_{j}\) of an auxiliary variable Z were available for each \(j\in U\) and were adopted for sample undercoverage calibration. Moreover, we assumed that the values \(x_{j}\) of an auxiliary variable X achieved from additional information were available for each \(j\in U_{B}\) and were adopted in nonresponse calibration. We also assumed that the subpopulation \(U_{B}\) was partitioned into respondent and nonrespondent strata \(U_{B(R)}\) and \(U_{B(NR)}\), respectively. Three sizes were assumed for the respondent stratum, \(N_{B(R)}=2250;4500;6750\) units corresponding to response rates of 30%, 60% and 90%, respectively. Moreover, variables were generated respecting some criteria, in order to explore several scenarios, as explained below.
5.1 Unbiasedness of \({\hat{T}}_{Y(dcal)}\)
The auxiliary variables X and Z and the survey variable Y were generated from a trivariate normal distribution. The expectations and variances of X and Z were assumed to be equal to 1, while the expectation and variance of Y were assumed to be equal to 2 and 4, respectively. This setup ensured that each variable had a coefficient of variation of 1. The correlation between X and Y was set at \(\rho _{XY}=0.3;0.6;0.9\); similarly, the correlation between Z and Y was set at \(\rho _{ZY}=0.3;0.6;0.9\), giving rise to nine scenarios. The correlation between X and Z was set at the minimum possible value \(\rho _{XZ}\) such that the resulting variance-covariance matrix is positive-definite. Once the nine variance-covariance matrices were established, the 10000 values of Z and Y and the 7500 values of X were generated using the triangular square root of the variance-covariance matrix (e.g. Johnson 2013, Sect. 4.1). Subsequently, the first \(N_{B(R)}\) units of \(U_{B}\) were assumed to be the respondent portion of the population, ensuring in this way compliance with conditions 1–3, and hence the approximate unbiasedness of the double calibration estimator. Simple random sampling without replacement (SRSWOR) was the sampling scheme adopted to select samples of sizes \(n=75;250;500\) from \(U_{B}\). If the same sampling efforts were adopted to select samples from the whole population U and in the absence of nonresponses, then the HT estimator of the total would give rise to relative root mean squared errors
where \(CV_{Y}\) is the coefficient of variation of the survey variable. Equation (9) was taken as the benchmark for the performance of the double calibration estimator.
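The population set-up and the benchmark can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes that the "triangular square root" is the Cholesky factor, that the minimum admissible \(\rho _{XZ}\) is the lower root of the determinant condition plus a small margin `eps` (a hypothetical tuning constant), and that the benchmark is the standard SRSWOR result \(CV_{Y}\sqrt{1/n-1/N}\). For simplicity X is generated for all N units, although only the values in \(U_{B}\) would be used.

```python
import numpy as np


def min_rho_xz(rho_xy, rho_zy, eps=1e-3):
    """Smallest rho_XZ (plus margin eps) keeping the correlation matrix
    positive-definite: the lower root of det(corr) = 0."""
    return rho_xy * rho_zy - np.sqrt((1 - rho_xy**2) * (1 - rho_zy**2)) + eps


def generate_population(n_units, rho_xy, rho_zy, seed=0):
    """Return an (n_units, 3) array with columns X, Z, Y."""
    rho_xz = min_rho_xz(rho_xy, rho_zy)
    mean = np.array([1.0, 1.0, 2.0])             # E(X), E(Z), E(Y)
    sd = np.array([1.0, 1.0, 2.0])               # so that each CV equals 1
    corr = np.array([[1.0, rho_xz, rho_xy],
                     [rho_xz, 1.0, rho_zy],
                     [rho_xy, rho_zy, 1.0]])
    cov = corr * np.outer(sd, sd)
    chol = np.linalg.cholesky(cov)               # lower-triangular square root
    rng = np.random.default_rng(seed)
    return mean + rng.standard_normal((n_units, 3)) @ chol.T


def ht_benchmark_rrmse(cv_y, n, N):
    """RRMSE of the HT total under SRSWOR of n units from N, no nonresponse."""
    return cv_y * np.sqrt(1.0 / n - 1.0 / N)
```

With \(CV_{Y}=1\) and \(N=10000\), `ht_benchmark_rrmse` gives roughly 0.115 for \(n=75\), which indicates the scale against which the double calibration estimator is being compared.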
For each combination of respondent sizes \(N_{B(R)}\), correlations between X and Y, correlations between Z and Y, and sample sizes n, 10000 random samples were selected by means of SRSWOR from \(U_{B}\), and the double calibration estimates \({\hat{T}}_{i}\ \left( i=1,...,10000\right)\) were computed using equation (5). Moreover, from each simulated sample, the variance estimates \({\hat{V}}_{i}^{2}\ \left( i=1,...,10000\right)\) were also computed using equation (8), which under SRSWOR is reduced to
where \(s_{{\hat{u}}}^{2}\) is the sample variance of the \({\hat{u}}_{j}\)s. Once the variance estimates were computed from (10), the RRMSE estimates \(\widehat{RRMSE}_{i}=\frac{{\hat{V}}_{i}}{{\hat{T}}_{i}}\) were obtained together with the confidence intervals at the nominal level of 0.95, \({\hat{T}}_{i}\pm 2{\hat{V}}_{i}\). From the resulting Monte Carlo distributions of these quantities, the expectations \(E({\hat{T}}_{Y(dcal)})=\frac{1}{10000}\sum _{i=1}^{10000}{\hat{T}}_{i}\) and mean squared errors \(MSE({\hat{T}}_{Y(dcal)})=\frac{1}{10000}\sum _{i=1}^{10000}({\hat{T}}_{i}-T_{Y})^{2}\) of the double calibration estimator were empirically derived, from which the relative bias \(RB=\frac{E({\hat{T}}_{Y(dcal)})-T_{Y}}{T_{Y}}\) and the relative root mean squared error \(RRMSE=\frac{\sqrt{MSE({\hat{T}}_{Y(dcal)})}}{T_{Y}}\) were obtained. The expectation of the RRMSE estimator \(ERRMSEE=\frac{1}{10000}\sum _{i=1}^{10000}\widehat{RRMSE}_{i}\) and the coverage of the 0.95 confidence interval \(COV95=\frac{1}{10000}\sum _{i=1}^{10000}I({\hat{T}}_{i}-2{\hat{V}}_{i}\le T_{Y}\le {\hat{T}}_{i}+2{\hat{V}}_{i})\) were also computed. The most relevant results of the Monte Carlo simulations are shown in Tables 1 and 2, while the remaining simulation results are shown in Tables B.1–B.7 of Appendix B.
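The performance indices defined above translate directly into code. The sketch below is an illustration with hypothetical names: it takes the vector of Monte Carlo estimates, the vector of estimated standard errors (the square roots of the variance estimates) and the true total, and returns RB, RRMSE, ERRMSEE and COV95 as defined in the text.

```python
import numpy as np


def performance_indices(t_hat, v_hat, t_true):
    """Monte Carlo summary of an estimator of a total.

    t_hat  : (n_sim,) estimates of the total, one per simulated sample
    v_hat  : (n_sim,) estimated standard errors (sqrt of variance estimates)
    t_true : true population total T_Y
    """
    t_hat = np.asarray(t_hat, float)
    v_hat = np.asarray(v_hat, float)
    expectation = t_hat.mean()
    mse = np.mean((t_hat - t_true) ** 2)
    covered = (t_hat - 2 * v_hat <= t_true) & (t_true <= t_hat + 2 * v_hat)
    return {
        "RB": (expectation - t_true) / t_true,   # relative bias
        "RRMSE": np.sqrt(mse) / t_true,          # relative root MSE
        "ERRMSEE": np.mean(v_hat / t_hat),       # mean of the RRMSE estimates
        "COV95": covered.mean(),                 # share of intervals covering T_Y
    }
```

Comparing `ERRMSEE` with `RRMSE` shows whether the variance estimator tracks the actual error, while `COV95` close to 0.95 indicates that the intervals \({\hat{T}}_{i}\pm 2{\hat{V}}_{i}\) behave as intended.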
The simulation results suggest the following remarks. The first-order approximations of relative bias and RRMSE are very accurate in most cases. The discrepancies between the approximations and the empirical values obtained from the Monte Carlo distributions are usually smaller than one percentage point and decrease with higher levels of response and correlation. The theoretical findings for the bias reduction, shown in Sect. 4.1, are fully confirmed by the simulation results. The artificial populations considered in the study meet unbiasedness conditions 1–3. Indeed, the empirical values of the relative bias are negligible (invariably about one percentage point) irrespective of the level of correlation of the survey variable with the auxiliaries. While the level of correlation does not affect the bias reduction, it has a relevant impact on the precision. When correlations are strong the double calibration estimator proves efficient, reaching values of RRMSE that are even smaller than those achieved by the HT estimator with the same sampling effort and in the absence of nonresponse and undercoverage. Obviously, precision increases with the level of response.
The RRMSE estimator obtained from the variance estimator (8) is approximately unbiased and also provides confidence intervals with coverage near the nominal level of 95% in most cases. Because the estimator (8) actually estimates the approximate variance, some exceptions occur when the variance approximations (and subsequently the RRMSE) turn out to be smaller than the true values.
5.2 Robustness of \({\hat{T}}_{Y(dcal)}\) when conditions 1–3 do not hold
Additional simulations were performed to gain insight into the robustness of the proposed estimator when the approximate unbiasedness conditions 1–3 were moderately violated in such a way that an amount of bias was invariably involved. Indeed, as stated by Särndal and Lundström (2005, p. 98), when an estimator is biased, its bias should be the main concern, given that “if an estimator is greatly biased, it is poor consolation that its variance is low”. Hence, here too, if a massive bias were present it would heavily affect the RRMSE, deteriorating the estimator performance. To investigate this issue, the linear relationship between Y and X was assumed to be different in the respondent and nonrespondent strata of \(U_{B}\), as in the following scheme:

(a) when the correlation between Y and X in the respondent stratum was equal to 0.3, the same correlation in the nonrespondent stratum was decreased or increased to 0.2 or to 0.4;

(b) when the correlation between Y and X in the respondent stratum was equal to 0.6, the same correlation in the nonrespondent stratum was decreased or increased to 0.5 or to 0.7;

(c) when the correlation between Y and X in the respondent stratum was equal to 0.9, the same correlation in the nonrespondent stratum was decreased or increased to 0.8 or to 0.95.
Similarly, the linear relationship between Y and Z was assumed to be different in the subpopulations \(U_{B}\) and \(U-U_{B}\), following the scheme:

(a) when the correlation between Y and Z in the subpopulation \(U-U_{B}\) was equal to 0.3, the same correlation in the subpopulation \(U_{B}\) was decreased or increased to 0.2 or to 0.4;

(b) when the correlation between Y and Z in the subpopulation \(U-U_{B}\) was equal to 0.6, the same correlation in the subpopulation \(U_{B}\) was decreased or increased to 0.5 or to 0.7;

(c) when the correlation between Y and Z in the subpopulation \(U-U_{B}\) was equal to 0.9, the same correlation in the subpopulation \(U_{B}\) was decreased or increased to 0.8 or to 0.95.
As explained, each decrease or increase in the correlation between Y and X was paired with the corresponding decrease or increase in the correlation between Y and Z, giving rise to a total of twelve scenarios: six representing different (and weaker) relationships of Y with X and Z in the nonrespondent stratum and subpopulation \(U_{B}\), respectively; the other six representing different (and stronger) relationships. As in the previous simulation experiment, expectations and variances of X and Z were assumed to be equal to 1 and expectation and variance of Y were assumed to be equal to 2 and 4, respectively. The correlation between X and Z was set at the minimum possible value ensuring a positive-definite variance-covariance matrix. Once the twelve variance-covariance matrices were established, simulations proceeded as described above, with the same performance indices computed on the resulting Monte Carlo distributions. Some results are set out in Tables 3 and 4, while the remaining simulation results are given in Tables C.1–C.10 in Appendix C.
The simulation results suggest the following remarks. The first-order approximations of relative bias and RRMSE remain accurate, with discrepancies usually smaller than one percentage point. Even under different relationships of Y with X and Z, the relative bias remains moderate (invariably below 1.6 percentage points). The moderate increases in bias also entail moderate increases in RRMSE and approximately unbiased RRMSE estimation, with confidence intervals having coverages near their nominal value. These results show a promising robustness of the estimator in the presence of moderate differences in the relationships of Y with X and Z in respondent and nonrespondent strata and subpopulations, respectively.
6 An application to the European Union Statistics on Income and Living Conditions survey
National statistical institutes periodically collect data on living conditions through household surveys. Information contents concern several aspects of living conditions, such as, among others, features and expenses incurred to manage the dwelling, material deprivation and welfare indicators, individual and household incomes. The European Union Statistics on Income and Living Conditions survey was created from the previous experience of the European Community Household Panel (ECHP). The survey was launched in 2003 in seven countries (Belgium, Denmark, Greece, Ireland, Luxembourg, Austria and Norway), and was extended to all 28 EU member countries, plus Switzerland, Norway, Iceland, FYROM and Serbia. It is conducted yearly and gathers information about European households. Some rules on how to conduct the survey are established by Eurostat, such as, among others, the frequency and the period to which questions must refer, and the aggregation level of some longitudinal and cross-sectional estimates. Other aspects of the survey are set independently by each country, such as, for instance, the sampling design and the sample size, leading to several discrepancies between countries (see, among others, Goedemé 2013; Lohmann 2011).
Moreover, the population coverage of surveys like these is incomplete. Individuals who do not live in households, as well as the homeless, the physically or mentally unable, geographically mobile and displaced individuals are not always represented in national-level data. It is estimated that worldwide some 300 to 350 million people may be missing from survey sampling frames, at least 45% omitted altogether by design, or because they are likely to be undercounted (Carr-Hill 2013). The European Union Statistics on Income and Living Conditions survey, which involves approximately 300,000 households across Europe, is no exception: it is affected by undercoverage, and the samples selected are affected by nonresponse. We propose an example of the use of the double-calibration estimator in the 2013 wave of the European Union Statistics on Income and Living Conditions survey in Denmark (hereafter DKSILC). Data on respondents are freely available from the Eurostat website, while the further information required was taken from Statistics Denmark.
The reference population U consists of households residing in Denmark, except for those habitually living in a foreign country or in cohabitations such as orphanages, religious institutes, etc. According to the Statistics Denmark website, the household population size in 2013 was equal to 2891119 units. The DKSILC survey is based on a simple random sampling without replacement design, so that inclusion probabilities are equal for all units in the population. The sampling unit is the individual person and the household is defined as the household of which the selected person is a member. This is because a household in Denmark is defined as comprising one or more individuals. Households eligible for DKSILC are those in which the sampling unit is a person aged 16 or over, living alone or together with others in private dwellings and linked through marriage, parentage, affinity or other relationships. Hence, the eligible population \(U_{B}\) of Danish households is equal to 2416597, leading to an undercoverage rate of about 16.4% (0.164 as a proportion).
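The undercoverage rate follows directly from the two population sizes quoted above. The short Python check below reproduces the arithmetic; the variable names are illustrative only.

```python
# Household counts quoted in the text for Denmark, 2013.
n_households_total = 2_891_119     # reference population U
n_households_eligible = 2_416_597  # eligible (sampled) population U_B

# Households that can never enter the sample, and their share of U.
n_not_covered = n_households_total - n_households_eligible
undercoverage_rate = n_not_covered / n_households_total

print(f"{n_not_covered} households not covered ({undercoverage_rate:.1%})")
# prints "474522 households not covered (16.4%)"
```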
The 2013 DKSILC survey featured a nonresponse rate of about 63%. In fact, the number of respondents was equal to 5419, against a sample of 14702 households. Microdata about respondents include a great deal of information, grouped into four sections: Household Register (D), Personal Register (R), Household Data (H) and Personal Data (P). Most of the variables collected are qualitative. To implement the present case study, we use quantitative variables (in euro) with reference to the previous survey year (2012), contained in the H-section. Specifically, the tax on income and social contributions (HY140G) is used as the X variable to correct for nonresponse, while the total housing cost (HH070) is used as the Z variable to correct for undercoverage. The variable Y to be estimated is total household disposable income (HY020). Sample data suggest that both auxiliary variables are only weakly correlated with the variable to be estimated (0.38 for X and Y; 0.17 for Z and Y, in the respondent group), revealing an unfavorable situation, worse than all those presented in Sect. 5. However, the simulation results indicate that weak relationships between the survey variable and the auxiliary variables should worsen precision while leaving bias reduction largely unaffected. The estimated total household disposable income is equal to 125739.17 million euros, equivalent to an average household disposable income on U of 43491.52 euros. Since the sampling design is SRSWOR, the variance estimate is computed as in (10) and the RRMSE estimate is 0.05.
The results obtained should be read as an illustration and do not claim to be official estimates. Clearly, the quality of the results relies on the quality of the available data. Nevertheless, the results are in line with those disseminated by Statistics Denmark. In fact, the average disposable income for all households (population U) in 2012 is 329803 Danish kroner, corresponding to approximately 44221 euros (at the average exchange rate in 2013).
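The figures reported for the case study can be cross-checked with a few lines of arithmetic. The Python sketch below uses only numbers quoted in the text; the variable names and the derived quantities (implied exchange rate, relative gap) are illustrative and are not results stated by the authors.

```python
# Sample and respondent counts reported for the 2013 wave.
sample_size = 14_702
n_respondents = 5_419
nonresponse_rate = 1 - n_respondents / sample_size

# Estimated total disposable income (euros) and size of population U.
estimated_total_eur = 125_739.17e6
population_size = 2_891_119
mean_income_eur = estimated_total_eur / population_size

# Figure disseminated by Statistics Denmark, in kroner and in euros.
official_mean_dkk = 329_803
official_mean_eur = 44_221
implied_dkk_per_eur = official_mean_dkk / official_mean_eur
relative_gap = (mean_income_eur - official_mean_eur) / official_mean_eur

print(f"nonresponse rate: {nonresponse_rate:.1%}")
print(f"average income on U: {mean_income_eur:.2f} euros")
print(f"implied exchange rate: {implied_dkk_per_eur:.3f} DKK per euro")
print(f"gap relative to the official figure: {relative_gap:.1%}")
```

Under these figures the estimated average lies within about two percent of the official value, which is the sense in which the two sets of results are in line.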
7 Final remarks
The proposed double-calibration estimator can be adopted in socioeconomic surveys to jointly account for nonresponse and undercoverage, adopting a two-step calibration. The first calibration, performed to reduce nonresponse bias, requires a set of auxiliary variables whose totals are known for the sampled subpopulation and whose values are known for the respondent units in the sample. The second calibration, performed to reduce the bias generated by the cutoff sampling, requires a further set of auxiliary variables whose totals are known for the whole population and whose values are known for all the units in the sample. In this setting, no frame is necessary for the nonsampled subpopulation. If the relationships of the survey variable with the two sets of auxiliaries are approximately similar in sampled and nonsampled subpopulations as well as in respondent and nonrespondent strata (conditions 1.-3.), the proposed estimator proves to be effective for reducing bias, and is also efficient when high-quality auxiliary variables, correlated with the variable of interest, are available. Interestingly, bias remains negligible and precision remains satisfactory even after moderate changes in the relationship of the variable of interest with the auxiliary variables in the respondent and nonrespondent strata and subpopulations. Socioeconomic surveys may benefit from the application of the double-calibration estimator. It leads to results very close to those disseminated by national institutes of statistics and typically achieved by integrating several data sources, with far less effort in terms of data collection and integration.
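To make the two-step logic concrete, the following Python sketch chains two generic linear (GREG-type) calibrations on a synthetic population. It is an illustration under stated assumptions and not the authors' estimator: the helper `linear_calibration`, the synthetic variables, the cutoff rule and the response mechanism are all hypothetical, and the exact form of the double-calibration estimator and of its variance should be taken from the earlier sections of the paper.

```python
import numpy as np

def linear_calibration(weights, aux, totals):
    """Return weights w_k = d_k * (1 + aux_k' lam) whose weighted
    auxiliary totals reproduce `totals` exactly (linear calibration)."""
    weights = np.asarray(weights, dtype=float)
    aux = np.asarray(aux, dtype=float)
    totals = np.asarray(totals, dtype=float)
    current_totals = aux.T @ weights
    cross_products = aux.T @ (weights[:, None] * aux)
    lam = np.linalg.solve(cross_products, totals - current_totals)
    return weights * (1.0 + aux @ lam)

# Synthetic population U (assumed): y depends linearly on x and z.
rng = np.random.default_rng(1)
N = 5000
x = rng.gamma(shape=2.0, scale=1.0, size=N)
z = rng.gamma(shape=2.0, scale=1.0, size=N)
y = 2.0 + 3.0 * x + 1.5 * z + rng.normal(size=N)

# Cutoff rule (assumed): the 15% of units with the smallest z are not covered.
covered = z >= np.quantile(z, 0.15)
covered_idx = np.flatnonzero(covered)

# SRSWOR from the covered part U_B, then random nonresponse (assumed 50%).
n = 400
sample_idx = rng.choice(covered_idx, size=n, replace=False)
responds = rng.random(n) < 0.5
resp_idx = sample_idx[responds]

# Auxiliary matrices for respondents, each with an intercept column.
X_r = np.column_stack([np.ones(resp_idx.size), x[resp_idx]])
Z_r = np.column_stack([np.ones(resp_idx.size), z[resp_idx]])
totals_x = np.array([covered_idx.size, x[covered].sum()])  # known on U_B
totals_z = np.array([N, z.sum()])                          # known on U

# Step 1: adjust design weights for nonresponse using X totals on U_B.
d = np.full(resp_idx.size, covered_idx.size / n)
w1 = linear_calibration(d, X_r, totals_x)
# Step 2: adjust for undercoverage using Z totals on the whole population U.
w2 = linear_calibration(w1, Z_r, totals_z)

estimate = w2 @ y[resp_idx]
print(f"estimated total: {estimate:.0f}, true total: {y.sum():.0f}")
```

In this chained version the second step reproduces the Z totals exactly, while the X totals fixed in the first step are in general no longer matched exactly afterwards; how the two steps are unified in a single estimator is precisely what the paper's derivation addresses.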
Change history
23 July 2022
Missing Open Access funding information has been added in the Funding Note.
References
Benedetti R, Bee M, Espa G (2010) A framework for cutoff sampling in business survey design. J Off Stat 26(4):651
Brick JM, Montaquila JM (2009) Nonresponse and weighting. In: Handbook of statistics, volume 29, pages 163–185. Elsevier
Carr-Hill R (2013) Missing millions and measuring development progress. World Dev 46:30–44
Chang T, Kott PS (2008) Using calibration weighting to adjust for nonresponse under a plausible model. Biometrika 95(3):555–571
Davison AC, Hinkley DV (1997) Bootstrap methods and their application (vol. 1)
De Haan J, Opperdoes E, Schut CM (1999) Item selection in the consumer price index: cutoff versus probability sampling. Surv Methodol 25:31–42
Estevao VM, Särndal CE (2006) Survey estimates by calibration on complex auxiliary information. Int Stat Rev 74(2):127–147
Fattorini L, Franceschi S, Maffei D (2013) Design-based treatment of unit nonresponse in environmental surveys using calibration weighting. Biomet J 55(6):925–943
Fattorini L, Gregoire TG, Trentini S (2018) The use of calibration weighting for variance estimation under systematic sampling: applications to forest cover assessment. J Agric Biol Environ Stat: 1–16
Folsom RE, Singh AC (2000) The generalized exponential model for sampling weight calibration for extreme values, nonresponse, and poststratification. In: Proceedings of the American Statistical Association, Survey Research Methods Section, pages 598–603
Glasser G (1962) On the complete coverage of large units in a statistical study. In: Revue de l’Institut International de Statistique, pages 28–32
Goedemé T (2013) How much confidence can we have in EU-SILC? Complex sample designs and the standard error of the Europe 2020 poverty indicators. Soc Indicat Res 110(1):89–110
Groves RM, Peytcheva E (2008) The impact of nonresponse rates on nonresponse bias: a meta-analysis. Publ Opin Quart 72(2):167–189
Haziza D, Lesage É (2016) A discussion of weighting procedures for unit nonresponse. J Off Stat 32(1):129
Haziza D, Chauvet G, Deville JC (2010) Sampling and estimation in the presence of cutoff sampling. Aust New Zeal J Stat 52(3):303–319
Haziza D, Thompson KJ, Yung W (2010) The effect of nonresponse adjustments on variance estimation. Surv Methodol 36(1):35–43
Hidiroglou MA (1986) The construction of a self-representing stratum of large units in survey design. Am Stat 40(1):27–31
Holt D, Smith TF (1979) Post stratification. J R Stat Soc Ser A (Gen) 142(1):33–46
Johnson ME (2013) Multivariate statistical simulation: A guide to selecting and generating continuous multivariate distributions. John Wiley & Sons
Knaub Jr JR (2008) Cutoff vs. design-based sampling and inference for establishment surveys. InterStat
Kott PS (2006) Using calibration weighting to adjust for nonresponse and coverage errors. Surv Methodol 32(2):133
Lehtonen R, Veijanen A (2009) Design-based methods of estimation for domains and small areas. In: Handbook of statistics, volume 29, pages 219–249. Elsevier
Lohmann H (2011) Comparability of EU-SILC survey and register data: the relationship among employment, earnings and poverty. J Eur Soc Policy 21(1):37–54
Nicoletti C, Peracchi F, Foliano F (2011) Estimating income poverty in the presence of missing data and measurement error. J Bus Econ Stat 29(1):61–72
Rivest LP (2002) A generalization of the Lavallée and Hidiroglou algorithm for stratification in business surveys. Surv Methodol 28(2):191–198
Särndal CE, Lundström S (2005) Estimation in surveys with nonresponse. John Wiley & Sons
Särndal CE, Swensson B, Wretman J (1992) Model assisted survey sampling
Sigman RS, Monsour NJ (1995) Selecting samples from list frames of businesses. Bus Surv Methods 295:133
Funding
Open access funding provided by Università degli Studi di Trento within the CRUI-CARE Agreement.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Dickson, M.M., Espa, G., Fattorini, L. et al. Double-calibration estimators accounting for undercoverage and nonresponse in socioeconomic surveys. Stat Methods Appl 31, 1273–1288 (2022). https://doi.org/10.1007/s10260-022-00630-9
Keywords
 Auxiliary variables
 Calibration estimators
 First-order Taylor series approximation
 Simulation study