Double-calibration estimators accounting for under-coverage and nonresponse in socio-economic surveys

Under-coverage and nonresponse problems are jointly present in most socio-economic surveys. The purpose of this paper is to propose a completely design-based estimation strategy that accounts for both problems without resorting to models but simply performing a two-step calibration. The first calibration exploits a set of auxiliary variables only available for the units in the sampled population to account for nonresponse. The second calibration exploits a different set of auxiliary variables available for the whole population, to account for under-coverage. The two calibrations are then unified in a double-calibration estimator. Mean and variance of the estimator are derived up to the first order of approximation. Conditions ensuring approximate unbiasedness are derived and discussed. The strategy is empirically checked by a simulation study performed on a set of artificial populations. A case study is lead on Danish data coming from the European Union Statistics on Income and Living Conditions survey. The strategy proposed is flexible and suitable in most situations in which both under-coverage and nonresponse are present.


Introduction
) establish four requirements to select a probability sample, which set the perimeter for the definition of a sampling design under the randomization principle. One of them requires that the procedure to select the sample ensures invariably positive probabilities to enter the sample for all units in the population.
This requirement may not be suitable in some situations such as in establishment surveys, such as the Economic Census conducted by the U.S. Census Bureau, in which the population of businesses is characterized by a highly skewed distribution in the survey variables (Glasser, 1962). In this case, different approaches are widespread used, essentially based on the partition of population into strata determined by several business characteristics (e.g. size), and some strata are completely censused, some are sampled, and some are neglected, basing on units features or on the chance to contact them (Sigman and Monsour, 1995). As happen in establishment surveys conducted by the U.S. Bureau of Economic Analysis, very small establishments are excluded a priori from the population to be sampled, due to the costs in build and update a sampling frame for them, against of an expected slight gain in efficiency of the estimators (see e.g. Hidiroglou, 1986;De Haan et al., 1999;Rivest, 2002). These instances are known in literature as cut-off sampling (Knaub Jr, 2008;Benedetti et al., 2010;Haziza et al., 2010a). A similar position can be seen in social surveys on households, such as e.g. the Household Finance and Consumption Survey managed by the European Central Bank, characterized by the missed observation of population units considered ineligibles for the survey, scilicet dwelling that are vacant, not habitable, with non-eligible members, etc., with consequences on the estimation of living conditions and poverty rate (Nicoletti et al., 2011).
Owing to the aforementioned under-coverage of the whole population, the clas-sical Horvitz-Thompson (HT) estimator is biased in these situations. Bias is usually corrected in literature by means of model-based techniques (see, among others, Kott, 2006;Haziza et al., 2010a). Recently, a design-based solution to under-coverage problems has been proposed by Fattorini et al. (2018) adopting a calibration technique in which the weights originally attached to each sample observation are modified in such a way to be able to estimate the population totals of a set of auxiliary variables without error. The rationale behind calibration is well known: if the calibrated weights guess the population totals of the auxiliary variables without errors, they should be suitable also for estimating the total of the survey variable, providing a relationship existing between the survey variable and the auxiliaries. Obviously, calibration is likely to perform well in terms of precision under strong linear relationship.
Socio-economic surveys are also interested by unit nonresponse, which are the more frequent as more sensitive are the survey variables (e.g. sexual behavior, drug consumption, etc.). Even if undesirable, nonresponse is a natural contingency in surveys, so that damages on estimation and inference caused by them need to be addressed (Groves and Peytcheva, 2008). This argument is crucial in survey sampling theory and it is extensively treated in literature (e.g. Brick and Montaquila, 2009). Widely applied methods include poststratification (Holt and Smith, 1979) or, more recently, once again model-based techniques including imputation and nonresponse propensity weighting (Särndal and Lundström, 2005;Haziza et al., 2010b). By means of imputation, nonresponse values are replaced by substitutes and estimation is performed on the completed sample.
Imputed values are customarily obtained by means of a prediction model presuming a relationship between the survey variable and a set of covariates known for all the population units or for all the sampled units. In accordance with the presumed model, commonly used techniques of imputation are, for exam-ple, regression imputation, nearest neighbor imputation, hot deck imputation, and multiple imputation. Nonresponse propensity weighting assumes that each unit of the sampled population has a strictly positive probability to respond. A model is then used to estimate the probabilities of respondent units from the sample by connecting these probabilities to auxiliary information by means of logistic regression models (Chang and Kott, 2008). In addition to this source of uncertainty, the requirement of positive response probability seems tighten in socio-economic surveys, because there will always be units that do not respond in any situation (e.g. homeless and geographically mobile individuals and families). Alternatively,  attempt a complete designbased solution in which population values and nonresponse are viewed as fixed characteristics. To this purpose, they once again use the calibration technique, termed in the literature as nonresponse calibration weighting by Haziza et al. (2010b). In this case, weights originally attached to each respondent observation are modified in such a way to be able at estimating the population totals of a set of auxiliary variables without error.
In most cases under-coverage and nonresponse problems are jointly present in socio-economic surveys. Therefore, a general indication in the treatment of both problems concern the use of any available auxiliary information, even if some of them are not available for all units of the population. In this paper, we build on the availability of a set of auxiliary variables for the whole population while another set is available only for the sampled portion. In establishments surveys, for example, many financial information may be available only for businesses of adequate size, such as corporations, while they may be not for those small businesses excluded from sampling, such as micro-enterprises. Moreover, owing to the recent data collection developments, the additional information may arise from big data, e.g. data coming from internet and telephone use, social networks, online purchases, etc.
The purpose of this paper is to propose double-calibration estimators. The use of calibration in two or more steps has been used by Folsom and Singh (2000) and Estevao and Särndal (2006). Here we propose a completely design-based estimation strategy that considers both under-coverage and nonresponse problems, solving them without resort to models but simply by performing a double calibration. The first calibration exploits a set of auxiliary variables available only for the units in the sampled population to account for nonresponse; the second calibration exploits a different set of auxiliary variables available for the whole population, to account for under-coverage. Joining together the two calibrations, we propose a double-calibration estimator that is applicable to all cases in which both under-coverage and nonresponse problems are present.
The paper is structured as follow. In Section 2, some preliminaries and notations are given. Section 3 is devoted to the costruction of the double-calibration estimator and in Section 4 some design-based properties (expectation and variance) are derived. In order to check the efficiency of the strategy, in Section 5 a Monte Carlo simulation study exploring several scenarios is performed. In Section 6, by using data coming from the European Union Statistics on Income and Living Conditions survey and from Statistics Denmark data, a case study to estimate the total income of Danish households in 2013 is presented and discussed. Some concluding remarks are given in Section 7.

Preliminaries and notation
Denote as U = {u 1 , ..., u N } a finite population of N units. Let y j , with j ∈ U , the value for unit j of the survey variable Y . We aim at estimating the population total T Y = j∈U y j . For the whole population there exists a vector Z of M auxiliary variables whose values z j = [z j1 , ..., z jM ] t are known for each j ∈ U , in such a way that the vector of totals T Z = j∈U z j is also known.
In this setting, for one of the reasons mentioned in the introduction, only a sub-population U B of size N B < N units is sampled using a fixed-size design having first-and second-order inclusion probabilities π j , π jh for any h > j ∈ U B .
Denote by T Y (B) = j∈UB y j the unknown total of Y in U B . Moreover, suppose that additional information exists in the sub-population U B , possibly arising from big data sources. More precisely suppose that there exists a vector X of K auxiliary variables whose values x j = [x j1 , ..., x jM ] t are known for each j ∈ U B in such a way that the vector of totals T X(B) = j∈UB x j is also known.
In this setting, denote by T Z(B) = j∈UB z j the known vector of total of the z j s in the sub-population U B .
A random sample S of n < N B units is selected from the sub-population U B by means of the adopted sampling scheme. As often happens in practice, especially in socio-economic surveys, the sample may be affected by nonresponses, in such a way that the sample is split into two sub-samples, the sub-sample R ⊂ S of the respondent units and the sub-sample S − R of the nonrespondent units.
The above presented setup shows two problems to solve: first, a correction for nonresponses is necessary, in order to estimate T Y (B) ; second, since the sample S is selected from U B and not from U , any T Y (B) estimator is biased, so that it needs for a correction in order to estimate T Y . We propose a design-based calibration in two steps, developed in next sub-sections.
3 The double-calibration estimator 3.1 First calibration: from respondent group to sampled sub-population The first issue to deal with is the nonresponse problem occurring in a sample.
Since S is selected in U B , in absence of nonresponses, it would be possible to estimate T Y (B) by means of the well-known HT estimator andT Y (B) would be an unbiased estimator for T Y (B) . However, owing to nonresponses,T Y (B) is unknown and the response-based estimator is a biased estimator of T Y (B) . Following results obtained in Särndal and Lundström (2005), the bias may be reduced by exploiting theX-vector of auxiliary information. The resulting estimator iŝ RâR is the least-square coefficient vector of the regression of Y vs X, performed on the respondent sample R, i.e.Â R = j∈R xj x t j πj and a R = j∈R yj xj πj and the unit constant is tacitly adopted as the first auxiliary variable in the vectorX.
The design-based properties ofT Y (B)cal are derived in .
The population is partitioned into respondent and nonrespondent strata and the estimator is approximately unbiased if the relationship between Y and X is similar in both the strata. Practically speaking, this condition is similar to those assumed in most model-based nonresponse treatment even if not embedded into models.
3.2 Second calibration: from sampled sub-population to the whole population If we suppose once again that the unit constant is adopted as the first auxil-iary variable in the vector Z, then the calibration estimator (3) could be rewritten asT whereT Z(B) = j∈S zj πj is the HT estimator of the totals of the z j s in the sampled sub-population U B (see Appendix A.1 for the proof).
However, the estimatorT Y (cal) is only virtual, because having the values of the survey variable only for the respondent subset R, neither the HT estimator BĉB are known. Therefore, exploiting equation (4), a double calibration estimator can be constructed Practically speaking, the resulting estimator of the whole population total turns out to bê By the double calibration estimator, the information provided by X and Z is exploited for handling both nonresponses and population under-coverage in a design-based framework.

Design-based properties of the double calibration estimator
Let denote by U B(R) the stratum of respondent units in the sub-population U B and by U B(N R) the stratum of nonrespondent units. As suggested by , let introduce a dummy variable as r j = 1 if j ∈ U B(R) and πj . Due to that, the previous matrices and vectors as well as the double calibration estimatorT Y (dcal) depend on the selection of the sole sample S, while nonresponses are accounted for in the r j s, which are a fixed characteristic of the population, as is required in a design-based perspective.
It is worth noting that in this perspective, , it can be approximated up to the first term by a Taylor series around the true population counter-

Approximate expectation
From the first-order Taylor series approximation ofT Y (dcal) it immediately follows that Exploiting equation (6), after some algebra reported in Appendix A.3, it is proven that the double calibration estimator is unbiased up to the first-order approximation if: 1. the linear relationship between Y and X is similar in the respondent and 2. the linear relationship between Y and Z is similar in the respondent stratum and in the whole sub-population 3. the linear relationship between Y and Z is similar in the two sub-populations

Approximate variance and variance estimation
From equation (A.3) of Appendix A.2, the first-order Taylor series approximation ofT Y (dcal) is rewritten as a translation of a HT estimator, in the sense are the influence values (e.g. . Therefore, the approximate variance ofT Y (dcal) turns out to be (e.g. Särndal et al., 1992, p. 175) On the basis of equation (7), the well-known Sen-Yates-Grundy (SYG) variance estimator is given bŷ are the empirical influence values computed for each sample unit.

Simulation study
Simulations were used to check the performance of the proposed estimator. We considered a population U of N = 10, 000 units and a sub-population U B ⊂ U of N B = 7, 500 units. We supposed that the values z j of an auxiliary variable Z were available for each j ∈ U and were adopted for sample under-coverage calibration. Moreover, we supposed that the values x j of an auxiliary variable X achieved from additional information were available for each j ∈ U B and were adopted in nonresponse calibration. We also supposed that the subpopulation U B was partitioned into respondent and non-respondent strata U B(R) and U B(N R) , respectively. Three sizes were supposed for the respondent stratum, N B(R) = 2, 250; 4, 500; 6, 750 units corresponding to response rates of 30%, 60% and 90%, respectively. The auxiliary variables X and Z and the survey variables Y were generated from a tri-variate normal distribution. The expectations and variances of X and Z were supposed to be equal to 1, while the expectation and variance of Y were supposed to be equal to 2 and 4, respectively. These set ups assured that each variable had a coefficient of variation of 1. The correlation between X and Y was set to be ρ XY = 0.3; 0.6; 0.9; similarly, the correlation between Z and Y was set to be ρ ZY = 0.3; 0.6; 0.9, giving rise to nine scenarios. The correlation between X and Z was set to the minimum possible value ρ XZ such that the resulting variance-covariance matrix is positive definite. Once the nine variance-covariance matrices were established the 10,000 values of Z and Y and the 7,500 values of X were generated using the triangular square root of the variance-covariance matrix (e.g. Johnson, 2013, Sect. 4.1). Subsequently, the first N B(R) units of U B were supposed to be the respondent portion of the population, ensuring in this way conditions 1.-3., i.e. the approximate unbiasedness of the double calibration estimator. Simple random sampling without replacement (SRSWOR) was the sampling scheme adopted to select samples of sizes n = 75; 100; 150; 250; 375; 500 from U B . If the same sampling efforts were adopted to select samples from the whole population U and in absence of nonresponses, then the HT estimator of the total would give rise to relative root means squared errors where CV Y is the coefficient of variation of the survey variable. Equation (9) was taken as benchmark for the performance of the double calibration estimator.
For each combination of respondent sizes N B(R) , correlations between X and Y , correlations between Z and Y , and sample sizes n, 10,000 random samples were selected by means of SRSWOR from U B , and the double calibration esti-matesT i = (i = 1, ..., 10000) were computed using equation (5). Moreover, from each simulated sample, the variance estimatesV 2 i = (i = 1, ..., 10000) were also computed using equation (8), that under SRSWOR reduces tô where s 2 u is the sampling variance of theû j s. Once the variance estimates were computed from (10), the RRM SE estimatesRRM SE i =V î Ti were achieved together with the confidence intervals at the nominal level of 0.95, T i ± 2V i . Therefore, from the resulting Monte Carlo distributions of these quantities, the expectations E(T Y (dcal) ) = 1 10000 10000 i=1T i and mean squared errors The simulation results motivate the following remarks.
The first order approximation of relative bias and RRM SE are very accurate in most cases. The discrepancies between approximation and the empirical values achieved from the Monte Carlo distributions are usually smaller than one percent point that become lower with high levels of response and correlations.
In these cases, approximations turn out to be smaller than the true values at most by three percent points (Table B. In these cases, an under-evaluation of about three percent points produced a coverage only about 80% (see Table B.9). Some rules to conduct the survey are set by the Eurostat, as, among others, the frequency and the period to which questions must be referred, and the aggregation level of some longitudinal and cross-sectional estimates. Other aspects of the survey are set independently by each country, such as, for instance, the sampling design and the sample size, leading to several discrepancies among countries (see, among others, Goedemé, 2013;Lohmann, 2011).
Moreover, the population coverage of surveys like these is incomplete. Individuals who do not live in households, as well as homeless, physically or mentally unable, geographically mobile and displaced individuals are not always represented in national-level data. It is estimated that worldwide some 300 to 350 million people may be missing from survey sampling frames, at least 45% omitted altogether by design, or because they are likely to be undercounted information, most of which are qualitative. To implement the present case study, we use quantitative variables (in euro) referred to the previous year of survey (2012), contained in the H-section. Specifically, the tax on income and social contributions (HY140G) are used as the X variable to correct for nonresponse, while the total housing cost (HH070) are used as the Z variable to correct for under-coverage. The variable Y under estimation is the total household disposable income (HY020). Sample data suggest that both auxiliary variables are slightly correlated with the variable under estimation (0.38 among X and Y ; 0.17 among Z and Y , in the respondent group), revealing an unfavorable situation, worse than all presented in section 5. However, from simulation results, the weak relationships between the survey and the auxiliary variable should de-teriorate precision but, fortunately, should not deteriorate bias reduction. The estimated total household disposable income is equal to 125,739.17 million euro, equivalent to an average household disposable income on U equal to 43,491.52 euro. Since the sampling design is SRSWOR, the variance estimate is computed as in (10)

Final remarks
The double calibration estimator can be adopted in socio-economic surveys to jointly account for nonresponse and under-coverage in a complete designbased framework, without resorting to model but simply adopting a two-step calibration. The first calibration, performed to reduce nonresponse bias, needs for a set of auxiliary variables whose totals are known for the sampled subpopulations and whose values are known for the respondent units in the sample.
The second calibration, performed to reduce the bias generated by the cut-off sampling, needs for a further set of auxiliary variables whose totals are known for the whole populations and whose values are known for all the units in the sample. Interestingly, no list frame is necessary for the non-sampled sub-population.
If the relationships of the survey variable with the two sets of auxiliaries are approximately similar in sampled and non-sampled sub-populations as well as in respondent and nonrespondent strata (conditions 1.-3.), the proposed estimator proves to be effective for reducing bias, being also efficient for high-quality auxiliary variables correlated with the variable of interest. Socio-economic surveys may benefit from application of double-calibration estimator. It leads to results very near with those disseminated by national institutes of statistics and typically collected by integrating several data sources, with far less effort in terms of data collection and integration. Särndal, C.-E., Swensson, B., and Wretman, J. (1992

A Appendix A -Results of Sections 3 and 4
A.1 An alternative formulation of the calibration estimator (3) Because the unit constant is adopted as the first auxiliary variable, then e t 1 z j = 1, ∀j ∈ U , where e t 1 = [1, 0, ..., 0] is the first vector of the standard basis of R M . Therefore, the difference between equations (3) and (4) is given bŷ Hence, (3) and (4) coincide.

A.2 First-order Taylor series approximation ofT Y (dcal)
The double calibration estimator (5) can be rewritten aŝ whereâ kk′ R denotes the kk' element ofÂ where E kk′ is the K-square matrix of 0s, with a 1 in position kk', and E mm′ is the M -square matrix of 0s, with a 1 in position mm'. Evaluating these partial derivatives at the expected pointsâ R = a R ,Â R = A R ,ĉ R = c R ,Ĉ R = C R and T Z(B) = T Z(B) , the first-order Taylor series approximation ofT Y (dcal) gives rise to Grouping the constant terms, equation A.2 can be rewritten as are the influence values (e.g. .

A.3 Approximate unbiasedness of the double calibration estimator
Regarding the first term of equation (6),  show that in such a way that, if condition 2. holds, i.e. if d R ≈ d B then where d N B is the least-square coefficient vector of the regression of Y vs Z performed on the unsampled portion of the population U − U B and the e N B js are the regression residuals.
Because regression residuals sum to 0, then B Appendix B -Results of simulation study presented in Section 5    Table B.8: Percentage values of relative bias (RB), first order approximation of relative root mean squared error (ARRM SE), relative root mean squared error (RRM SE), expectation of the relative root mean squared error estimator (ERRM SEE) and coverage of the nominal 95% confidence intervals achieved from a population of 10,000 units, a sampled sub-population of 7,500 units with 2,250; 4,500 and 7,500 respondent units, sample sizes n = 75; 100; 150; 250; 375; 500 selected by means of simple random sampling without replacement and correlations ρ XY = 0.9, ρ ZY = 0.6 and ρ XZ = 0. Values in parentheses are the RRM SEs of the Horvitz-Thompson estimator in absence of nonresponse and under-coverage.  Table B.9: Percentage values of relative bias (RB), first order approximation of relative root mean squared error (ARRM SE), relative root mean squared error (RRM SE), expectation of the relative root mean squared error estimator (ERRM SEE) and coverage of the nominal 95% confidence intervals achieved from a population of 10,000 units, a sampled sub-population of 7,500 units with 2,250; 4,500 and 7,500 respondent units, sample sizes n = 75; 100; 150; 250; 375; 500 selected by means of simple random sampling without replacement and correlations ρ XY = 0.9, ρ ZY = 0.9 and ρ XZ = 0. Values in parentheses are the RRM SEs of the Horvitz-Thompson estimator in absence of nonresponse and under-coverage.