1 Introduction

Särndal et al. (1992, p. 8) establish four requirements to select a probability sample, setting the perimeter for the definition of a sampling design under the randomization principle. One requirement is that the procedure to select the sample ensure invariably positive probabilities to enter the sample for all units in the population.

This requirement may not be suitable in some situations, for example in establishment surveys such as the Economic Census conducted by the U.S. Census Bureau, in which the population of businesses is characterized by a highly skewed distribution in the survey variables (Glasser 1962). In this case, different approaches are commonly used, essentially based on the partition of the population into strata determined by several business characteristics (e.g. size): some strata are completely censused, some are sampled, and some are neglected, based on the features of units or the ability to contact them (Sigman and Monsour 1995). As happens in establishment surveys conducted by the U.S. Bureau of Economic Analysis, very small establishments are excluded a priori from the population to be sampled, because the costs of building and updating a sampling frame outweigh the expected slight gain in efficiency of the estimators (see e.g. Hidiroglou 1986; De Haan et al. 1999; Rivest 2002). These instances are known in the literature as cut-off sampling (Knaub 2008; Benedetti et al. 2010; Haziza et al. 2010a). A similar situation arises in social surveys on households, such as the Household Finance and Consumption Survey managed by the European Central Bank, characterized by the missed observation of population units considered ineligible for the survey, i.e. dwellings that are vacant, not habitable, with non-eligible members, etc., with consequences for the estimation of living conditions and poverty rate (Nicoletti et al. 2011). In this framework it is worth distinguishing between cut-off sampling, alternatively referred to as planned under-coverage, which is often used in socio-economic surveys, and unplanned under-coverage, which is typical in social surveys. In the first case, auxiliary information is available for all the units in the non-covered portion of the population, whereas only population totals are available in the second case (see e.g. Lehtonen and Veijanen 2009).

Owing to the aforementioned under-coverage of the whole population, the unadjusted estimator is biased in these situations. Bias is usually corrected in the literature by means of model-based techniques (see, among others, Kott 2006; Haziza et al. 2010a). Recently, a solution to the under-coverage problem has been proposed by Fattorini et al. (2018) in which the properties of the resulting estimator are evaluated in relation to the sampling design while all the population characteristics are held fixed. In particular, the authors propose adopting a calibration technique in which the weights originally attributed to each sample observation are modified in such a way as to estimate the population totals of a set of auxiliary variables without error. The rationale behind calibration is evident: if the calibrated weights reproduce the population totals of the auxiliary variables without error, they should also be suitable for estimating the total of the survey variable, provided a relationship exists between the survey variable and the auxiliaries. Obviously, calibration is likely to perform well in terms of precision under a strong linear relationship.

Socio-economic surveys also involve unit nonresponse, the more so the higher the sensitivity of the survey variables (e.g. sexual behavior, drug consumption, etc.). However undesirable, nonresponse is a natural contingency in surveys, so the damage to estimations and inferences needs to be addressed (Groves and Peytcheva 2008). This is crucial in survey sampling theory and is extensively treated in the literature (e.g. Brick and Montaquila 2009). Widely applied methods include post-stratification (Holt and Smith 1979), response homogeneity groups (Särndal et al. 1992), and, more recently, model-based techniques including imputation and nonresponse propensity weighting (Särndal and Lundström 2005; Haziza et al. 2010b). In particular, nonresponse propensity weighting assumes that each unit of the sampled population has a strictly positive probability of responding. A model is then used to estimate the response probabilities of the respondent units in the sample by connecting these probabilities to auxiliary information by means of logistic regression models (Chang and Kott 2008). In addition to this source of uncertainty, the requirement of a positive response probability appears restrictive in socio-economic surveys, because some units will not respond in any situation (e.g. homeless and geographically mobile individuals and families). Alternatively, Fattorini et al. (2013) attempt a design-based solution in which population values and nonresponse are viewed as fixed characteristics. For this purpose, they once again use the calibration technique, defined in the literature as nonresponse calibration weighting by Haziza et al. (2010b). In this case, the weights originally attributed to each respondent unit are modified in such a way as to estimate the population totals of a set of auxiliary variables without error.

In most cases under-coverage and nonresponse problems are jointly present in socio-economic surveys. Therefore, a general indication in the treatment of both problems concerns the use of any available auxiliary information, even if some is not available to all units of the population. In this paper, we build on the availability of a set of auxiliary variables for the whole population while another set is available only for the sampled portion. In establishment surveys, for example, much financial information may be available only for businesses of adequate size, such as corporations, and may not be for small businesses excluded from the sampling, such as micro-enterprises. Moreover, owing to recent data collection developments, the additional information may derive from big data, e.g. data from internet and telephone use, social networks, online purchases, etc.

The purpose of this paper is to propose double-calibration estimators. The use of calibration in two or more steps is not new and has already been used, among others, by Folsom and Singh (2000) and Estevao and Särndal (2006). Moreover, it has been routinely adopted by National Statistical Offices for many years. Here we propose an estimation strategy that considers both under-coverage and nonresponse problems, solving them by performing double calibration. The first calibration exploits a set of auxiliary variables available only for the units in the sampled population to account for nonresponse; the second calibration exploits a different set of auxiliary variables available for the whole population, to account for under-coverage. Joining together the two calibrations, we propose a double-calibration estimator that is applicable to all cases in which both under-coverage and nonresponse problems are present.

The paper is structured as follows. In Sect. 2, some preliminaries and notation are given. Section 3 is devoted to the construction of the double-calibration estimator and in Sect. 4 some statistical properties (expectation and variance) are derived. In order to check the efficiency of the strategy, in Sect. 5 Monte Carlo simulation studies are performed to explore several scenarios. In Sect. 6, using data from the European Union Statistics on Income and Living Conditions survey and from Statistics Denmark, a case study to estimate the total income of Danish households in 2013 is presented and discussed. Some concluding remarks are given in Sect. 7.

2 Preliminaries and notation

Denote by \(U=\left\{ u_{1},...,u_{N}\right\}\) a finite population of N units. Let \(y_{j}\), with \(j\in U\), be the value for unit j of the survey variable Y. We aim to estimate the population total \(T_{Y}=\sum _{j\in U}y_{j}\). For the whole population there exists a vector \({\varvec{Z}}\) of M auxiliary variables whose values \({\varvec{z}}_{j}=\left[ z_{j1},...,z_{jM}\right] ^{t}\) are known for each \(j\in U\), in such a way that the vector of totals \({\varvec{T}}_{Z}=\sum _{j\in U}{\varvec{z}}_{j}\) is also known.

In this setting, for one of the reasons mentioned in the introduction, only a sub-population \(U_{B}\) of size \(N_{B}<N\) units is sampled using a fixed-size design having first- and second-order inclusion probabilities \(\pi _{j},\pi _{jh}\) for any \(h>j\in U_{B}\). Denote by \(T_{Y(B)}=\sum _{j\in U_{B}}y_{j}\) the unknown total of Y in \(U_{B}\). Moreover, suppose that additional information exists in the sub-population \(U_{B}\). More precisely, suppose that there exists a vector \({\varvec{X}}\) of K auxiliary variables whose values \({\varvec{x}}_{j}=\left[ x_{j1},...,x_{jK}\right] ^{t}\) are known for each \(j\in U_{B}\) in such a way that the vector of totals \({\varvec{T}}_{X(B)}=\sum _{j\in U_{B}}{\varvec{x}}_{j}\) is also known. In this setting, denote by \({\varvec{T}}_{Z(B)}=\sum _{j\in U_{B}}{\varvec{z}}_{j}\) the known vector of totals of the \({\varvec{z}}_{j}\)s in the sub-population \(U_{B}\).

A random sample S of \(n<N_{B}\) units is selected from the sub-population \(U_{B}\) by means of the adopted sampling scheme. As often happens in practice, especially in socio-economic surveys, the sample may be affected by nonresponses, in such a way that the sample is split into two sub-samples, the sub-sample \(R\subset S\) of the respondent units and the sub-sample \(S-R\) of the nonrespondent units.

The setting presented above poses two problems: first, a correction for nonresponses is necessary in order to estimate \(T_{Y(B)}\); second, since the sample S is selected from \(U_{B}\) and not from U, any estimator of \(T_{Y(B)}\) is a biased estimator of \(T_{Y}\), so a further correction is needed in order to estimate \(T_{Y}\). We propose a calibration in two steps, developed in the following sub-sections.

3 The double-calibration estimator

3.1 First calibration: from respondent group to sampled sub-population

The first issue to deal with is the nonresponse problem in a sample. Since S is selected in \(U_{B}\), in the absence of nonresponses, it would be possible to estimate \(T_{Y(B)}\) by means of the well-known Horvitz–Thompson (HT) estimator

$$\begin{aligned} {\hat{T}}_{Y(B)}=\sum _{j\in S}\frac{y_{j}}{\pi _{j}} \end{aligned}$$
(1)

and \({\hat{T}}_{Y(B)}\) would be an unbiased estimator for \(T_{Y(B)}\) if all \(\pi _{j}\) are positive. However, owing to nonresponses, any unadjusted estimator is destined to be a biased estimator of \(T_{Y(B)}\). Following results obtained in Särndal and Lundström (2005), the bias may be reduced by exploiting the \({\varvec{X}}\)-vector of auxiliary information. The resulting estimator is

$$\begin{aligned} {\hat{T}}_{Y(B)cal}=\hat{{\varvec{b}}}_{R}^{t}{\varvec{T}}_{X(B)} \end{aligned}$$
(2)

where \(\hat{{\varvec{b}}}_{R}=\hat{{\varvec{A}}}_{R}^{-1}\hat{{\varvec{a}}}_{R}\) is the least-square coefficient vector of the regression of Y vs \({\varvec{X}},\) performed on the respondent sample R, i.e. \(\hat{{\varvec{A}}}_{R}=\sum _{j\in R}\frac{{\varvec{x}}_{j}{\varvec{x}}_{j}^{t}}{\pi _{j}}\) and \(\hat{{\varvec{a}}}_{R}=\sum _{j\in R}\frac{y_{j}{\varvec{x}}_{j}}{\pi _{j}}\) and the unit constant is tacitly adopted as the first auxiliary variable in the vector \({\varvec{X}}\).

The properties of \({\hat{T}}_{Y(B)cal}\) are derived in Fattorini et al. (2013). The population is partitioned into respondent and nonrespondent strata and the estimator is approximately unbiased if the relationship between Y and \({\varvec{X}}\) is similar in both the strata. Practically speaking, this condition is similar to the one assumed in most model-based nonresponse treatments (for a discussion, see Haziza and Lesage 2016).
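As an illustration of Eq. (2), the following minimal sketch computes \(\hat{{\varvec{b}}}_{R}\) and the first-calibration estimate with NumPy. The function name and argument names are hypothetical and not part of the original paper. The sketch also returns the equivalent calibrated weights \(w_{j}={\varvec{x}}_{j}^{t}\hat{{\varvec{A}}}_{R}^{-1}{\varvec{T}}_{X(B)}/\pi _{j}\), which reproduce \({\varvec{T}}_{X(B)}\) exactly when applied to the \({\varvec{x}}_{j}\)s of the respondents.

```python
import numpy as np

def first_calibration(y_r, X_r, pi_r, T_X_B):
    """Estimate T_Y(B) from the respondents via Eq. (2).

    y_r   : (r,) survey values of the respondents
    X_r   : (r, K) auxiliary values, first column equal to 1
    pi_r  : (r,) first-order inclusion probabilities
    T_X_B : (K,) known totals of X in the sampled sub-population
    """
    d = 1.0 / pi_r                          # design weights 1 / pi_j
    A_hat = (X_r * d[:, None]).T @ X_r      # sum over R of x x^t / pi
    a_hat = X_r.T @ (d * y_r)               # sum over R of y x / pi
    b_hat = np.linalg.solve(A_hat, a_hat)   # least-squares coefficients
    # Equivalent calibrated weights: w_j = x_j^t A^{-1} T_X(B) / pi_j
    w = d * (X_r @ np.linalg.solve(A_hat, T_X_B))
    return b_hat @ T_X_B, w
```

Under these assumptions, `w @ X_r` returns `T_X_B` and `w @ y_r` equals the estimate, which is the sense in which the weights are calibrated.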

3.2 Second calibration: from sampled sub-population to the whole population

Because \({\hat{T}}_{Y(B)cal}\) is, at most, an approximately unbiased estimator of \(T_{Y(B)}\), it is a biased estimator of \(T_{Y}\). Indeed, the sampling scheme adopted to select S generates a sampling design onto \(U_{B}\) but not onto U, and units of \(U-U_{B}\) cannot enter the sample. Therefore, the missed selection of some population units leads to a bias due to population under-coverage and it is necessary to correct the estimator \({\hat{T}}_{Y(B)cal}\).

Fattorini et al. (2018) called these schemes pseudo designs and proposed a design-based calibration estimation based on a single auxiliary variable having a proportional relationship with the survey variable. In order to extend this approach to vectors of auxiliary variables and to more general linear relationships, the population under-coverage is handled by the calibration criterion proposed by Särndal and Lundström (2005). Specifically, if the \(y_{j}\)s were available for each \(j\in S\), the information furnished by the M auxiliary variables \({\varvec{Z}}\), available for all the population units, could be exploited by means of the calibration estimator

$$\begin{aligned} {\hat{T}}_{Y(cal)}=\hat{{\varvec{d}}}_{B}^{t}{\varvec{T}}_{Z} \end{aligned}$$
(3)

where \(\hat{{\varvec{d}}}_{B}=\hat{{\varvec{C}}}_{B}^{-1}\hat{{\varvec{c}}}_{B}\) is the least-square coefficient vector of the regression of Y vs \({\varvec{Z}}\), performed on the whole sample S, i.e. \(\hat{{\varvec{C}}}_{B}=\sum _{j\in S}\frac{{\varvec{z}}_{j}{\varvec{z}}_{j}^{t}}{\pi _{j}}\) and \(\hat{{\varvec{c}}}_{B}=\sum _{j\in S}\frac{y_{j}{\varvec{z}}_{j}}{\pi _{j}}\).

If we suppose once again that the unit constant is adopted as the first auxiliary variable in the vector \({\varvec{Z}}\), then the calibration estimator (3) could be rewritten as

$$\begin{aligned} {\hat{T}}_{Y(cal)}={\hat{T}}_{Y(B)}+\hat{{\varvec{d}}}_{B}^{t}({\varvec{T}}_{Z}-\hat{{\varvec{T}}}_{Z(B)}) \end{aligned}$$
(4)

where \(\hat{{\varvec{T}}}_{Z(B)}=\sum _{j\in S}\frac{\varvec{{\varvec{z}}}_{j}}{\pi _{j}}\) is the HT estimator of the totals of the \({\varvec{z}}_{j}\)s in the sampled sub-population \(U_{B}\) (see Appendix A.1 for the proof).
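A short sketch of why (3) and (4) coincide may be useful here (the full proof is the one in Appendix A.1). The coefficient vector solves the weighted normal equations \(\hat{{\varvec{C}}}_{B}\hat{{\varvec{d}}}_{B}=\hat{{\varvec{c}}}_{B}\). Because the first component of every \({\varvec{z}}_{j}\) is 1, the first of these equations reads

$$\begin{aligned} \sum _{j\in S}\frac{{\varvec{z}}_{j}^{t}\hat{{\varvec{d}}}_{B}}{\pi _{j}}=\sum _{j\in S}\frac{y_{j}}{\pi _{j}},\quad \text {i.e.}\quad \hat{{\varvec{d}}}_{B}^{t}\hat{{\varvec{T}}}_{Z(B)}={\hat{T}}_{Y(B)} \end{aligned}$$

so that adding and subtracting \(\hat{{\varvec{d}}}_{B}^{t}\hat{{\varvec{T}}}_{Z(B)}\) in (3) gives

$$\begin{aligned} \hat{{\varvec{d}}}_{B}^{t}{\varvec{T}}_{Z}=\hat{{\varvec{d}}}_{B}^{t}\hat{{\varvec{T}}}_{Z(B)}+\hat{{\varvec{d}}}_{B}^{t}({\varvec{T}}_{Z}-\hat{{\varvec{T}}}_{Z(B)})={\hat{T}}_{Y(B)}+\hat{{\varvec{d}}}_{B}^{t}({\varvec{T}}_{Z}-\hat{{\varvec{T}}}_{Z(B)}) \end{aligned}$$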

However, the estimator \({\hat{T}}_{Y(cal)}\) is only virtual: because the values of the survey variable are known only for the respondent subset R, neither the HT estimator \({\hat{T}}_{Y(B)}\) nor the least-squares coefficient vector \(\hat{{\varvec{d}}}_{B}=\hat{{\varvec{C}}}_{B}^{-1}\hat{{\varvec{c}}}_{B}\) is known. Therefore, exploiting Eq. (4), a double calibration estimator can be constructed by using \({\hat{T}}_{Y(B)cal}\) instead of \({\hat{T}}_{Y(B)}\) and \(\hat{{\varvec{d}}}_{R}=\hat{{\varvec{C}}}_{R}^{-1}\hat{{\varvec{c}}}_{R}\) instead of \(\hat{{\varvec{d}}}_{B}\), where \(\hat{{\varvec{C}}}_{R}=\sum _{j\in R}\frac{{\varvec{z}}_{j}{\varvec{z}}_{j}^{t}}{\pi _{j}}\) and \(\hat{{\varvec{c}}}_{R}=\sum _{j\in R}\frac{y_{j}{\varvec{z}}_{j}}{\pi _{j}}\). Practically speaking, the resulting estimator of the whole population total turns out to be

$$\begin{aligned} {\hat{T}}_{Y(dcal)}={\hat{T}}_{Y(B)cal}+\hat{{\varvec{d}}}_{R}^{t}({\varvec{T}}_{Z}-\hat{{\varvec{T}}}_{Z(B)})=\hat{{\varvec{b}}}_{R}^{t}{\varvec{T}}_{X(B)}+\hat{{\varvec{d}}}_{R}^{t}({\varvec{T}}_{Z}-\hat{{\varvec{T}}}_{Z(B)}) \end{aligned}$$
(5)

With the double calibration estimator, the information provided by \({\varvec{X}}\) and \({\varvec{Z}}\) is exploited to handle both nonresponses and population under-coverage.
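The two steps combined in Eq. (5) can be sketched as follows. This is a minimal illustration under the notation of Sects. 2 and 3, not the authors' implementation; the function and argument names are hypothetical. Note that \(\hat{{\varvec{b}}}_{R}\) and \(\hat{{\varvec{d}}}_{R}\) use respondents only, whereas \(\hat{{\varvec{T}}}_{Z(B)}\) uses the whole sample S, since the \({\varvec{z}}_{j}\)s are known for nonrespondents too.

```python
import numpy as np

def wls_coef(y, M, pi):
    """Solve (sum m m^t / pi) beta = sum y m / pi for beta."""
    d = 1.0 / pi
    return np.linalg.solve((M * d[:, None]).T @ M, M.T @ (d * y))

def double_calibration(y_r, X_r, Z_r, pi_r, Z_s, pi_s, T_X_B, T_Z):
    """Double-calibration estimate of T_Y following Eq. (5).

    Arrays with suffix _r refer to the respondents R, those with
    suffix _s to the whole sample S. X_r, Z_r and Z_s carry a
    leading column of ones.
    """
    b_hat = wls_coef(y_r, X_r, pi_r)     # nonresponse calibration (Y on X)
    d_hat = wls_coef(y_r, Z_r, pi_r)     # under-coverage calibration (Y on Z)
    T_Z_B_hat = Z_s.T @ (1.0 / pi_s)     # HT estimate of T_Z(B) from S
    return b_hat @ T_X_B + d_hat @ (T_Z - T_Z_B_hat)
```

A simple sanity check of the sketch: if Y is an exact linear function of the auxiliaries and the sample is a census of \(U_{B}\), the function returns \(T_{Y}\) up to rounding error.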

4 Statistical properties of the double calibration estimator

Denote by \(U_{B(R)}\) the stratum of respondent units in the sub-population \(U_{B}\) and by \(U_{B(NR)}\) the stratum of nonrespondent units. As suggested by Fattorini et al. (2013), introduce a dummy variable as \(r_{j}=1\) if \(j\in U_{B(R)}\) and \(r_{j}=0\) if \(j\in U_{B(NR)}\). Using the indicators \(r_{j}\), the quantities \(\hat{{\varvec{A}}}_{R}\), \(\hat{{\varvec{a}}}_{R}\), \(\hat{{\varvec{C}}}_{R}\) and \(\hat{{\varvec{c}}}_{R}\) can be rewritten as \(\hat{{\varvec{A}}}_{R}=\sum _{j\in S}\frac{r_{j}{\varvec{x}}_{j}{\varvec{x}}_{j}^{t}}{\pi _{j}}\), \(\hat{{\varvec{a}}}_{R}=\sum _{j\in S}\frac{r_{j}y_{j}{\varvec{x}}_{j}}{\pi _{j}}\), \(\hat{{\varvec{C}}}_{R}=\sum _{j\in S}\frac{r_{j}{\varvec{z}}_{j}{\varvec{z}}_{j}^{t}}{\pi _{j}}\) and \(\hat{{\varvec{c}}}_{R}=\sum _{j\in S}\frac{r_{j}y_{j}{\varvec{z}}_{j}}{\pi _{j}}\). Therefore, the previous matrices and vectors as well as the double calibration estimator \({\hat{T}}_{Y(dcal)}\) depend only on the selection of the sample S, while nonresponses are accounted for in the \(r_{j}\)s, which are a fixed characteristic of the population.

It is worth noting that in this perspective, \(\hat{{\varvec{A}}}_{R}\), \(\hat{{\varvec{a}}}_{R}\), \(\hat{{\varvec{C}}}_{R}\), \(\hat{{\varvec{c}}}_{R}\) and \(\hat{{\varvec{T}}}_{Z(B)}\) are HT estimators of \({\varvec{A}}_{R}=\sum _{j\in U_{B}}r_{j}{\varvec{x}}_{j}{\varvec{x}}_{j}^{t}=\sum _{j\in U_{B(R)}}{\varvec{x}}_{j}{\varvec{x}}_{j}^{t}\), \({\varvec{a}}_{R}=\sum _{j\in U_{B}}r_{j}y_{j}{\varvec{x}}_{j}=\sum _{j\in U_{B(R)}}y_{j}{\varvec{x}}_{j}\), \({\varvec{C}}_{R}=\sum _{j\in U_{B}}r_{j}{\varvec{z}}_{j}{\varvec{z}}_{j}^{t}=\sum _{j\in U_{B(R)}}{\varvec{z}}_{j}{\varvec{z}}_{j}^{t}\), \({\varvec{c}}_{R}=\sum _{j\in U_{B}}r_{j}y_{j}{\varvec{z}}_{j}=\sum _{j\in U_{B(R)}}y_{j}{\varvec{z}}_{j}\) and of \({\varvec{T}}_{Z(B)}\), respectively. Therefore, because \({\hat{T}}_{Y(dcal)}\) is differentiable with respect to \(\hat{{\varvec{A}}}_{R}\), \(\hat{{\varvec{a}}}_{R}\), \(\hat{{\varvec{C}}}_{R}\), \(\hat{{\varvec{c}}}_{R}\) and \(\hat{{\varvec{T}}}_{Z(B)}\), it can be approximated up to the first term by a Taylor series around the true population counterparts \({\varvec{A}}_{R}\), \({\varvec{a}}_{R}\), \({\varvec{C}}_{R}\), \({\varvec{c}}_{R}\) and \({\varvec{T}}_{Z(B)}\). The equation of the first-order Taylor series approximation of \({\hat{T}}_{Y(dcal)}\) is derived in Appendix A.2.

4.1 Approximate expectation

From the first-order Taylor series approximation of \({\hat{T}}_{Y(dcal)}\) it immediately follows that

$$\begin{aligned} AE({\hat{T}}_{Y(dcal)})={\varvec{b}}_{R}^{t}{\varvec{T}}_{X(B)}+{\varvec{d}}_{R}^{t}({\varvec{T}}_{Z}-{\varvec{T}}_{Z(B)}) \end{aligned}$$
(6)

where \({\varvec{b}}_{R}={\varvec{A}}_{R}^{-1}{\varvec{a}}_{R}\) is the least-square coefficient vector of the regression of Y vs \({\varvec{X}}\) performed on the respondent stratum \(U_{B(R)}\) and \({\varvec{d}}_{R}={\varvec{C}}_{R}^{-1}{\varvec{c}}_{R}\) is the least-square coefficient vector of the regression of Y vs \({\varvec{Z}}\) performed in the same stratum. Starting from equation (6), some algebra shown in Appendix A.3 proves that the double calibration estimator is unbiased up to the first-order approximation if:

  1. the linear relationship between Y and \({\varvec{X}}\) is similar in the respondent and nonrespondent strata of \(U_{B}\), i.e. \({\varvec{b}}_{R}\approx {\varvec{b}}_{NR}\), where \({\varvec{b}}_{NR}\) is the least-square coefficient vector of the regression of Y vs \({\varvec{X}}\) performed on the nonrespondent stratum \(U_{B(NR)}\);

  2. the linear relationship between Y and \({\varvec{Z}}\) is similar in the respondent stratum and in the whole sub-population \(U_{B}\), i.e. \({\varvec{d}}_{R}\approx {\varvec{d}}_{B}\), where \({\varvec{d}}_{B}\) is the least-square coefficient vector of the regression of Y vs \({\varvec{Z}}\) performed on the whole sub-population \(U_{B}\);

  3. the linear relationship between Y and \({\varvec{Z}}\) is similar in the two sub-populations \(U_{B}\) and \(U-U_{B}\), i.e. \({\varvec{d}}_{B}\approx {\varvec{d}}_{NB}\), where \({\varvec{d}}_{NB}\) is the least-square coefficient vector of the regression of Y vs \({\varvec{Z}}\) performed on the whole sub-population \(U-U_{B}\).

It is worth noting that the approximate expectation in Eq. (6) does not depend on the design (e.g., first and second order inclusion probabilities), but only on the population characteristics. Therefore, under conditions 1–3, approximate design-unbiasedness holds irrespective of the sampling design adopted.
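When the whole population is known, as in a simulation, Eq. (6) can be evaluated directly. The sketch below computes the first-order relative bias \((AE({\hat{T}}_{Y(dcal)})-T_{Y})/T_{Y}\) from population-level regressions on the respondent stratum. The function and argument names are hypothetical, and the code is only an illustration of the formula, not the authors' procedure.

```python
import numpy as np

def ols(y, M):
    """Unweighted least-squares coefficients of y on the columns of M."""
    return np.linalg.solve(M.T @ M, M.T @ y)

def approx_relative_bias(y, X_B, Z, in_B, resp_B):
    """First-order relative bias implied by Eq. (6).

    y      : (N,) survey values for the whole population U
    X_B    : (N_B, K) X values for U_B, leading column of ones
    Z      : (N, M) Z values for U, leading column of ones
    in_B   : (N,) boolean mask selecting U_B within U
    resp_B : (N_B,) boolean mask of the respondent stratum within U_B
    """
    y_B, Z_B = y[in_B], Z[in_B]
    b_R = ols(y_B[resp_B], X_B[resp_B])   # Y vs X on U_B(R)
    d_R = ols(y_B[resp_B], Z_B[resp_B])   # Y vs Z on U_B(R)
    AE = b_R @ X_B.sum(axis=0) + d_R @ (Z.sum(axis=0) - Z_B.sum(axis=0))
    return (AE - y.sum()) / y.sum()
```

No inclusion probabilities enter the computation, which mirrors the remark that the approximate expectation depends only on population characteristics.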

4.2 Approximate variance and variance estimation

From equation (A.3) of Appendix A.2, the first-order Taylor series approximation of \({\hat{T}}_{Y(dcal)}\) is rewritten as a translation of an HT estimator, in the sense that

$$\begin{aligned} {\hat{T}}_{Y(dcal)}=\mathrm {const}+\sum _{j\in S}\frac{u_{j}}{\pi _{j}} \end{aligned}$$

where

$$\begin{aligned}u_{j}&=r_{j}\left( y_{j}{\varvec{x}}_{j}^{t}-{\varvec{a}}_{R}^{t}{\varvec{A}}_{R}^{-1}{\varvec{x}}_{j}{\varvec{x}}_{j}^{t}\right) {\varvec{A}}_{R}^{-1}{\varvec{T}}_{X(B)}+\\&\quad +r_{j}\left( y_{j}{\varvec{z}}_{j}^{t}-{\varvec{c}}_{R}^{t}{\varvec{C}}_{R}^{-1}{\varvec{z}}_{j}{\varvec{z}}_{j}^{t}\right) {\varvec{C}}_{R}^{-1}\left( {\varvec{T}}_{Z}-{\varvec{T}}_{Z(B)}\right) -{\varvec{c}}_{R}^{t}{\varvec{C}}_{R}^{-1}{\varvec{z}}_{j},j\in U_{B} \end{aligned}$$

are the influence values (e.g. Davison and Hinkley 1997).

Therefore, the approximate variance of \({\hat{T}}_{Y(dcal)}\) turns out to be (e.g. Särndal et al. 1992, p. 175)

$$\begin{aligned} AV\left( {\hat{T}}_{Y(dcal)}\right) =\sum _{h>j\in U_{B}}(\pi _{j}\pi _{h}-\pi _{jh})\left( \frac{u_{j}}{\pi _{j}}-\frac{u_{h}}{\pi _{h}}\right) ^{2} \end{aligned}$$
(7)

On the basis of Eq. (7), the well-known Sen–Yates–Grundy (SYG) variance estimator is given by

$$\begin{aligned} {\hat{V}}_{SYG}^{2}=\sum _{h>j\in S}\frac{\pi _{j}\pi _{h}-\pi _{jh}}{\pi _{jh}}\left( \frac{{\hat{u}}_{j}}{\pi _{j}}-\frac{{\hat{u}}_{h}}{\pi _{h}}\right) ^{2} \end{aligned}$$
(8)

where

$$\begin{aligned}&{\hat{u}}_{j}=r_{j}\left( y_{j}{\varvec{x}}_{j}^{t}-\hat{{\varvec{a}}}_{R}^{t}\hat{{\varvec{A}}}_{R}^{-1}{\varvec{x}}_{j}{\varvec{x}}_{j}^{t}\right) \hat{{\varvec{A}}}_{R}^{-1}{\varvec{T}}_{X(B)}+\\&\quad +r_{j}\left( y_{j}{\varvec{z}}_{j}^{t}-\hat{{\varvec{c}}}_{R}^{t}\hat{{\varvec{C}}}_{R}^{-1}{\varvec{z}}_{j}{\varvec{z}}_{j}^{t}\right) \hat{{\varvec{C}}}_{R}^{-1}\left( {\varvec{T}}_{Z}-\hat{{\varvec{T}}}_{Z(B)}\right) -\hat{{\varvec{c}}}_{R}^{t}\hat{{\varvec{C}}}_{R}^{-1}{\varvec{z}}_{j},\quad j\in S \end{aligned}$$

are the empirical influence values computed for each sample unit.
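The computation of the empirical influence values and of the variance estimate can be sketched as follows. The sketch uses the identity \(y_{j}{\varvec{x}}_{j}^{t}-\hat{{\varvec{a}}}_{R}^{t}\hat{{\varvec{A}}}_{R}^{-1}{\varvec{x}}_{j}{\varvec{x}}_{j}^{t}=(y_{j}-\hat{{\varvec{b}}}_{R}^{t}{\varvec{x}}_{j}){\varvec{x}}_{j}^{t}\) (and its analogue for \({\varvec{Z}}\)), and the Sen–Yates–Grundy form in which each pair of sample units is weighted by \((\pi _{j}\pi _{h}-\pi _{jh})/\pi _{jh}\). Function and argument names are hypothetical; the matrix `pi2` of joint inclusion probabilities is assumed to be supplied by the user.

```python
import numpy as np

def empirical_influence(y, X, Z, pi, r, T_X_B, T_Z):
    """Empirical influence values u_hat_j for every unit of the sample S.

    y    : (n,) survey values (entries of nonrespondents are ignored)
    X, Z : (n, K) and (n, M) auxiliaries with a leading column of ones
    pi   : (n,) first-order inclusion probabilities
    r    : (n,) boolean response indicator
    """
    w = r / pi                                     # r_j / pi_j
    y0 = np.where(r, y, 0.0)                       # mask unobserved y values
    A = (X * w[:, None]).T @ X
    C = (Z * w[:, None]).T @ Z
    b = np.linalg.solve(A, X.T @ (w * y0))
    d = np.linalg.solve(C, Z.T @ (w * y0))
    T_Z_B_hat = Z.T @ (1.0 / pi)                   # HT estimate of T_Z(B)
    gx = X @ np.linalg.solve(A, T_X_B)             # x_j^t A^{-1} T_X(B)
    gz = Z @ np.linalg.solve(C, T_Z - T_Z_B_hat)   # z_j^t C^{-1} (T_Z - T_Z(B)_hat)
    return r * (y0 - X @ b) * gx + r * (y0 - Z @ d) * gz - Z @ d

def syg_variance(u, pi, pi2):
    """Sen-Yates-Grundy estimate; pi2[j, h] is the joint inclusion probability."""
    j, h = np.triu_indices(len(u), k=1)            # all pairs with h > j
    t = u / pi
    coef = (pi[j] * pi[h] - pi2[j, h]) / pi2[j, h]
    return np.sum(coef * (t[j] - t[h]) ** 2)
```

The pairwise sum costs \(O(n^{2})\) operations, which is acceptable for moderate sample sizes; for specific designs a closed form is usually preferable.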

5 Simulation study

Simulations were used to check the performance of the proposed estimator. We considered a population U of \(N=10000\) units and a sub-population \(U_{B}\subset U\) of \(N_{B}=7500\) units. We assumed that the values \(z_{j}\) of an auxiliary variable Z were available for each \(j\in U\) and were adopted for sample under-coverage calibration. Moreover, we assumed that the values \(x_{j}\) of an auxiliary variable X achieved from additional information were available for each \(j\in U_{B}\) and were adopted in nonresponse calibration. We also assumed that the sub-population \(U_{B}\) was partitioned into respondent and non-respondent strata \(U_{B(R)}\) and \(U_{B(NR)}\), respectively. Three sizes were assumed for the respondent stratum, \(N_{B(R)}=2250;4500;6750\) units corresponding to response rates of 30%, 60% and 90%, respectively. Moreover, variables were generated respecting some criteria, in order to explore several scenarios, as explained below.

5.1 Unbiasedness of \({\hat{T}}_{Y(dcal)}\)

The auxiliary variables X and Z and the survey variable Y were generated from a tri-variate normal distribution. The expectations and variances of X and Z were assumed to be equal to 1, while the expectation and variance of Y were assumed to be equal to 2 and 4, respectively. These setups ensured that each variable had a coefficient of variation of 1. The correlation between X and Y was set at \(\rho _{XY}=0.3;0.6;0.9\); similarly, the correlation between Z and Y was set at \(\rho _{ZY}=0.3;0.6;0.9\), giving rise to nine scenarios. The correlation between X and Z was set at the minimum possible value \(\rho _{XZ}\) such that the resulting variance-covariance matrix is positive-definite. Once the nine variance-covariance matrices were established, the 10000 values of Z and Y and the 7500 values of X were generated using the triangular square root of the variance-covariance matrix (e.g. Johnson 2013, Sect. 4.1). Subsequently, the first \(N_{B(R)}\) units of \(U_{B}\) were assumed to be the respondent portion of the population, ensuring in this way compliance with conditions 1–3 and hence the approximate unbiasedness of the double calibration estimator. Simple random sampling without replacement (SRSWOR) was the sampling scheme adopted to select samples of sizes \(n=75;250;500\) from \(U_{B}\). If the same sampling efforts were adopted to select samples from the whole population U and in the absence of nonresponses, then the HT estimator of the total would give rise to the relative root mean squared error

$$\begin{aligned} RRMSE_{SRSWOR}=\sqrt{\frac{N-n}{Nn}}CV_{Y} \end{aligned}$$
(9)

where \(CV_{Y}\) is the coefficient of variation of the survey variable. Equation (9) was taken as the benchmark for the performance of the double calibration estimator.
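For orientation, the benchmark in Eq. (9) can be evaluated for the settings of this study (\(N=10000\), \(CV_{Y}=1\)). The small helper below is only an illustrative sketch with a hypothetical function name.

```python
import math

def rrmse_srswor(N, n, cv):
    """Eq. (9): RRMSE of the HT total under SRSWOR from a population of N units."""
    return math.sqrt((N - n) / (N * n)) * cv

# Benchmark values, in percent, for the three sample sizes of the study
for n in (75, 250, 500):
    print(n, round(100 * rrmse_srswor(10000, n, 1.0), 2))
```

With these inputs the benchmark is roughly 11.5%, 6.2% and 4.4% for \(n=75\), 250 and 500, respectively.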

For each combination of respondent sizes \(N_{B(R)}\), correlations between X and Y, correlations between Z and Y, and sample sizes n, 10000 random samples were selected by means of SRSWOR from \(U_{B}\), and the double calibration estimates \({\hat{T}}_{i}\) \(\left( i=1,...,10000\right)\) were computed using equation (5). Moreover, from each simulated sample, the variance estimates \({\hat{V}}_{i}^{2}\) \(\left( i=1,...,10000\right)\) were also computed using equation (8), which under SRSWOR is reduced to

$$\begin{aligned} {\hat{V}}_{SYG}^{2}=N_{B}\left( N_{B}-n\right) \frac{s_{{\hat{u}}}^{2}}{n} \end{aligned}$$
(10)

where \(s_{{\hat{u}}}^{2}\) is the sample variance of the \({\hat{u}}_{j}\)s. Once the variance estimates were computed from (10), the RRMSE estimates \(\widehat{RRMSE}_{i}=\frac{{\hat{V}}_{i}}{{\hat{T}}_{i}}\) were obtained together with the confidence intervals at the nominal level of 0.95, \({\hat{T}}_{i}\pm 2{\hat{V}}_{i}\). Therefore, from the resulting Monte Carlo distributions of these quantities, the expectations \(E({\hat{T}}_{Y(dcal)})=\frac{1}{10000}\sum _{i=1}^{10000}{\hat{T}}_{i}\) and mean squared errors \(MSE({\hat{T}}_{Y(dcal)})=\frac{1}{10000}\sum _{i=1}^{10000}({\hat{T}}_{i}-T_{Y})^{2}\) of the double calibration estimator were empirically derived, from which the relative bias \(RB=\frac{E({\hat{T}}_{Y(dcal)})-T_{Y}}{T_{Y}}\) and the relative root mean squared error \(RRMSE=\frac{\sqrt{MSE({\hat{T}}_{Y(dcal)})}}{T_{Y}}\) were obtained. The expectation of the RRMSE estimator \(ERRMSEE=\frac{1}{10000}\sum _{i=1}^{10000}\widehat{RRMSE}_{i}\) and the coverage of the 0.95 confidence interval \(COV95=\frac{1}{10000}\sum _{i=1}^{10000}I({\hat{T}}_{i}-2{\hat{V}}_{i}\le T_{Y}\le {\hat{T}}_{i}+2{\hat{V}}_{i})\) were also computed. The most relevant results of the Monte Carlo simulations are shown in Tables 1 and 2, while the remaining simulation results are shown in Tables B.1–B.7 of Appendix B.
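The performance indices defined above can be computed from the Monte Carlo replicates as in the following sketch. The function name and the dictionary keys are hypothetical; `V_hat` is assumed to hold the square roots of the variance estimates from Eq. (10).

```python
import numpy as np

def performance(T_hat, V_hat, T_Y):
    """Monte Carlo summaries from replicated estimates.

    T_hat : (B,) double-calibration estimates, one per simulated sample
    V_hat : (B,) estimated standard errors (square roots of Eq. (10))
    T_Y   : true population total
    """
    rb = (T_hat.mean() - T_Y) / T_Y                       # relative bias
    rrmse = np.sqrt(np.mean((T_hat - T_Y) ** 2)) / T_Y    # relative root MSE
    errmsee = np.mean(V_hat / T_hat)                      # mean of RRMSE estimates
    cov95 = np.mean(np.abs(T_hat - T_Y) <= 2 * V_hat)     # coverage of T_hat +/- 2 V_hat
    return {"RB": rb, "RRMSE": rrmse, "ERRMSEE": errmsee, "COV95": cov95}
```

Multiplying each entry by 100 gives the percentage values of the kind reported in the tables.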

Table 1 Percentage values of RB, ARRMSE, RRMSE, ERRMSEE, COV95 and first order approximation of relative bias (ARB) achieved from a population of 10000 units, a sampled sub-population of 7500 units with 2250, 4500 and 6750 respondent units, sample sizes \(n=75;250;500\) selected by means of simple random sampling without replacement
Table 2 Percentage values of RB, ARRMSE, RRMSE, ERRMSEE, COV95 and first order approximation of relative bias (ARB) achieved from a population of 10000 units, a sampled sub-population of 7500 units with 2250, 4500 and 6750 respondent units, sample sizes \(n=75;250;500\) selected by means of simple random sampling without replacement

The simulation results suggest the following remarks. The first-order approximations of relative bias and RRMSE are very accurate in most cases. The discrepancies between the approximations and the empirical values obtained from the Monte Carlo distributions are usually smaller than one percentage point and become smaller with high levels of response and correlation. The theoretical findings for the bias reduction, shown in Sect. 4.1, are fully confirmed by the simulation results. The artificial populations considered in the study meet unbiasedness conditions 1–3. Indeed, the empirical values of the relative bias are negligible (invariably about one percentage point) irrespective of the level of correlation of the survey variable with the auxiliaries. While the level of correlation does not affect the bias reduction, it has a relevant impact on the precision. When correlations are strong the double calibration estimator proves efficient, reaching values of RRMSE that are even smaller than those achieved by the HT estimator with the same sampling effort and in the absence of nonresponse and under-coverage. Obviously, precision increases with the level of response.

The RRMSE estimator obtained from the variance estimator (8) is approximately unbiased and also provides confidence intervals with coverage close to the nominal level of 95% in most cases. Because the estimator (8) actually estimates the approximate variance, some exceptions occur when the variance approximations (and subsequently the RRMSE) turn out to be smaller than the true values.

5.2 Robustness of \({\hat{T}}_{Y(dcal)}\) when conditions 1.-3. do not hold

Additional simulations were performed to gain insight into the robustness of the proposed estimator when the approximate unbiasedness conditions 1–3 were moderately violated in such a way that an amount of bias was invariably involved. Indeed, as stated by Särndal and Lundström (2005, p. 98), when an estimator is biased, its bias should be the main concern, given that “if an estimator is greatly biased, it is poor consolation that its variance is low”. Hence, here too, if a massive bias were present it would heavily impact on RRMSE, deteriorating the estimator performance. To investigate this issue, the linear relationship between Y and X was assumed to be different in the respondent and nonrespondent strata of \(U_{B}\), as in the following scheme:

  (a) when the correlation between Y and X in the respondent stratum was equal to 0.3, the same correlation in the nonrespondent stratum was decreased to 0.2 or increased to 0.4;

  (b) when the correlation between Y and X in the respondent stratum was equal to 0.6, the same correlation in the nonrespondent stratum was decreased to 0.5 or increased to 0.7;

  (c) when the correlation between Y and X in the respondent stratum was equal to 0.9, the same correlation in the nonrespondent stratum was decreased to 0.8 or increased to 0.95.

Similarly, the linear relationship between Y and Z was assumed to be different in the sub-populations \(U_{B}\) and \(U-U_{B}\), following the scheme:

  (a) when the correlation between Y and Z in the sub-population \(U-U_{B}\) was equal to 0.3, the same correlation in the sub-population \(U_{B}\) was decreased to 0.2 or increased to 0.4;

  (b) when the correlation between Y and Z in the sub-population \(U-U_{B}\) was equal to 0.6, the same correlation in the sub-population \(U_{B}\) was decreased to 0.5 or increased to 0.7;

  (c) when the correlation between Y and Z in the sub-population \(U-U_{B}\) was equal to 0.9, the same correlation in the sub-population \(U_{B}\) was decreased to 0.8 or increased to 0.95.

As explained, each decrease or increase in the correlation between Y and X was paired with the corresponding decrease or increase in the correlation between Y and Z, giving rise to a total of twelve scenarios: six representing different (and weaker) relationships of Y with X and Z in the nonrespondent stratum and sub-population \(U_{B}\), respectively; the other six representing different (and stronger) relationships. As in the previous simulation experiment, expectations and variances of X and Z were assumed to be equal to 1 and expectation and variance of Y were assumed to be equal to 2 and 4, respectively. The correlation between X and Z was set at the minimum possible value ensuring a positive-definite variance-covariance matrix. Once the twelve variance-covariance matrices were established, simulations proceeded as described above, with the same performance indices computed on the resulting Monte Carlo distributions. Some results are set out in Tables 3 and 4, while the remaining simulation results are given in Tables C.1–C.10 in Appendix C.
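One way to build a variance-covariance matrix of the kind described and to generate values from its triangular square root is sketched below for a single set of correlations. For a \(3\times 3\) correlation matrix, positive-definiteness requires \(\rho _{XZ}\) to lie strictly between \(\rho _{XY}\rho _{ZY}\pm \sqrt{(1-\rho _{XY}^{2})(1-\rho _{ZY}^{2})}\), so the sketch takes the lower bound plus a small margin. The margin `eps`, the function name and the use of NumPy are assumptions made for illustration and are not taken from the original study.

```python
import numpy as np

def simulate_population(rho_xy, rho_zy, N, rng, eps=1e-3):
    """Draw N values of (X, Z, Y) from a tri-variate normal distribution.

    eps is a hypothetical margin that keeps the matrix strictly
    positive-definite; the source only states that rho_xz is minimal.
    """
    # Smallest admissible rho_xz given the other two correlations
    rho_xz = rho_xy * rho_zy - np.sqrt((1 - rho_xy**2) * (1 - rho_zy**2)) + eps
    sd = np.array([1.0, 1.0, 2.0])              # standard deviations of X, Z, Y
    corr = np.array([[1.0, rho_xz, rho_xy],
                     [rho_xz, 1.0, rho_zy],
                     [rho_xy, rho_zy, 1.0]])
    sigma = corr * np.outer(sd, sd)             # variance-covariance matrix
    L = np.linalg.cholesky(sigma)               # lower-triangular square root
    mean = np.array([1.0, 1.0, 2.0])            # expectations of X, Z, Y
    draws = mean + rng.standard_normal((N, 3)) @ L.T
    return draws, sigma
```

A call such as `simulate_population(0.9, 0.9, 10000, np.random.default_rng(0))` returns the generated values column by column (X, Z, Y) together with the matrix used. Scenarios with different correlations in different strata would need one such call per stratum.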

Table 3 Percentage values of RB, ARRMSE, RRMSE, ERRMSEE, COV95 and first order approximation of relative bias (ARB) achieved from a population of 10000 units, a sampled sub-population of 7500 units with 2250, 4500 and 6750 respondent units, sample sizes \(n=75;250;500\) selected by means of simple random sampling without replacement
Table 4 Percentage values of RB, ARRMSE, RRMSE, ERRMSEE, COV95 and first order approximation of relative bias (ARB) achieved from a population of 10000 units, a sampled sub-population of 7500 units with 2250, 4500 and 6750 respondent units, sample sizes \(n=75;250;500\) selected by means of simple random sampling without replacement

The simulation results suggest the following remarks. The first order approximations of relative bias and RRMSE remain accurate, with discrepancies usually smaller than one percentage point. Even under different relationships of Y with X and Z, the relative bias remains moderate (invariably below 1.6 percentage points). The moderate increases in bias also entail moderate increases in RRMSE and approximately unbiased RRMSE estimation, with confidence intervals having coverages close to their nominal value. These results show a promising robustness of the estimator in the presence of moderate differences in the relationships of Y with X and Z in respondent and nonrespondent strata and sub-populations, respectively.

6 An application to the European Union Statistics on Income and Living Conditions survey

National statistical institutes periodically collect data on living conditions through household surveys. The information collected covers several aspects of living conditions, such as, among others, features of the dwelling and expenses incurred to manage it, material deprivation and welfare indicators, individual and household incomes. The European Union Statistics on Income and Living Conditions survey was created from the previous experience of the European Community Household Panel (ECHP). The survey was launched in 2003 in seven countries (Belgium, Denmark, Greece, Ireland, Luxembourg, Austria and Norway), and was extended to all the 28 EU member countries, plus Switzerland, Norway, Iceland, FYROM and Serbia. It is conducted yearly and gathers information about European households. Some rules on how to conduct the survey are established by Eurostat, such as, among others, the frequency and the period to which questions must refer, and the aggregation level of some longitudinal and cross-sectional estimates. Other aspects of the survey are set independently by each country, such as, for instance, the sampling design and the sample size, leading to several discrepancies between countries (see, among others, Goedemé 2013; Lohmann 2011).

Moreover, the population coverage of surveys like these is incomplete. Individuals who do not live in households, as well as the homeless, the physically or mentally unable, geographically mobile and displaced individuals are not always represented in national-level data. It is estimated that worldwide some 300 to 350 million people may be missing from survey sampling frames, at least 45% omitted altogether by design, or because they are likely to be undercounted (Carr-Hill 2013). The European Union Statistics on Income and Living Conditions survey, which involves approximately 300,000 households across Europe, is no exception: it is affected by under-coverage, and the samples selected are affected by nonresponse. We propose an example of the use of the double-calibration estimator in the 2013 wave of the European Union Statistics on Income and Living Conditions survey in Denmark (hereafter DK-SILC). Data on respondents are freely available from the Eurostat website, while the further information required was taken from Statistics Denmark.

The reference population U consists of households residing in Denmark, except for those habitually living in a foreign country or in cohabitations such as orphanages, religious institutes, etc. According to the Statistics Denmark website, the household population size in 2013 was equal to 2891119 units. The DK-SILC survey is based on a simple random sampling without replacement design, so that inclusion probabilities are equal for all units in the population. The sampling unit is the individual person, and the household is defined as the one of which the selected person is a member. This is because a household in Denmark is defined as comprising one or more individuals. Households eligible for DK-SILC are those in which the sampling unit is a person aged 16 or over, living alone or with others in private dwellings, related through marriage, parentage, affinity or other relationships. Hence, the size of the eligible population \(U_{B}\) of Danish households is equal to 2416597, leading to an under-coverage rate of about 0.16 (i.e. 16%).
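The under-coverage rate follows directly from the two household counts reported above. A quick check (variable names are illustrative) gives a proportion of roughly 0.16, i.e. about 16% of the households in U lie outside \(U_{B}\):

```python
# Household counts reported in the text (Statistics Denmark, 2013).
N_U = 2_891_119    # reference population U
N_UB = 2_416_597   # eligible (sampled) sub-population U_B

# Share of the reference population that is not covered by U_B.
not_covered = N_U - N_UB
undercoverage_rate = not_covered / N_U

print(f"households not covered: {not_covered}")
print(f"under-coverage rate: {undercoverage_rate:.4f}")
```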

The 2013 DK-SILC survey featured a nonresponse rate of about 63%: the number of respondents was equal to 5419, against a sample of 14702 households. Micro-data about respondents include a great deal of information, grouped into four sections: Household Register (D), Personal Register (R), Household Data (H) and Personal Data (P). Most of the variables collected are qualitative. To implement the present case study, we use quantitative variables (in euro) referring to the previous survey year (2012), contained in the H-section. Specifically, the tax on income and social contributions (HY140G) is used as the X variable to correct for nonresponse, while the total housing cost (HH070) is used as the Z variable to correct for under-coverage. The variable Y to be estimated is total household disposable income (HY020). Sample data suggest that both auxiliary variables are only weakly correlated with the variable to be estimated (0.38 for X and Y; 0.17 for Z and Y, in the respondent group), revealing an unfavorable situation, worse than all those presented in Sect. 5. However, according to the simulation results, weak relationships between the survey variable and the auxiliary variables should reduce precision but should not compromise bias reduction. The estimated total household disposable income is equal to 125739.17 million euros, equivalent to an average household disposable income on U of 43491.52 euros. Since the sampling design is SRSWOR, the variance estimate is computed as in (10) and the RRMSE estimate is 0.05.
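The nonresponse rate and the per-household average quoted above can be reproduced from the reported figures. The short check below uses only numbers stated in the text; the variable names are illustrative.

```python
# Sample and respondent counts reported for the 2013 DK-SILC wave.
n_sample = 14_702
n_respondents = 5_419
nonresponse_rate = 1 - n_respondents / n_sample

# Estimated total of Y (million euros) and size of the reference population U.
total_income_meur = 125_739.17
N_U = 2_891_119
mean_income_eur = total_income_meur * 1e6 / N_U

print(f"nonresponse rate: {nonresponse_rate:.3f}")
print(f"average household disposable income: {mean_income_eur:.2f} euros")
```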

The results obtained should be understood as an illustration and do not claim to be official estimates. Clearly, the quality of the results relies on the quality of the available data. Nevertheless, the results are in line with those disseminated by Statistics Denmark. In fact, the average disposable income for all households (population U) in 2012 is 329803 Danish kroner, corresponding to approximately 44221 euros (at the average exchange rate in 2013).
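To make the comparison concrete, the gap between the estimate and the published figure can be computed from the numbers quoted in the text. In the check below the exchange rate is not an external input: it is merely the rate implied by the two published amounts, and the variable names are illustrative.

```python
estimate_eur = 43_491.52   # estimated average household disposable income on U
official_dkk = 329_803     # Statistics Denmark figure for 2012, Danish kroner
official_eur = 44_221      # its euro equivalent as reported in the text

# Exchange rate implied by the two published amounts (DKK per euro).
implied_rate = official_dkk / official_eur

# Relative difference between the estimate and the published figure.
relative_gap = (estimate_eur - official_eur) / official_eur

print(f"implied rate: {implied_rate:.3f} DKK per euro")
print(f"relative gap: {100 * relative_gap:.2f}%")
```

On these figures the estimate falls roughly 1.6 percent below the published average.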

7 Final remarks

The proposed double-calibration estimator can be adopted in socio-economic surveys to jointly account for nonresponse and under-coverage, adopting a two-step calibration. The first calibration, performed to reduce nonresponse bias, requires a set of auxiliary variables whose totals are known for the sampled sub-population and whose values are known for the respondent units in the sample. The second calibration, performed to reduce the bias generated by the cut-off sampling, requires a further set of auxiliary variables whose totals are known for the whole population and whose values are known for all the units in the sample. In this setting, no frame is necessary for the non-sampled sub-population. If the relationships of the survey variable with the two sets of auxiliaries are approximately similar in sampled and non-sampled sub-populations as well as in respondent and nonrespondent strata (conditions 1–3), the proposed estimator proves to be effective for reducing bias, and is also efficient when high-quality auxiliary variables, well correlated with the variable of interest, are available. Interestingly, bias remains negligible and precision remains satisfactory even after moderate changes in the relationship of the variable of interest with the auxiliary variables in the respondent and nonrespondent strata and subpopulations. Socio-economic surveys may benefit from the application of the double-calibration estimator. It leads to results very close to those disseminated by national institutes of statistics and typically achieved by integrating several data sources, with far less effort in terms of data collection and integration.
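Each of the two steps amounts to adjusting weights so that the weighted sample total of an auxiliary variable reproduces its known total. A minimal sketch of one such step is given below, assuming linear (chi-square distance) calibration with a single auxiliary variable. The calibration function, the helper name `linear_calibration` and the toy data are assumptions made for illustration and are not taken from the estimator defined in the earlier sections.

```python
def linear_calibration(design_weights, x, total_x):
    """Return weights w_k = d_k * (1 + lam * x_k) such that
    sum(w_k * x_k) equals the known total total_x."""
    weighted_x = sum(d * xk for d, xk in zip(design_weights, x))
    weighted_x2 = sum(d * xk * xk for d, xk in zip(design_weights, x))
    # Lagrange multiplier solving the single calibration equation.
    lam = (total_x - weighted_x) / weighted_x2
    return [d * (1 + lam * xk) for d, xk in zip(design_weights, x)]

# Hypothetical toy data: five respondents with equal design weights.
d = [20.0] * 5
x = [1.0, 2.0, 3.0, 4.0, 5.0]       # auxiliary variable
y = [10.0, 19.0, 32.0, 41.0, 48.0]  # survey variable
known_total_x = 330.0               # assumed known auxiliary total

w = linear_calibration(d, x, known_total_x)
calibrated_total_x = sum(wk * xk for wk, xk in zip(w, x))
calibrated_total_y = sum(wk * yk for wk, yk in zip(w, y))
```

In the scheme summarized above, a step of this kind would use X with totals on the sampled sub-population for the nonresponse correction and Z with totals on the whole population for the under-coverage correction. How the two steps are combined is as defined by the estimator itself, not by this sketch.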