Calibration estimation in dual-frame surveys

Statistical Methods & Applications

Abstract

Survey statisticians make use of auxiliary information to improve estimates. One important example is calibration estimation, which constructs new weights that match benchmark constraints on auxiliary variables while remaining “close” to the design weights. Multiple-frame surveys are increasingly used by statistical agencies and private organizations to reduce sampling costs and/or avoid frame undercoverage errors. Several ways of combining estimates derived from such frames have been proposed elsewhere; in this paper, we extend the calibration paradigm, previously used for single-frame surveys, to estimate the total of a variable of interest in a dual-frame survey. Calibration is a general tool that allows auxiliary information from the two frames to be included. It also incorporates, as special cases, certain dual-frame estimators that have been proposed previously. The theoretical properties of our class of estimators are derived and discussed, and simulation studies are conducted to compare the efficiency of the procedure under different sets of auxiliary variables. Finally, the proposed methodology is applied to real data obtained from the Barometer of Culture of Andalusia survey.

References

  • Arcos A, Molina D, Rueda M, Ranalli MG (2015) Frames2: a package for estimation in dual frame surveys. R J 7(1):52–72

  • Bankier MD (1986) Estimators based on several stratified samples with applications to multiple frame surveys. J Am Stat Assoc 81:1074–1079

  • Berger YG, Muñoz JF, Rancourt E (2009) Variance estimation of survey estimates calibrated on estimated control totals. An application to the extended regression estimator and the regression composite estimator. Comput Stat Data Anal 53(7):2596–2604

  • Chen S, Kim JK (2014) Population empirical likelihood for nonparametric inference in survey sampling. Stat Sin 24:335–355

  • Deville JC (2005) Calibration: past, present and future? Paper presented at the Workshop on “Calibration tools for survey statisticians” Neuchâtel

  • Deville JC, Särndal CE (1992) Calibration estimators in survey sampling. J Am Stat Assoc 87:376–382

  • Deville JC, Särndal CE, Sautory O (1993) Generalized raking procedures in survey sampling. J Am Stat Assoc 88:1013–1020

  • Fuller WA, Burmeister LF (1972) Estimators for samples selected from two overlapping frames. In: Proceedings of social science section of The American Statistical Association

  • Hartley HO (1962) Multiple frame surveys. In: Proceedings of the social statistics section, American Statistical Association, pp 203–206

  • Hidiroglou M (2001) Double sampling. Surv Methodol 27(2):143–154

  • Isaki CT, Fuller WA (1982) Survey design under the regression superpopulation model. J Am Stat Assoc 77(377):89–96

  • Kalton G, Anderson DW (1986) Sampling rare populations. J R Stat Soc Ser A (General) 149:65–82

  • Kott PS (2011) A nearly pseudo-optimal method for keeping calibration weights from falling below unity in the absence of nonresponse or frame errors. Pak J Stat 27(4):391–396

  • Kott PS (2014) Calibration weighting when model and calibration variables can differ. In: Mecatti F, Conti PL, Ranalli MG (eds) Contributions to sampling statistics. Springer, Berlin, pp 1–18

  • Kott PS, Chang T (2010) Using calibration weighting to adjust for nonignorable unit nonresponse. J Am Stat Assoc 105:1265–1275

  • Lohr SL (2009) Multiple-frame surveys. Handbook of statistics 29:71–88

  • Lohr SL, Rao JNK (2000) Inference from dual frame surveys. J Am Stat Assoc 95:271–280

  • Lohr SL, Rao JNK (2006) Estimation in multiple-frame surveys. J Am Stat Assoc 101(475):1019–1030

  • Mecatti F (2007) A single frame multiplicity estimator for multiple frame surveys. Surv Methodol 33(2):151–157

  • Rao JNK, Wu C (2010) Pseudo-empirical likelihood inference for multiple frame surveys. J Am Stat Assoc 105(492):1494–1503

  • Renssen RH, Nieuwenbroek NJ (1997) Aligning estimates for common variables in two or more sample surveys. J Am Stat Assoc 92(437):368–374

  • Särndal CE (2007) The calibration approach in survey theory and practice. Surv Methodol 33(2):99–119

  • Särndal CE, Lundström S (2005) Estimation in surveys with nonresponse. Wiley, New York

  • Singh AC, Mecatti F (2011) Generalized multiplicity-adjusted Horvitz-Thompson estimation as a unified approach to multiple frame surveys. J Off Stat 27(4):1–19

  • Skinner CJ (1991) On the efficiency of raking ratio estimation for multiple frame surveys. J Am Stat Assoc 86:779–784

  • Skinner CJ, Rao JNK (1996) Estimation in dual frame surveys with complex designs. J Am Stat Assoc 91:349–356

  • Tillé Y, Matei A (2006) The R package sampling, a software tool for training in official statistics and survey sampling. In: Proceedings in computational statistics, COMPSTAT’06. Physica-Verlag/Springer, Berlin, pp 1473–1482

  • Wolter K (2003) Introduction to variance estimation. Springer, New York

  • Wu C (2005) Algorithms and R codes for the pseudo empirical likelihood method in survey sampling. Surv Methodol 31(2):239

  • Wu C, Rao JNK (2006) Pseudo-empirical likelihood ratio confidence intervals for complex surveys. Can J Stat 34(3):359–375

Acknowledgments

The authors are grateful to Manuel Trujillo (IESA) for providing data and information about the BACU Survey and to Jean-Claude Deville for useful suggestions on distance metrics in calibration. This study was supported by Ministerio de Educación, Cultura y Deporte (grant MTM2012-35650, Spain), by Consejería de Economía, Innovación, Ciencia y Empleo (grant SEJ2954, Junta de Andalucía, Spain) and under the support of the project PRINSURWEY (grant 2012F42NS8, Italy).

Author information

Corresponding author

Correspondence to M. Giovanna Ranalli.

Appendices

Appendix 1: Other examples of the definition of auxiliary variable vector according to available auxiliary information

1.1 Population totals for group membership indicators are known

Let the population \(\mathcal {U}\) be divided into \(H\) mutually exclusive groups \(\mathcal {U}_h\), for \(h=1,\ldots ,H\), such that \(\bigcup _{h=1}^H \mathcal {U}_h=\mathcal {U}\), and let \(\delta _k(h)\) be the indicator variable that takes value 1 if unit \(k\in \mathcal {U}_h\) and 0 otherwise, for \(k=1,\ldots ,N\) and \(h=1,\ldots ,H\). Then, \(\sum _{k=1}^N\delta _k(h)=N_h\) and \(\sum _{h=1}^{H}N_h=N\). Now, consider the situation in which we know the population totals of these indicator variables for each of the four domains, i.e. \(N_{a,h}=\sum _{k\in a}\delta _k(h)\), \(N_{ab,h}=\sum _{k\in ab}\delta _k(h)\), \(N_{ba,h}=\sum _{k\in ba}\delta _k(h)=N_{ab,h}\), \(N_{b,h}=\sum _{k\in b}\delta _k(h)\), for \(h=1,\ldots ,H\). Note that \(N_{a,h}=\sum _{k\in a}\delta _k(h)=\sum _{k=1}^N\delta _k(a)\delta _k(h)\) and similarly for the other cases. Of course, this type of auxiliary information implies that we also know the three sizes \(N_A\), \(N_B\) and \(N_{ab}\) considered in Sect. 3.1. Indeed, the case of Sect. 3.1 is a special case of the present one.

In this case the vector of auxiliary variables is defined for \(k=1,\ldots ,N\) by

$$\begin{aligned} {\varvec{x}}_k=\{(\delta _k(a)\delta _k(h),\delta _k(ab)\delta _k(h),\delta _k(ba)\delta _k(h),\delta _k(b)\delta _k(h))\}_{h=1,\ldots ,H} \end{aligned}$$

and the vector of known totals is \({\varvec{t}}_x=\{(N_{a,h},\eta N_{ab,h},(1-\eta )N_{ba,h},N_{b,h})\}_{h=1,\ldots ,H}\). As in Sect. 3.1 the minimization problem has an analytic solution irrespective of the distance function employed. This solution is given by

$$\begin{aligned} w_k^{\circ } = \left\{ \begin{array}{l@{\quad }l} d_{Ak}{N_{a,h}}/{\hat{N}_{a,h}} &{} \text {if } k \in \{s_a \cap \mathcal {U}_h\}\\ \eta \, d_{Ak} {N_{ab,h}}/{\hat{N}_{ab,h}} &{} \text {if } k \in \{s_{ab} \cap \mathcal {U}_h\}\\ (1-\eta )\, d_{Bk} {N_{ba,h}}/{\hat{N}_{ba,h}} &{} \text {if } k \in \{s_{ba} \cap \mathcal {U}_h\}\\ d_{Bk}{N_{b,h}}/{\hat{N}_{b,h} } &{} \text {if } k \in \{s_b \cap \mathcal {U}_h\}\\ \end{array} \right. \quad \text{ for }\, h=1,\ldots ,H, \end{aligned}$$
(22)

where \(\hat{N}_{a,h}=\sum _{k\in s_a}d_{Ak}\delta _k(h)\) and similarly for the other size estimators. This is another case of complete post-stratification. The final estimator is more efficient than the Hartley estimator to the extent that groups collect units with a similar value of the variable of interest.
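The closed form in (22) can be illustrated with a minimal Python sketch. It assumes that each sampled unit carries a domain label in {a, ab, ba, b}, a group index h, and its design weight from the frame it was sampled in; the function name, argument names and data layout are hypothetical and not taken from the paper.

```python
import numpy as np

def poststratified_weights(d, domain, group, totals, eta):
    """Sketch of Eq. (22): rescale weights within each domain-by-group cell.

    d      : design weights (d_Ak for units in s_a, s_ab; d_Bk for s_ba, s_b)
    domain : labels "a", "ab", "ba" or "b" for each sampled unit
    group  : group index h for each sampled unit
    totals : dict mapping (domain, h) -> known cell count, e.g. N_{a,h}
    eta    : compositing constant in [0, 1] applied to the overlap domain
    """
    d = np.asarray(d, dtype=float)
    domain = np.asarray(domain)
    group = np.asarray(group)
    # Multipliers of Eq. (22): 1 for a and b, eta for ab, (1 - eta) for ba.
    factor = {"a": 1.0, "ab": eta, "ba": 1.0 - eta, "b": 1.0}
    w = np.zeros_like(d)
    for (dom, h), n_known in totals.items():
        cell = (domain == dom) & (group == h)
        n_hat = d[cell].sum()  # estimated cell size, e.g. \hat{N}_{a,h}
        if n_hat > 0:          # assumption: empty sample cells are skipped
            w[cell] = factor[dom] * d[cell] * n_known / n_hat
    return w
```

Within each cell the weights sum to the known count times the cell's multiplier, so the overall sum reproduces \(N_a+N_{ab}+N_b\) when \(N_{ba,h}=N_{ab,h}\).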

On the other hand, when we only know the group totals in frame A and in frame B, i.e. \(N_{A,h}\) and \(N_{B,h}\), and not the distribution over groups in the intersection domain ab, we are again in a situation of incomplete post-stratification, like that of Sect. 3.2. Here,

$$\begin{aligned} {\varvec{x}}_k=\{[\delta _k(a)+\delta _k(ab)+\delta _k(ba)]\delta _k(h),[\delta _k(b)+\delta _k(ab)+\delta _k(ba)]\delta _k(h)\}_{h=1,\ldots ,H} \end{aligned}$$

and \({\varvec{t}}_x=\{(N_{A,h},N_{B,h})\}_{h=1,\ldots ,H}\). We have an analytic solution for the form of the weights only for the Euclidean distance case, but it does not take a simple tractable form such as that considered in Sect. 3.2. A similar situation also arises when, as in the case considered later in the application (Sect. 7), we do not know the distribution for, say, age-sex groups, but only the total for age and the total for sex, in each of the two frames A and B. This is another example of incomplete post-stratification, which employs a form of raking (depending on the distance function employed) to obtain the final set of weights (see also examples in Sect. 4).
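As an illustration of the raking idea for the frame-by-group totals \((N_{A,h},N_{B,h})\), the sketch below uses iterative proportional fitting, which is the usual way to obtain weights under a multiplicative (raking) distance when the auxiliary variables are indicators. It is not the authors' implementation: the function name, the stopping rule and the assumption that every frame-by-group cell contains sampled units are hypothetical choices, and the starting weights `d0` are assumed to be Hartley-type weights (overlap units already multiplied by \(\eta\) or \(1-\eta\)).

```python
import numpy as np

def rake_frame_by_group(d0, in_A, in_B, group, N_A, N_B, max_iter=500, tol=1e-10):
    """Iterative proportional fitting to the totals N_{A,h} and N_{B,h}.

    d0         : starting weights (assumed Hartley-type weights d_k)
    in_A, in_B : booleans, True if the unit belongs to frame A / frame B
    group      : group index h of each sampled unit (0, ..., H-1)
    N_A, N_B   : sequences of known totals per group in each frame
    """
    w = np.asarray(d0, dtype=float).copy()
    in_A = np.asarray(in_A, dtype=bool)
    in_B = np.asarray(in_B, dtype=bool)
    group = np.asarray(group)
    margins = [(in_A, N_A), (in_B, N_B)]
    for _ in range(max_iter):
        # One sweep: rescale frame-A cells, then frame-B cells.
        for member, targets in margins:
            for h, target in enumerate(targets):
                cell = member & (group == h)
                if cell.any():
                    w[cell] *= target / w[cell].sum()
        # Frame-B totals hold exactly after the sweep; check frame A.
        gap = max(abs(w[in_A & (group == h)].sum() - t) for h, t in enumerate(N_A))
        if gap < tol:
            break
    return w
```

Units in the overlap domain belong to both margins, which is why a single pass is generally not enough and the two rescaling steps are alternated until both sets of totals are matched.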

1.2 \(N_A\), \(N_B\), \(N_{ab}\) known and X known

Suppose that we know the frame sizes \(N_A\), \(N_B\) and \(N_{ab}\), and let the population total of an auxiliary numerical variable be available for the whole population \(X= \sum _{k=1}^N x_{k}\) and not only for frame A as in the previous section. The auxiliary vector is thus \({\varvec{x}}_k=(\delta _k(a),\delta _k(ab),\delta _k(ba),\delta _k(b), x_{k}) \) and the calibration constraints are those in (13) plus \(\sum _{k\in s}w_k^{\circ }x_{k}=X.\)

1.3 \(N_A\), \(N_B\) known and \(X_A\), \(Z_B\) known

Suppose that we know \(N_A\), \(N_B\) and the population total \(X_A\) defined in Sect. 3.3. In addition, we also know the population total of another auxiliary numerical variable \(z_B\) relative to frame B, namely \(Z_B=\sum _{k \in \mathcal {B}} z_{Bk}\). The auxiliary vector is

$$\begin{aligned} {\varvec{x}}_k= & {} (\delta _k(a)+\delta _k(ab)+\delta _k(ba),\delta _k(b)+\delta _k(ab)+\delta _k(ba),\\&[\delta _k(a)+\delta _k(ab)+\delta _k(ba)] x_{Ak},[\delta _k(b)+\delta _k(ab)+\delta _k(ba)] z_{Bk}) \end{aligned}$$

and the vector of known totals in this case is \({\varvec{t}}_x=(N_A,N_B,X_A,Z_B)\), which allows us to write the following calibration constraints

$$\begin{aligned}&\sum _{k\in s_a}w_k^{\circ }+\sum _{k\in s_{ab}}w_k^{\circ }+\sum _{k\in s_{ba}}w_k^{\circ }=N_A \nonumber \\&\sum _{k\in s_b}w_k^{\circ }+\sum _{k\in s_{ab}}w_k^{\circ }+\sum _{k\in s_{ba}}w_k^{\circ }=N_B,\nonumber \\&\sum _{k\in s_a}w_k^{\circ }x_{Ak}+\sum _{k\in s_{ab}}w_k^{\circ }x_{Ak}+\sum _{k\in s_{ba}}w_k^{\circ }x_{Ak}=X_A \nonumber \\&\sum _{k\in s_b}w_k^{\circ }z_{Bk}+\sum _{k\in s_{ab}}w_k^{\circ }z_{Bk}+\sum _{k\in s_{ba}}w_k^{\circ }z_{Bk}=Z_B. \end{aligned}$$
(23)
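To make the constraints in (23) concrete, here is a small sketch of calibration under the Euclidean (chi-square) distance, for which the weights take the linear form \(w_k=d_k(1+{\varvec{x}}_k{\varvec{\lambda }})\). It assumes \(q_k=1\) and Hartley-type starting weights; the helper names are hypothetical. Values of \(x_{Ak}\) for units outside frame A (and of \(z_{Bk}\) outside frame B) never enter the constraints, so any finite placeholder can be stored for them.

```python
import numpy as np

def aux_matrix_frames(in_A, in_B, x_A, z_B):
    """Rows are the auxiliary vectors x_k of Sect. 1.3 for the sampled units."""
    in_A = np.asarray(in_A, dtype=float)  # 1 if unit is in a, ab or ba
    in_B = np.asarray(in_B, dtype=float)  # 1 if unit is in b, ab or ba
    x_A = np.asarray(x_A, dtype=float)    # finite placeholder outside frame A
    z_B = np.asarray(z_B, dtype=float)    # finite placeholder outside frame B
    return np.column_stack([in_A, in_B, in_A * x_A, in_B * z_B])

def linear_calibration(d0, X, t_x):
    """Linear calibration weights w_k = d_k (1 + x_k lambda), with q_k = 1."""
    d0 = np.asarray(d0, dtype=float)
    X = np.asarray(X, dtype=float)
    t_x = np.asarray(t_x, dtype=float)
    t_hat = d0 @ X                    # estimate of t_x from the starting weights
    M = X.T @ (d0[:, None] * X)       # sum_k d_k x_k^T x_k
    lam = np.linalg.solve(M, t_x - t_hat)
    return d0 * (1.0 + X @ lam)
```

By construction the returned weights reproduce \((N_A,N_B,X_A,Z_B)\) exactly whenever the matrix `M` is nonsingular. Linear calibration does not guarantee positive weights, which is one reason other distance functions are considered.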

1.4 \(N_A\), \(N_B\), \(N_{ab}\) known and \(X_A\), \(X_B\) known

When we know the frame sizes \(N_A\), \(N_B\) and \(N_{ab}\) and the population totals of the same auxiliary variable x in the two frames \(X_A\) and \(X_B\), the auxiliary vector is

$$\begin{aligned} {\varvec{x}}_k= & {} (\delta _k(a),\delta _k(ab),\delta _k(ba),\delta _k(b), [\delta _k(a)+\delta _k(ab)+\delta _k(ba)] x_{k},\\&[\delta _k(b)+\delta _k(ab)+\delta _k(ba)] x_{k}) \end{aligned}$$

and the vector of known totals in this case is \({\varvec{t}}_x=(N_a,\eta N_{ab},(1-\eta )N_{ba},N_b,X_A,X_B)\).

Appendix 2: Technical assumptions and Proofs

1.1 Assumptions

A 1

Let \({\varvec{B}}_U=(\sum _{k=1}^N {\varvec{x}}_k^{T}{\varvec{x}}_k)^{-1}\sum _{k=1}^N{\varvec{x}}_k ^Ty_k\). Assume that \({\varvec{B}}=\lim _{N\rightarrow \infty } {\varvec{B}}_U\) exists; the distribution of \({\varvec{x}}_k\) and of \(y_k\), and the sampling designs are such that \(\sum _{k=1}^N {\varvec{x}}_k^T {\varvec{x}}_k\) is consistently estimated by \(\sum _{k \in s} d_k^{\circ }{\varvec{x}}_k^T {\varvec{x}}_k\) and \(\sum _{k=1}^N {\varvec{x}}_k^T y_k\) is consistently estimated by \(\sum _{k \in s}d_k^{\circ }{\varvec{x}}_k^T y_k\).

A 2

The limiting design covariance matrix of the normalized Hartley estimators,

$$\begin{aligned} {\varvec{{\varSigma }}}= \begin{bmatrix} {\varSigma }_{yy}&\quad {\varvec{{\varSigma }}}_{xy} \\ {\varvec{{\varSigma }}}_{xy}^{T}&\quad {\varvec{{\varSigma }}}_{xx}\\ \end{bmatrix} =\lim _{N\rightarrow \infty } \frac{n_N}{N^2} \begin{bmatrix} V(\hat{Y}_H)&\quad {\varvec{C}}({\hat{{\varvec{t}}}}_{xH},\hat{Y}_H) \\ {\varvec{C}}({\hat{{\varvec{t}}}}_{xH},\hat{Y}_H)^{T}&\quad {\varvec{V}}({\hat{{\varvec{t}}}}_{xH})\\ \end{bmatrix} \end{aligned}$$

is positive definite.

A 3

The normalized Hartley estimators of \({\varvec{t}}_x\) and Y are such that a central limit theorem holds:

$$\begin{aligned} \frac{\sqrt{n_N}}{N} \begin{bmatrix} \sum _{k \in s} d_k^{\circ } y_k - Y \\ \sum _{k \in s} d_k^{\circ } {\varvec{x}}_k^{T} - {\varvec{t}}_x^T\\ \end{bmatrix} \rightarrow ^{\mathcal {L}} N({\varvec{0}},{\varvec{{\varSigma }}}). \end{aligned}$$

A 4

The estimated covariance matrix for the Hartley estimator is design consistent in the sense that

$$\begin{aligned} \frac{n_N}{N^2} \begin{bmatrix} v(\hat{Y}_H)&{\varvec{c}}({\hat{{\varvec{t}}}}_{xH},\hat{Y}_H) \\ {\varvec{c}}({\hat{{\varvec{t}}}}_{xH},\hat{Y}_H)^{T}&{\varvec{v}}({\hat{{\varvec{t}}}}_{xH})\\ \end{bmatrix} -{\varvec{{\varSigma }}}=o_p(1), \end{aligned}$$

where \(v(\hat{Y}_H)=v(\hat{Y}_a+\eta \hat{Y}_{ab})+v((1-\eta )\hat{Y}_{ba}+\hat{Y}_b)\) and similarly for the others.

A 5

\(\phi _s({\varvec{{\uplambda }}})\) is defined on \(C=\bigcap _{k=1}^N\{{\varvec{{\uplambda }}}:{\varvec{x}}_k{\varvec{{\uplambda }}} \in \text{ Im }_k(d_k^{\circ })\}\). C is an open neighbourhood of \({\varvec{0}}\).

A 6

As \(N\rightarrow \infty \), \(\max ||{\varvec{x}}_k||=M<\infty \), \(k=1,\ldots ,N\), and \(\max F_k^{\prime \prime }(0)=M^\prime <\infty \), where \(F_k^{\prime \prime }(\cdot )\) is the second derivative of \(F_k(\cdot )\).

Proof of Theorem 1

By assumptions A1 and A2 we have

$$\begin{aligned} \hat{Y}_{\mathrm{GREG}}-Y&=\hat{Y}_H+({\varvec{t}}_x-{\hat{{\varvec{t}}}}_{xH}){\varvec{B}}_U-Y+({\varvec{t}}_x-{\hat{{\varvec{t}}}}_{xH})({\hat{{\varvec{\beta }}}}^{\circ }-{\varvec{B}}_U)\\&=\hat{Y}_H+({\varvec{t}}_x-{\hat{{\varvec{t}}}}_{xH}){\varvec{B}}_U-Y+O_p(Nn_N^{-1/2})o_p(1). \end{aligned}$$

Now, \(\hat{Y}_H+({\varvec{t}}_x-{\hat{{\varvec{t}}}}_{xH}){\varvec{B}}_U\) is such that a central limit theorem holds by A2 and A3, i.e.

$$\begin{aligned} \frac{\sqrt{n_N}}{N}(\hat{Y}_H+({\varvec{t}}_x-{\hat{{\varvec{t}}}}_{xH}){\varvec{B}}_U-Y)\rightarrow ^{\mathcal {L}}N(0,\nu ^2) \end{aligned}$$

where \(\nu ^2={\varSigma }_{yy}-2{\varvec{{\varSigma }}}_{xy}{\varvec{B}}+{\varvec{B}}^{T}{\varvec{{\varSigma }}}_{xx}{\varvec{B}}\). Now, \(N^{-2}n_NV(\hat{t}_{ eH })\rightarrow \nu ^2\) as \(N\rightarrow \infty \), so that \(\hat{Y}_H+ ({\varvec{t}}_x-{\hat{{\varvec{t}}}}_{xH}){\varvec{B}}_U-Y=O_p(Nn_N^{-1/2})\) and the result follows. \(\square \)

Proof of Theorem 2

Let \(\tilde{y}_k={\varvec{x}}_k{\varvec{B}}_U\) and \(\hat{y}_k={\varvec{x}}_k{\hat{{\varvec{\beta }}}}^{\circ }\). Then

$$\begin{aligned} v(\hat{t}_{\hat{e}H})&=v(\hat{t}_{\hat{e}H}+\hat{t}_{ eH }-\hat{t}_{ eH })\nonumber \\&=v\left( \sum _{k \in s}d_k^{\circ }\hat{e}_k+\sum _{k \in s}d_k^{\circ }e_k-\sum _{k \in s}d_k^{\circ }e_k\right) \nonumber \\&=v\left( \sum _{k \in s}d_k^{\circ }e_k+\sum _{k \in s}d_k^{\circ }(y_k-\hat{y}_k-y_k+\tilde{y}_k)\right) \nonumber \\&=v(\hat{t}_{ eH })+v(\hat{t}_{\tilde{y}-\hat{y},H})+2c(\hat{t}_{ eH },\hat{t}_{\tilde{y}-\hat{y},H}). \end{aligned}$$
(24)

Now, by A1, A2 and A4, we have

  1. \(v(\hat{t}_{ eH })=V(\hat{t}_{ eH })+o_p(N^2n_N^{-1})\),

  2. \(v(\hat{t}_{\tilde{y}-\hat{y},H})=v(\sum _{k \in s}d_k^{\circ }{\varvec{x}}_k({\varvec{B}}_U-{\hat{{\varvec{\beta }}}}^{\circ }))=({\varvec{B}}_U-{\hat{{\varvec{\beta }}}}^{\circ })^{T}v({\hat{{\varvec{t}}}}_{xH}) ({\varvec{B}}_U-{\hat{{\varvec{\beta }}}}^{\circ }) =o_p(1)O_p(N^2n_N^{-1})o_p(1)\),

  3. \(c(\hat{t}_{ eH },\hat{t}_{\tilde{y}-\hat{y},H})=c\left( \sum _{k \in s}d_k^{\circ }e_k,\sum _{k \in s}d_k^{\circ }{\varvec{x}}_k({\varvec{B}}_U-{\hat{{\varvec{\beta }}}}^{\circ })\right) ={\varvec{c}}(\hat{t}_{ eH },{\hat{{\varvec{t}}}}_{xH})({\varvec{B}}_U-{\hat{{\varvec{\beta }}}}^{\circ }) =O_p(N^2n_N^{-1})o_p(1)\).
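For readers who want the last step spelled out, the three rates can be read back into (24) using \(o_p(1)\,O_p(a_N)=o_p(a_N)\). Under that standard rule, and nothing beyond the three items listed here, the decomposition appears to collapse as follows; the formal statement of the theorem is given in the main text and is not reproduced in this appendix.

$$\begin{aligned} v(\hat{t}_{\hat{e}H})&=V(\hat{t}_{ eH })+o_p(N^2n_N^{-1})+o_p(N^2n_N^{-1})+2\,o_p(N^2n_N^{-1})\\&=V(\hat{t}_{ eH })+o_p(N^2n_N^{-1}). \end{aligned}$$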

\(\square \)

Proof of Theorem 3

Using Result 3 in Deville and Särndal (1992)

$$\begin{aligned} {\varvec{{\uplambda }}}=\left( \sum _{k \in s}d_k^{\circ }q_k\,{\varvec{x}}_k^T{\varvec{x}}_k\right) ^{-1}({\varvec{t}}_x-{\hat{{\varvec{t}}}}_{xH})^{T}+O_p(n_N^{-1}), \end{aligned}$$

\(w_k=d_k^{\circ }F(q_k{\varvec{x}}_k{\varvec{{\uplambda }}})=:d_k^{\circ }(1+q_k{\varvec{x}}_k{\varvec{{\uplambda }}})+\epsilon _k(q_k{\varvec{x}}_k{\varvec{{\uplambda }}})\). Assumption A6 ensures that \(\epsilon _k(u)=O_p(u^2)\), therefore

$$\begin{aligned} \hat{Y}_{\mathrm{CAL}}=\hat{Y}_{\mathrm{GREG}}+O_p(Nn_N^{-1})+O_p(Nn_N^{-2}). \end{aligned}$$

\(\square \)
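In the linear case \(F(u)=1+u\) the remainder \(\epsilon _k\) is identically zero, so the calibration and GREG estimators should coincide exactly and not only asymptotically. The sketch below checks this numerically on made-up data. It assumes \(q_k=1\) and takes \({\hat{{\varvec{\beta }}}}^{\circ }\) to be the weighted least-squares coefficient suggested by the form of \({\varvec{B}}_U\) in A1; both are assumptions, as are the function and variable names.

```python
import numpy as np

def greg_and_linear_cal(d0, X, y, t_x):
    """Return (Y_GREG, Y_CAL) for linear calibration with q_k = 1."""
    d0 = np.asarray(d0, dtype=float)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    t_x = np.asarray(t_x, dtype=float)
    M = X.T @ (d0[:, None] * X)                     # sum_k d_k x_k^T x_k
    beta_hat = np.linalg.solve(M, X.T @ (d0 * y))   # assumed form of beta-hat
    t_hat = d0 @ X                                  # estimate of t_x
    y_greg = d0 @ y + (t_x - t_hat) @ beta_hat      # GREG form
    lam = np.linalg.solve(M, t_x - t_hat)           # Lagrange multipliers
    w = d0 * (1.0 + X @ lam)                        # linear calibration weights
    y_cal = w @ y
    return y_greg, y_cal

# Made-up illustration: an intercept plus one numeric auxiliary variable.
rng = np.random.default_rng(0)
n = 30
X_demo = np.column_stack([np.ones(n), rng.normal(size=n)])
d_demo = rng.uniform(5.0, 15.0, size=n)
y_demo = 2.0 + 3.0 * X_demo[:, 1] + rng.normal(size=n)
y_greg, y_cal = greg_and_linear_cal(d_demo, X_demo, y_demo, [320.0, 10.0])
```

For other choices of \(F\) the two numbers differ by the remainder terms bounded in the proof.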

Cite this article

Ranalli, M.G., Arcos, A., Rueda, M.d.M. et al. Calibration estimation in dual-frame surveys. Stat Methods Appl 25, 321–349 (2016). https://doi.org/10.1007/s10260-015-0336-5

  • DOI: https://doi.org/10.1007/s10260-015-0336-5
