Abstract
Survey statisticians use auxiliary information to improve estimates. One important example is calibration estimation, which constructs new weights that match benchmark constraints on auxiliary variables while remaining “close” to the design weights. Multiple-frame surveys are increasingly used by statistical agencies and private organizations to reduce sampling costs and/or avoid frame undercoverage errors. Several ways of combining estimates derived from such frames have been proposed elsewhere; in this paper, we extend the calibration paradigm, previously used for single-frame surveys, to estimate the total of a variable of interest in a dual-frame survey. Calibration is a general tool that allows auxiliary information from the two frames to be included. It also incorporates, as special cases, certain dual-frame estimators that have been proposed previously. The theoretical properties of our class of estimators are derived and discussed, and simulation studies are conducted to compare the efficiency of the procedure using different sets of auxiliary variables. Finally, the proposed methodology is applied to real data obtained from the Barometer of Culture of Andalusia survey.
References
Arcos A, Molina D, Rueda M, Ranalli MG (2015) Frames2: a package for estimation in dual frame surveys. R J 7(1):52–72
Bankier MD (1986) Estimators based on several stratified samples with applications to multiple frame surveys. J Am Stat Assoc 81:1074–1079
Berger YG, Muñoz JF, Rancourt E (2009) Variance estimation of survey estimates calibrated on estimated control totals. An application to the extended regression estimator and the regression composite estimator. Comput Stat Data Anal 53(7):2596–2604
Chen S, Kim JK (2014) Population empirical likelihood for nonparametric inference in survey sampling. Stat Sin 24:335–355
Deville JC (2005) Calibration: past, present and future? Paper presented at the Workshop on “Calibration tools for survey statisticians” Neuchâtel
Deville JC, Särndal CE (1992) Calibration estimators in survey sampling. J Am Stat Assoc 87:376–382
Deville JC, Särndal CE, Sautory O (1993) Generalized raking procedures in survey sampling. J Am Stat Assoc 88:1013–1020
Fuller WA, Burmeister LF (1972) Estimators for samples selected from two overlapping frames. In: Proceedings of social science section of The American Statistical Association
Hartley HO (1962) Multiple frame surveys. In: Proceedings of the social statistics section, American Statistical Association, pp 203–206
Hidiroglou M (2001) Double sampling. Surv Methodol 27(2):143–154
Isaki CT, Fuller WA (1982) Survey design under the regression superpopulation model. J Am Stat Assoc 77(377):89–96
Kalton G, Anderson DW (1986) Sampling rare populations. J R Stat Soc Ser A (General) 149:65–82
Kott PS (2011) A nearly pseudo-optimal method for keeping calibration weights from falling below unity in the absence of nonresponse or frame errors. Pak J Stat 27(4):391–396
Kott PS (2014) Calibration weighting when model and calibration variables can differ. In: Mecatti F, Conti PL, Ranalli MG (eds) Contributions to sampling statistics. Springer, Berlin, pp 1–18
Kott PS, Chang T (2010) Using calibration weighting to adjust for nonignorable unit nonresponse. J Am Stat Assoc 105:1265–1275
Lohr SL (2009) Multiple-frame surveys. Handbook of statistics 29:71–88
Lohr SL, Rao JNK (2000) Inference from dual frame surveys. J Am Stat Assoc 95:271–280
Lohr SL, Rao JNK (2006) Estimation in multiple-frame surveys. J Am Stat Assoc 101(475):1019–1030
Mecatti F (2007) A single frame multiplicity estimator for multiple frame surveys. Surv Methodol 33(2):151–157
Rao JNK, Wu C (2010) Pseudo-empirical likelihood inference for multiple frame surveys. J Am Stat Assoc 105(492):1494–1503
Renssen RH, Nieuwenbroek NJ (1997) Aligning estimates for common variables in two or more sample surveys. J Am Stat Assoc 92(437):368–374
Särndal CE (2007) The calibration approach in survey theory and practice. Surv Methodol 33(2):99–119
Särndal CE, Lundström S (2005) Estimation in surveys with nonresponse. Wiley, New York
Singh AC, Mecatti F (2011) Generalized multiplicity-adjusted Horvitz-Thompson estimation as a unified approach to multiple frame surveys. J Off Stat 27(4):1–19
Skinner CJ (1991) On the efficiency of raking ratio estimation for multiple frame surveys. J Am Stat Assoc 86:779–784
Skinner CJ, Rao JNK (1996) Estimation in dual frame surveys with complex designs. J Am Stat Assoc 91:349–356
Tillé Y, Matei A (2006) The R package sampling, a software tool for training in official statistics and survey sampling. In: Proceedings in computational statistics, COMPSTAT’06. Physica-Verlag/Springer, Berlin, pp 1473–1482
Wolter K (2003) Introduction to variance estimation. Springer, New York
Wu C (2005) Algorithms and R codes for the pseudo empirical likelihood method in survey sampling. Surv Methodol 31(2):239
Wu C, Rao JNK (2006) Pseudo-empirical likelihood ratio confidence intervals for complex surveys. Can J Stat 34(3):359–375
Acknowledgments
The authors are grateful to Manuel Trujillo (IESA) for providing data and information about the BACU Survey and to Jean-Claude Deville for useful suggestions on distance metrics in calibration. This study was supported by Ministerio de Educación, Cultura y Deporte (grant MTM2012-35650, Spain), by Consejería de Economía, Innovación, Ciencia y Empleo (grant SEJ2954, Junta de Andalucía, Spain) and by the project PRINSURWEY (grant 2012F42NS8, Italy).
Appendices
Appendix 1: Other examples of the definition of auxiliary variable vector according to available auxiliary information
1.1 Population totals for group membership indicators are known
Let the population \(\mathcal {U}\) be divided into \(H\) mutually exclusive groups \(\mathcal {U}_h\), for \(h=1,\ldots ,H\), such that \(\bigcup _{h=1}^H \mathcal {U}_h=\mathcal {U}\), and let \(\delta _k(h)\) be the indicator variable that takes value 1 if unit \(k\in \mathcal {U}_h\) and 0 otherwise, for \(k=1,\ldots ,N\) and \(h=1,\ldots ,H\). Then, \(\sum _{k=1}^N\delta _k(h)=N_h\) and \(\sum _{h=1}^{H}N_h=N\). Now, consider the situation in which we know the population total of such indicator variables for each of the four domains, i.e. \(N_{a,h}=\sum _{k\in a}\delta _k(h)\), \(N_{ab,h}=\sum _{k\in ab}\delta _k(h)\), \(N_{ba,h}=\sum _{k\in ba}\delta _k(h)=N_{ab,h}\), \(N_{b,h}=\sum _{k\in b}\delta _k(h)\), for \(h=1,\ldots ,H\). Note that \(N_{a,h}=\sum _{k\in a}\delta _k(h)=\sum _{k=1}^N\delta _k(a)\delta _k(h)\) and similarly for the other cases. Of course, this type of auxiliary information implies that we also know the sizes \(N_A\), \(N_B\) and \(N_{ab}\) considered in Sect. 3.1. Indeed, the case of Sect. 3.1 is the special case \(H=1\) of the present one.
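In this case the vector of auxiliary variables is defined for \(k=1,\ldots ,N\) by
\[{\varvec{x}}_k=\{(\delta _k(a)\delta _k(h),\,\delta _k(ab)\delta _k(h),\,\delta _k(ba)\delta _k(h),\,\delta _k(b)\delta _k(h))\}_{h=1,\ldots ,H}\]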
and the vector of known totals is \({\varvec{t}}_x=\{(N_{a,h},\eta N_{ab,h},(1-\eta )N_{ba,h},N_{b,h})\}_{h=1,\ldots ,H}\). As in Sect. 3.1 the minimization problem has an analytic solution irrespective of the distance function employed. This solution is given by
where \(\hat{N}_{a,h}=\sum _{k\in s_a}d_{Ak}\delta _k(h)\) and similarly for the other size estimators. This is another case of complete post-stratification. The final estimator is more efficient than the Hartley estimator to the extent that groups collect units with a similar value of the variable of interest.
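As a rough illustration of complete post-stratification, the sketch below uses the standard fact that calibrating on cell indicators amounts to a ratio adjustment within each domain-by-group cell, whatever the distance function. The function name `poststratify`, the cell labels, the toy weights and the benchmark totals are all hypothetical; the starting weights are assumed to be Hartley-type weights, already multiplied by \(\eta \) or \(1-\eta \) in the overlap domains.

```python
import numpy as np

def poststratify(d_circ, cell, targets):
    """Ratio-adjust starting weights so each cell reproduces its benchmark.

    d_circ  : starting weights d_k° (assumed to include eta / 1 - eta
              in the overlap domains ab / ba)
    cell    : one label per sampled unit (a domain-by-group combination)
    targets : dict mapping cell label -> benchmark total for that cell
    """
    d_circ = np.asarray(d_circ, dtype=float)
    cell = np.asarray(cell)
    w = d_circ.copy()
    for label, target in targets.items():
        in_cell = cell == label
        estimated = d_circ[in_cell].sum()  # weighted count estimate for the cell
        if estimated <= 0:
            raise ValueError(f"cell {label!r} has no sampled units")
        w[in_cell] = d_circ[in_cell] * target / estimated
    return w

# Hypothetical toy sample with H = 2 groups and eta = 0.5.
eta = 0.5
cell = np.array(["a,1", "a,1", "a,2", "ab,1", "ab,2",
                 "ba,1", "ba,2", "b,1", "b,2", "b,2"])
d_circ = np.array([10, 10, 10, 5, 5, 10, 10, 20, 20, 20], dtype=float)

# Assumed benchmarks: N_{a,h}, eta * N_{ab,h}, (1 - eta) * N_{ba,h}, N_{b,h}.
targets = {
    "a,1": 30.0, "a,2": 15.0,
    "ab,1": eta * 12.0, "ab,2": eta * 8.0,
    "ba,1": (1 - eta) * 12.0, "ba,2": (1 - eta) * 8.0,
    "b,1": 25.0, "b,2": 35.0,
}
w = poststratify(d_circ, cell, targets)
```

In this toy setting the adjusted weights reproduce every cell benchmark, and they sum to the assumed population size \(N_a+N_{ab}+N_b\) because the two overlap shares add up to \(N_{ab}\).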
On the other hand, when we only know the population total in frame A and in frame B, i.e. we do not know the distribution for the intersection domain ab, then we are again in a situation of incomplete post-stratification, like that of Sect. 3.2. Here,
and \({\varvec{t}}_x=\{(N_{A,h},N_{B,h})\}_{h=1,\ldots ,H}\). We have an analytic solution for the form of the weights only for the Euclidean distance case, but it does not take a simple tractable form such as that considered in Sect. 3.2. A similar situation also arises when, as in the case considered later in the application (Sect. 7), we do not know the distribution for, say, age-sex groups, but only the total for age and the total for sex, in each of the two frames A and B. This is another example of incomplete post-stratification, which employs a form of raking (depending on the distance function employed) to obtain the final set of weights (see also examples in Sect. 4).
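To make the raking idea concrete, here is a minimal sketch of classical raking (iterative proportional fitting) on two margins for a single set of starting weights. It is a generic illustration rather than the dual-frame procedure of the paper: the function name `rake`, the age and sex codes, the starting weights and the margin totals are hypothetical, and the two sets of margins are assumed to sum to the same total.

```python
import numpy as np

def rake(d, row, col, row_totals, col_totals, tol=1e-10, max_iter=200):
    """Alternately rescale weights to match row margins, then column margins.

    d          : starting weights
    row, col   : integer category codes (0, 1, ...) for each sampled unit
    row_totals : known totals for the row categories (e.g. age groups)
    col_totals : known totals for the column categories (e.g. sex)
    """
    w = np.asarray(d, dtype=float).copy()
    row = np.asarray(row)
    col = np.asarray(col)
    row_totals = np.asarray(row_totals, dtype=float)
    col_totals = np.asarray(col_totals, dtype=float)
    for _ in range(max_iter):
        # Match the row margins.
        row_est = np.bincount(row, weights=w, minlength=len(row_totals))
        w *= (row_totals / row_est)[row]
        # Match the column margins (this may disturb the rows slightly).
        col_est = np.bincount(col, weights=w, minlength=len(col_totals))
        w *= (col_totals / col_est)[col]
        # Stop once the row margins are also reproduced.
        row_est = np.bincount(row, weights=w, minlength=len(row_totals))
        if np.max(np.abs(row_est - row_totals)) < tol:
            break
    return w

# Hypothetical toy data: three age groups, two sexes, margins summing to 60.
age = np.array([0, 0, 1, 1, 2, 2])
sex = np.array([0, 1, 0, 1, 0, 1])
d = np.array([8.0, 12.0, 10.0, 10.0, 9.0, 11.0])
age_totals = np.array([25.0, 20.0, 15.0])
sex_totals = np.array([28.0, 32.0])

w_raked = rake(d, age, sex, age_totals, sex_totals)
```

Only the margins are matched here; the age-by-sex cell counts are not controlled, which is what distinguishes incomplete from complete post-stratification.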
1.2 \(N_A\), \(N_B\), \(N_{ab}\) known and X known
Suppose that we know the frame sizes \(N_A\), \(N_B\) and \(N_{ab}\), and let the population total of an auxiliary numerical variable be available for the whole population \(X= \sum _{k=1}^N x_{k}\) and not only for frame A as in the previous section. The auxiliary vector is thus \({\varvec{x}}_k=(\delta _k(a),\delta _k(ab),\delta _k(ba),\delta _k(b), x_{k}) \) and the calibration constraints are those in (13) plus \(\sum _{k\in s}w_k^{\circ }x_{k}=X.\)
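The following sketch shows how constraints of this type could be imposed under the Euclidean (chi-square) distance with \(q_k=1\), for which the calibrated weights have the well-known closed form \(w_k=d_k^{\circ }(1+{\varvec{x}}_k{\varvec{{\uplambda }}})\). The function name `linear_calibration`, the toy sample, the value of \(\eta \) and all benchmark totals are hypothetical.

```python
import numpy as np

def linear_calibration(d, X, t_x):
    """Calibrated weights under the chi-square distance with q_k = 1.

    Solves (sum_k d_k x_k^T x_k) lam = t_x - sum_k d_k x_k for lam,
    then returns w_k = d_k * (1 + x_k lam).
    """
    d = np.asarray(d, dtype=float)
    X = np.asarray(X, dtype=float)
    t_x = np.asarray(t_x, dtype=float)
    t_hat = X.T @ d                      # estimates based on the starting weights
    M = X.T @ (d[:, None] * X)           # sum_k d_k x_k^T x_k
    lam = np.linalg.solve(M, t_x - t_hat)
    return d * (1.0 + X @ lam)

# Hypothetical toy sample: domain membership and a numerical auxiliary variable.
eta = 0.5
domain = np.array(["a", "a", "ab", "ab", "ba", "ba", "b", "b"])
x = np.array([1.0, 3.0, 2.0, 4.0, 2.5, 3.5, 5.0, 6.0])
# Starting weights, assumed to carry eta in ab and 1 - eta in ba.
d_circ = np.array([10.0, 10.0, 5.0, 5.0, 5.0, 5.0, 10.0, 10.0])

# Auxiliary vector (delta(a), delta(ab), delta(ba), delta(b), x_k).
indicators = np.column_stack(
    [(domain == g).astype(float) for g in ("a", "ab", "ba", "b")]
)
X = np.column_stack([indicators, x])

# Assumed benchmarks: N_a, eta * N_ab, (1 - eta) * N_ba, N_b and X.
t_x = np.array([22.0, eta * 18.0, (1 - eta) * 18.0, 21.0, 220.0])

w_cal = linear_calibration(d_circ, X, t_x)
```

The linear form does not guarantee positive weights; other distance functions, such as the raking one, are typically chosen when positivity matters.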
1.3 \(N_A\), \(N_B\) known and \(X_A\), \(Z_B\) known
Suppose that we know \(N_A\), \(N_B\) and the population total \(X_A\) defined in Sect. 3.3. In addition, we also know the population total of another auxiliary numerical variable \(z_B\) relative to frame B, whose total is \(Z_B=\sum _{k \in \mathcal {B}} z_{Bk}\). The auxiliary vector is
and the vector of known totals in this case is \({\varvec{t}}_x=(N_A,N_B,X_A,Z_B)\), which allows us to write the following calibration constraints
1.4 \(N_A\), \(N_B\), \(N_{ab}\) known and \(X_A\), \(X_B\) known
When we know the frame sizes \(N_A\), \(N_B\) and \(N_{ab}\) and the population totals of the same auxiliary variable x in the two frames \(X_A\) and \(X_B\), the auxiliary vector is
and the vector of known totals in this case is \({\varvec{t}}_x=(N_a,\eta N_{ab},(1-\eta )N_{ba},N_b,X_A,X_B)\).
Appendix 2: Technical assumptions and Proofs
1.1 Assumptions
A 1
Let \({\varvec{B}}_U=(\sum _{k=1}^N {\varvec{x}}_k^{T}{\varvec{x}}_k)^{-1}\sum _{k=1}^N{\varvec{x}}_k ^Ty_k\). Assume that \({\varvec{B}}=\lim _{N\rightarrow \infty } {\varvec{B}}_U\) exists; the distribution of \({\varvec{x}}_k\) and of \(y_k\), and the sampling designs are such that \(\sum _{k=1}^N {\varvec{x}}_k^T {\varvec{x}}_k\) is consistently estimated by \(\sum _{k \in s} d_k^{\circ }{\varvec{x}}_k^T {\varvec{x}}_k\) and \(\sum _{k=1}^N {\varvec{x}}_k^T y_k\) is consistently estimated by \(\sum _{k \in s}d_k^{\circ }{\varvec{x}}_k^T y_k\).
A 2
The limiting design covariance matrix of the normalized Hartley estimators,
is positive definite.
A 3
The normalized Hartley estimators of \({\varvec{t}}_x\) and Y are such that a central limit theorem holds:
A 4
The estimated covariance matrix for the Hartley estimator is design consistent in the sense that
where \(v(\hat{Y}_H)=v(\hat{Y}_a+\eta \hat{Y}_{ab})+v((1-\eta )\hat{Y}_{ba}+\hat{Y}_b)\) and similarly for the others.
A 5
\(\phi _s({\varvec{{\uplambda }}})\) is defined on \(C=\bigcap _{k=1}^N\{{\varvec{{\uplambda }}}:{\varvec{x}}_k{\varvec{{\uplambda }}} \in \text{ Im }_k(d_k^{\circ })\}\). C is an open neighbourhood of \({\varvec{0}}\).
A 6
As \(N\rightarrow \infty \), \(\max ||{\varvec{x}}_k||=M<\infty \), \(k=1,\ldots ,N\), and \(\max F_k^{\prime \prime }(0)=M^\prime <\infty \), where \(F_k^{\prime \prime }(\cdot )\) is the second derivative of \(F_k(\cdot )\).
Proof of Theorem 1
By assumptions A1 and A2 we have
Now, by A2 and A3, a central limit theorem holds for \(\hat{Y}_H+({\varvec{t}}_x-{\hat{{\varvec{t}}}}_{xH}){\varvec{B}}_U\), i.e.
where \(\nu ^2={\varSigma }_{yy}-2{\varvec{{\varSigma }}}_{xy}{\varvec{B}}+{\varvec{B}}^{T}{\varvec{{\varSigma }}}_{xx}{\varvec{B}}\). Now, \(N^{-2}n_NV(\hat{t}_{ eH })\rightarrow \nu ^2\) as \(N\rightarrow \infty \), so that \(\hat{Y}_H+ ({\varvec{t}}_x-{\hat{{\varvec{t}}}}_{xH}){\varvec{B}}_U-Y=O_p(Nn_N^{-1/2})\) and the result follows. \(\square \)
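The regression-type expression \(\hat{Y}_H+({\varvec{t}}_x-{\hat{{\varvec{t}}}}_{xH}){\varvec{B}}_U\) used in the proof has an exact finite-sample counterpart. As a numerical check, the sketch below verifies that, under the Euclidean distance with \(q_k=1\), the calibration estimator coincides with the same expression when \({\varvec{B}}_U\) is replaced by a weighted sample analogue. The simulated data, the benchmark totals and the form assumed for that sample coefficient are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Hypothetical sample: starting weights, auxiliary vector (1, x_k), study variable.
d = rng.uniform(5.0, 15.0, size=n)
X = np.column_stack([np.ones(n), rng.normal(10.0, 2.0, size=n)])
y = 3.0 + 2.0 * X[:, 1] + rng.normal(0.0, 1.0, size=n)
t_x = np.array([520.0, 5100.0])          # assumed known totals

t_hat = X.T @ d                          # weighted estimate of t_x
Y_hat = d @ y                            # weighted estimate of Y
M = X.T @ (d[:, None] * X)               # sum_k d_k x_k^T x_k

# Assumed sample analogue of B_U: weighted least-squares coefficient.
beta_hat = np.linalg.solve(M, X.T @ (d * y))
regression_form = Y_hat + (t_x - t_hat) @ beta_hat

# Linear calibration weights w_k = d_k (1 + x_k lam) and the resulting estimator.
lam = np.linalg.solve(M, t_x - t_hat)
w = d * (1.0 + X @ lam)
calibration_form = w @ y
```

The two quantities agree up to floating-point error because \({\varvec{{\uplambda }}}\) and the sample coefficient are built from the same matrix \(\sum _{k\in s}d_k{\varvec{x}}_k^T{\varvec{x}}_k\).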
Proof of Theorem 2
Let \(\tilde{y}_k={\varvec{x}}_k{\varvec{B}}_U\) and \(\hat{y}_k={\varvec{x}}_k{\hat{{\varvec{\beta }}}}^{\circ }\). Then
Now, by A1, A2 and A4, we have
1. \(v(\hat{t}_{ eH })=V(\hat{t}_{ eH })+o_p(N^2n_N^{-1})\),
2. \(v(\hat{t}_{\tilde{y}-\hat{y},H})=v(\sum _{k \in s}d_k^{\circ }{\varvec{x}}_k({\varvec{B}}_U-{\hat{{\varvec{\beta }}}}^{\circ }))=({\varvec{B}}_U-{\hat{{\varvec{\beta }}}}^{\circ })^{T}v({\hat{{\varvec{t}}}}_{xH}) ({\varvec{B}}_U-{\hat{{\varvec{\beta }}}}^{\circ }) =o_p(1)O_p(N^2n_N^{-1})o_p(1)\),
3. \(c(\hat{t}_{ eH },\hat{t}_{\tilde{y}-\hat{y},H})=c\left( \sum _{k \in s}d_k^{\circ }e_k,\sum _{k \in s}d_k^{\circ }{\varvec{x}}_k({\varvec{B}}_U-{\hat{{\varvec{\beta }}}}^{\circ })\right) ={\varvec{c}}(\hat{t}_{ eH },{\hat{{\varvec{t}}}}_{xH})({\varvec{B}}_U-{\hat{{\varvec{\beta }}}}^{\circ }) =O_p(N^2n_N^{-1})o_p(1)\).
\(\square \)
Proof of Theorem 3
Using Result 3 in Deville and Särndal (1992), \(w_k=d_k^{\circ }F(q_k{\varvec{x}}_k{\varvec{{\uplambda }}})=:d_k^{\circ }(1+q_k{\varvec{x}}_k{\varvec{{\uplambda }}})+\epsilon _k(q_k{\varvec{x}}_k{\varvec{{\uplambda }}})\). Assumption A6 ensures that \(\epsilon _k(u)=O_p(u^2)\); therefore
\(\square \)
Cite this article
Ranalli, M.G., Arcos, A., Rueda, M.d.M. et al. Calibration estimation in dual-frame surveys. Stat Methods Appl 25, 321–349 (2016). https://doi.org/10.1007/s10260-015-0336-5
DOI: https://doi.org/10.1007/s10260-015-0336-5