Abstract
In 2015, the European Commission drafted a framework regulation for integrated European social statistics. This integration covers the Labour Force Survey, the Statistics on Income and Living Conditions, and others. To avoid an undue response burden, administrative and other data sources are to be considered in order to achieve accurate survey estimates. Combining information from different data sources has become a field of growing research interest among statistical offices and other institutions. In the statistical literature, this problem is known as data fusion or statistical matching and is widely regarded as a particular missing-data pattern. Assuming that budgets are limited and that only some additional information can be obtained to improve the quality of the data fusion, we investigate different scenarios for using these limited resources within an integrated system of household surveys. Our main objective is to develop a framework that supports both the estimation of statistical models using several surveys and the estimation of classical totals for the sub-classes and areas that are of special interest for official statistics.
Acknowledgements
This research was conducted within the RIFOSS project, funded by the German Federal Statistical Office. The first author wishes to thank Allameh Tabatabai University, Tehran, Iran, for providing financial support while working on this paper and during a six-month visit to Trier University. Further, we thank the editor and two anonymous reviewers for their very valuable comments, which helped improve the paper.
Appendices
Appendix A: simulation study steps
Here, we briefly describe all steps used in our simulation study.
1. A fixed population (size N), variables of interest, and auxiliary information are defined.
2. The parameters of interest are defined, including the model parameters (regression coefficients of a specified model) and the small area means (as local parameters).
3. The proposed designs D0, D1, D2, D3 and D4 are defined.
4. A sample of size n, called the MC sample, is selected from the fixed population by simple random sampling without replacement (SRSWOR).
5. For D0, all parameters of interest are estimated based on the complete information from the MC sample.
6. For each of the designs D1, D2, D3 and D4, the following steps are performed.
7. The subsample sizes and, if needed, the overlap size are determined. For D1 and D2, the subsample sizes are \(n_1\) and \(n_2\), where \(n_1+n_2=n\). For D3 and D4, the subsample sizes are \(n_1\) and \(n_2\), and the horizontal overlap sample size is \(n_3\), where \(n_1+n_2-n_3=n\).
8. The MC sample (S) is randomly split into two disjoint (for D1 and D2) or overlapping (for D3 and D4) subsamples, called \(S_1\) and \(S_2\), where \(S_1 \cup S_2 = S\). The intersection \(S_1 \cap S_2\) is denoted by \(S_3\), where \(S_3=\emptyset\) for D1 and D2, and \(S_3 \ne \emptyset\) for D3 and D4.
9. The design is applied to the complete information available from the MC sample. According to the definition of the design, NA values are inserted into the dataset for those variables that are not asked of the corresponding sample units.
10. We use the function mice from the mice package of the statistical software R to impute the NA values in the dataset. Here, the number of imputations (M), the number of iterations, and the imputation method (e.g. predictive mean matching) are specified. The mice function then constructs M completed datasets.
11. To estimate the model parameters, the combined point estimates, the corresponding confidence intervals, and the fractions of missing information are obtained from the M completed datasets using the function pool of the mice package.
12. To estimate the small area means from the M completed datasets, we compute the estimator on each completed dataset using different methods (HT, GREG, SAE under a unit-level model, SAE under an area-level model). Then, for each method, the resulting estimators are combined (using Rubin's combination formula defined in Sect. 3.1.3) to obtain the overall small area estimates \(\hat{\mu }_{d, \mathrm{HT}}\), \(\hat{\mu }_{d, \mathrm{GREG}}\), \(\hat{\mu }_{d, \mathrm{BHF}}\) and \(\hat{\mu }_{d, \mathrm{FH}}\).
13. As this is a design-based Monte Carlo simulation study, we repeat steps 4–12 R times.
14. Finally, all measures of interest are derived.
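The splitting, masking, and combination steps (7–9 and 12) can be illustrated with a minimal sketch. It is written in Python with the standard library only and is not the R code used in the study. The function names, the use of None in place of NA, and the description of a design by two variable blocks (one asked only of \(S_1\), one only of \(S_2\)) are assumptions made for illustration. The pooling function uses the standard form of Rubin's rules; since Sect. 3.1.3 is not reproduced here, its agreement with the formula given there is also an assumption.

```python
import random


def split_with_overlap(unit_ids, n1, n2, n3=0, seed=None):
    """Randomly split a sample S into S1, S2 and their overlap S3.

    Sizes must satisfy n1 + n2 - n3 == len(unit_ids). With n3 == 0 the
    subsamples are disjoint (D1, D2); with n3 > 0 they overlap (D3, D4).
    """
    units = list(unit_ids)
    if n1 + n2 - n3 != len(units):
        raise ValueError("subsample sizes must satisfy n1 + n2 - n3 == n")
    if not 0 <= n3 <= min(n1, n2):
        raise ValueError("overlap size must lie between 0 and min(n1, n2)")
    random.Random(seed).shuffle(units)
    s3 = set(units[:n3])          # units belonging to both subsamples
    s1 = s3 | set(units[n3:n1])   # overlap plus n1 - n3 further units
    s2 = s3 | set(units[n1:])     # overlap plus n2 - n3 further units
    return s1, s2, s3


def apply_design(records, s1, s2, s1_only_vars, s2_only_vars):
    """Insert None (standing in for NA) where a variable was not asked.

    ASSUMPTION: a design is described by two hypothetical variable blocks,
    one asked only of S1 and one asked only of S2.
    """
    masked = {}
    for unit, row in records.items():
        new_row = dict(row)
        if unit not in s1:
            for var in s1_only_vars:
                new_row[var] = None
        if unit not in s2:
            for var in s2_only_vars:
                new_row[var] = None
        masked[unit] = new_row
    return masked


def pool_rubin(estimates, variances):
    """Combine M >= 2 completed-data estimates with Rubin's rules.

    Returns the pooled point estimate and the total variance
    T = W + (1 + 1/M) * B (within- plus inflated between-imputation variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need M >= 2 estimates with matching variances")
    q_bar = sum(estimates) / m
    within = sum(variances) / m
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    return q_bar, within + (1 + 1 / m) * between
```

For example, `split_with_overlap(range(10), 6, 7, 3)` returns subsamples of sizes 6 and 7 that share 3 units and together cover all 10 units of the sample.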
Appendix B: convergence diagnostics
Boxplots for groups of 10 subsequent iterations (for the version with 100 iterations) and groups of 100 subsequent iterations (for the version with 1000 iterations) help to assess whether convergence in distribution can be assumed (Figs. 9 and 10).
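The grouping behind such boxplots can be sketched as follows, in Python with the standard library only. The sketch assumes that a single scalar quantity (for instance the mean of the imputed values) is recorded at each iteration; which quantity the figures actually track is not stated in this appendix. The function name and the returned fields are hypothetical.

```python
import statistics


def block_summaries(chain, block_size):
    """Five-number summary for each block of consecutive iterations.

    `chain` holds one scalar per iteration (ASSUMPTION: e.g. the mean of the
    imputed values). Trailing iterations that do not fill a block are ignored.
    """
    summaries = []
    for start in range(0, len(chain) - block_size + 1, block_size):
        block = chain[start:start + block_size]
        q1, median, q3 = statistics.quantiles(block, n=4)
        summaries.append({
            "iterations": (start + 1, start + block_size),
            "min": min(block),
            "q1": q1,
            "median": median,
            "q3": q3,
            "max": max(block),
        })
    return summaries
```

If the summaries of later blocks resemble one another in location and spread, this is consistent with, though no proof of, convergence in distribution.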
Appendix C: coverage probabilities for small area estimates
The area-specific sample sizes mostly range from about 200 to 300, with outliers of 16, 68, and 97 among the small areas (shown as red crosses) and of 406 and 800 among the large areas (shown as blue triangles). Small, medium-size, and large areas are separated by the first and third quartiles of the area-specific sample sizes (Fig. 11).
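A minimal sketch of this quartile-based separation is given below, in Python with the standard library only. The handling of areas whose sample size equals a quartile (treated as medium-size here) and the quantile interpolation method are assumptions, as neither is specified in the text; the function name is hypothetical.

```python
import statistics


def classify_areas(sample_sizes):
    """Label each area as small, medium or large by sample-size quartiles.

    `sample_sizes` maps an area identifier to its sample size.
    ASSUMPTION: sizes strictly below Q1 are small, sizes strictly above Q3
    are large, and everything else (including ties) is medium.
    """
    q1, _, q3 = statistics.quantiles(sample_sizes.values(), n=4,
                                     method="inclusive")
    labels = {}
    for area, size in sample_sizes.items():
        if size < q1:
            labels[area] = "small"
        elif size > q3:
            labels[area] = "large"
        else:
            labels[area] = "medium"
    return labels
```

Applied to the area-specific sample sizes, a rule of this kind yields the three groups that are distinguished by plotting symbol in the figure.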
Cite this article
Kamgar, S., Meinfelder, F., Münnich, R. et al. Estimation within the new integrated system of household surveys in Germany. Stat Papers 61, 2091–2117 (2020). https://doi.org/10.1007/s00362-018-1023-z
DOI: https://doi.org/10.1007/s00362-018-1023-z