Nonparametric imputation method for nonresponse in surveys

Abstract

Many imputation methods are based on a statistical model that assumes the variable of interest is a noisy observation of a function of the auxiliary variables or covariates. Misspecification of this function may lead to severe errors in estimation and to misleading conclusions. Imputation techniques can therefore benefit from flexible formulations that can capture a wide range of patterns. We consider the use of smoothing splines within an additive model framework to estimate the functional dependence between the variable of interest and the auxiliary variables. The estimator obtained allows us to build an imputation model in the case of multiple auxiliary variables. The performance of our method is assessed via numerical experiments involving simulated and real data.
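The idea summarized above, fitting an additive model on the respondents and using it to predict the missing values, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it swaps the smoothing spline for a Gaussian kernel smoother so that it runs with the Python standard library alone, it ignores survey weights, and all function names (`kernel_smooth`, `fit_additive`, `predict_additive`, `impute`), the bandwidth and the simulated data are hypothetical.

```python
import math
import random


def kernel_smooth(x_train, r_train, x_eval, bandwidth):
    """Nadaraya-Watson estimate of E[r | x] at each point of x_eval.

    Assumption: a Gaussian kernel smoother stands in for the smoothing spline.
    """
    fitted = []
    for x0 in x_eval:
        w = [math.exp(-0.5 * ((x0 - xi) / bandwidth) ** 2) for xi in x_train]
        fitted.append(sum(wi * ri for wi, ri in zip(w, r_train)) / sum(w))
    return fitted


def fit_additive(X, y, bandwidth=0.1, n_iter=10):
    """Backfitting for y = alpha + sum_j f_j(x_j) + error, fitted on respondents."""
    n, p = len(X), len(X[0])
    alpha = sum(y) / n
    cols = [[row[j] for row in X] for j in range(p)]
    f = [[0.0] * n for _ in range(p)]        # f[j][i] approximates f_j(x_ij)
    partial = [[0.0] * n for _ in range(p)]  # partial residuals used to refit f_j
    shift = [0.0] * p                        # centring constants so that mean(f_j) = 0
    for _ in range(n_iter):
        for j in range(p):
            partial[j] = [
                y[i] - alpha - sum(f[k][i] for k in range(p) if k != j)
                for i in range(n)
            ]
            smooth = kernel_smooth(cols[j], partial[j], cols[j], bandwidth)
            shift[j] = sum(smooth) / n
            f[j] = [s - shift[j] for s in smooth]
    return {"alpha": alpha, "cols": cols, "partial": partial,
            "shift": shift, "bandwidth": bandwidth}


def predict_additive(model, X_new):
    """Evaluate alpha + sum_j f_j(x_j) at new covariate rows."""
    preds = [model["alpha"]] * len(X_new)
    for j, (xj, rj) in enumerate(zip(model["cols"], model["partial"])):
        new_xj = [row[j] for row in X_new]
        fj = kernel_smooth(xj, rj, new_xj, model["bandwidth"])
        preds = [cur + v - model["shift"][j] for cur, v in zip(preds, fj)]
    return preds


def impute(X, y):
    """Replace missing values of y (coded as None) by additive-model predictions."""
    resp = [i for i, v in enumerate(y) if v is not None]
    miss = [i for i, v in enumerate(y) if v is None]
    model = fit_additive([X[i] for i in resp], [y[i] for i in resp])
    preds = predict_additive(model, [X[i] for i in miss])
    completed = list(y)
    for i, v in zip(miss, preds):
        completed[i] = v
    return completed


# Hypothetical data: additive signal in two covariates, about 30% nonresponse.
random.seed(1)
X = [[random.random(), random.random()] for _ in range(200)]
truth = [math.sin(2 * math.pi * a) + b ** 2 for a, b in X]
y = [t + random.gauss(0, 0.1) if random.random() > 0.3 else None for t in truth]
completed = impute(X, y)
```

Observed values are left untouched; only the entries coded as `None` are replaced. A spline-based smoother could be substituted for `kernel_smooth` without changing the rest of the sketch.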

References

  1. Andreis F, Conti PL, Mecatti F (2018) On the role of weights rounding in applications of resampling based on pseudopopulations. Stat Neerl

  2. Andridge RR, Little RJA (2010) A review of hot deck imputation for survey non-response. Int Stat Rev 78:40–64

  3. Bellman R (1961) Adaptive control processes: a guided tour. Princeton University Press, Princeton

  4. Berg E, Kim J-K, Skinner C (2016) Imputation under informative sampling. J Surv Stat Methodol 4(4):436–462

  5. Breiman L (2001) Random forests. Mach Learn 45(1):5–32

  6. Central Statistical Office (1993) Family expenditure survey, 1992 [computer file]. Technical report, Colchester, Essex: UK Data Archive [distributor]. SN: 3064. https://doi.org/10.5255/UKDA-SN-3064-1

  7. Chauvet G, Deville J-C, Haziza D (2011) On balanced random imputation in surveys. Biometrika 98:459–471

  8. Da Silva DN, Opsomer JD (2006) A kernel smoothing method of adjusting for unit non-response in sample surveys. Can J Stat 34(4):563–579

  9. Da Silva DN, Opsomer JD (2009) Nonparametric propensity weighting for survey nonresponse through local polynomial regression. Surv Methodol 35(2):165–176

  10. Eubank RL (1999) Nonparametric regression and spline smoothing, 2nd edn. Marcel Dekker, New York

  11. Giommi A (1987) Nonparametric methods for estimating individual response probabilities. Surv Methodol 13(2):127–134

  12. Green PJ, Silverman BW (1994) Nonparametric regression and generalized linear models. Chapman & Hall, Boca Raton

  13. Gross ST (1980) Mean estimation in sample surveys. In: Proceedings of the survey research methods section. American Statistical Association, pp 181–184

  14. Hastie TJ, Tibshirani RJ (1986) Generalized additive models. Stat Sci 1(3):297–318

  15. Hastie TJ, Tibshirani RJ (1990) Generalized additive models. Chapman & Hall, Boca Raton

  16. Haziza D (2009) Imputation and inference in the presence of missing data. In: Rao C (ed) Handbook of statistics, vol 29. Elsevier, Amsterdam, pp 215–246

  17. Haziza D, Rao JNK (2005) Inference for domain means and totals under imputation for missing data. Can J Stat 33:149–161

  18. Lee TCM (2003) Smoothing parameter selection for smoothing splines: a simulation study. Comput Stat Data Anal 42(1):139–148

  19. Mashreghi Z, Léger C, Haziza D (2014) Bootstrap methods for imputed data from regression, ratio and hot-deck imputation. Can J Stat 42(1):142–167

  20. Ning J, Cheng P (2012) A comparison study of nonparametric imputation methods. Stat Comput 22:273–285

  21. Niyonsenga T (1994) Nonparametric estimation of response probabilities in sampling theory. Surv Methodol 20(2):177–184

  22. Niyonsenga T (1997) Response probability estimation. J Stat Plan Inference 59:111–126

  23. Qin J, Leung D, Shao J (2002) Estimation with survey data under nonignorable nonresponse or informative sampling. J Am Stat Assoc 97(457):193–200

  24. Rubin DB (1976) Inference and missing data. Biometrika 63:581–592

  25. Särndal C-E (1992) Methods for estimating the precision of survey estimates when imputation has been used. Surv Methodol 18(2):241–252

  26. Shao J, Sitter RR (1996) Bootstrap for imputed survey data. J Am Stat Assoc 91:1278–1288

  27. Sitter RR (1992a) Comparing three bootstrap methods for survey data. Can J Stat 20:135–154

  28. Sitter RR (1992b) A resampling procedure for complex survey data. J Am Stat Assoc 87(416):755–765

  29. Stekhoven DJ (2013) missForest: nonparametric missing value imputation using random forest. R package version 1.4

  30. Stekhoven D, Buehlmann P (2012) Missforest—nonparametric missing value imputation for mixed-type data. Bioinformatics 28(1):112–118

  31. Stone CJ (1985) Additive regression and other nonparametric models. Ann Stat 13(2):689–705

  32. Wang Y (2011) Smoothing splines: methods and applications. Chapman & Hall, Boca Raton

  33. Wood S (2003) Thin plate regression splines. J R Stat Soc Ser B (Stat Methodol) 65(1):95–114

  34. Wood S (2008) Fast stable direct fitting and smoothness selection for generalized additive models. J R Stat Soc Ser B (Stat Methodol) 70(3):495–518

  35. Wood S (2014) mgcv: mixed GAM computation vehicle with GCV/AIC/REML smoothness estimation. R package version 1.7-28. http://CRAN.R-project.org/package=mgcv

  36. Zhang G, Christensen F, Zheng W (2013) Nonparametric regression estimators in complex surveys. J Stat Comput Simul 85(5):1026–1034

Acknowledgements

The authors thank Yves Tillé for his constructive suggestions. This research was supported by the Swiss National Science Foundation and the Natural Sciences and Engineering Research Council of Canada.

Author information

Corresponding author

Correspondence to Caren Hasler.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Caren Hasler’s address when the research was conducted is “Institute of Statistics, University of Neuchâtel, Av. de Bellevaux 51, 2000 Neuchâtel, Switzerland”.

Appendix: Bootstrap variance when a randomization is applied

We repeated the simulations for the bootstrap variance of Sect. 6.1 with sampling fraction \(f = 0.3\) in order to study the impact of randomization on the quality of the variance estimates. For the bootstrap variance under SRSWOR, Procedure 1 (MMB) was applied with a sample of size 900 selected in step 1, that is \(n_h' = f \cdot n_h = 900\) with \(h=1\), and a randomization applied in step 2. Procedure 3 (extended BWO) was also applied, with a randomization in step 1 (\(k\) was non-integer). For the bootstrap variance under stratified sampling, Procedure 1 (MMB) was applied with a sample of size 187 selected in each stratum in step 1, that is \(n_h' = \lfloor f \cdot n_h \rfloor = 187\) for each stratum \(h\), where \(\lfloor \cdot \rfloor \) is the floor function, and a randomization applied in step 2. Procedure 3 (extended BWO) was again applied with a randomization in step 1. Randomization was thus applied in all four cases.
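To make the notion of randomization concrete, the sketch below rounds a non-integer quantity down or up at random so that its expectation equals the original value. This is one common unbiased rounding rule, assumed here purely for illustration: the procedures referred to above may choose the rounding probabilities differently, and the function name `randomized_round` and the value \(k = 3.4\) are hypothetical.

```python
import math
import random


def randomized_round(value, rng):
    """Round value to floor(value) or floor(value) + 1 at random.

    The upper integer is chosen with probability equal to the fractional
    part, so the expected result equals value (an assumed rounding rule).
    An integer input is returned unchanged.
    """
    low = math.floor(value)
    frac = value - low
    return low + (1 if rng.random() < frac else 0)


rng = random.Random(2020)   # fixed seed so the illustration is reproducible
k = 3.4                     # hypothetical non-integer quantity to be rounded
draws = [randomized_round(k, rng) for _ in range(10_000)]

print(sorted(set(draws)))        # the integers bracketing k
print(sum(draws) / len(draws))   # average of the draws, close to k
```

Because the rounded value changes from one bootstrap replicate to the next, this step adds variability of its own to the bootstrap variance.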

Table 5 Monte Carlo variance of the total, Monte Carlo expectation of the bootstrap variance and coverage rate associated with AM imputation for two different sampling designs and five populations

Table 5 shows the results. Under SRSWOR, whether the functional dependence between the variable of interest and the auxiliary variables is additive (populations 1 and 2) or not (populations 3, 4 and 5), the bootstrap variance is close to the variance obtained by simulation and leads to very good coverage rates (between 92% and 94%) across all five populations considered. Under stratified sampling, the bootstrap variance is greater than the variance obtained by simulation in four of the five populations considered. This difference is larger when the functional dependence between the variable of interest and the auxiliary variables is additive and strong (populations 1 and 2). We explain this phenomenon in what follows.

When a randomization is applied to round the non-integer \(k_h\) and/or \(n_h'\), as is the case here, the bootstrap variance contains two parts: the variance due to the randomization and the variance of the total estimator. When there is a strong additive functional dependence between the variable of interest and the auxiliary variables, the variance of the total estimator is small. A substantial portion of the bootstrap variance is then due to randomization, and the bootstrap variance overestimates the variance of the total. As the additive functional dependence between the variable of interest and the auxiliary variables weakens, the variance of the total estimator increases and the portion of the bootstrap variance due to randomization decreases, so the bootstrap variance gets closer to the variance of the total. When stratified sampling is applied, the portion of the variance due to randomization may be particularly large because randomization is applied within each stratum. This explains the difference between the bootstrap variance and the variance obtained by simulation under stratified sampling in Table 5. The simulations run on the real data of Sect. 6.2 confirm this explanation. In this setting, there is a moderate additive functional dependence between the variable of interest and the auxiliary variables. Stratified sampling was used and the randomization procedure was applied to round the non-integer quantities. The resulting bootstrap variance is close to the variance obtained by simulation and yields a coverage rate of 94%.
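One way to read this two-part decomposition, offered as an interpretation rather than as the authors' derivation, is through the law of total variance. Writing \(R\) for the outcome of the randomization and \(\hat{t}^*\) for the bootstrap version of the total estimator (both symbols are notation introduced here), the bootstrap variance splits as

\[
V^*(\hat{t}^*) = E_R\{ V^*(\hat{t}^* \mid R) \} + V_R\{ E^*(\hat{t}^* \mid R) \}.
\]

Under this reading, the first term plays the role of the variance of the total estimator and the second is the contribution of the randomization. If the second term stays roughly constant while the first shrinks, as under a strong additive dependence, its relative weight grows, which is consistent with the overestimation described above.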

As these results show, randomization affects the quality of the variance estimates. We refer the reader to Andreis et al. (2018) for a discussion of weight-rounding problems in resampling. We also repeated the simulations in this section with the non-integer \(k_h\) and \(n_h'\) rounded to the nearest integer instead of applying randomization. This yields very similar results.

Cite this article

Hasler, C., Craiu, R.V. Nonparametric imputation method for nonresponse in surveys. Stat Methods Appl 29, 25–48 (2020). https://doi.org/10.1007/s10260-019-00458-w

Keywords

  • Additive models
  • Data imputation
  • Sample survey
  • Smoothing spline