On Making Valid Inferences by Integrating Data from Surveys and Other Sources

Sankhya B

Abstract

Survey samplers have long used probability samples from one or more sources, in conjunction with census and administrative data, to make valid and efficient inferences on finite population parameters. More recently, the topic has received considerable attention in the context of data from non-probability samples, such as transaction data, web surveys and social media data. In this paper, I first give a brief overview of probability sampling methods and then discuss some recent methods, based on models for the non-probability samples, that could lead to useful inferences from a non-probability sample by itself or in combination with a probability sample. I also explain how big data may be used as predictors in small area estimation, a topic of current interest because of the growing demand for reliable local area statistics.

Acknowledgement

This research was supported by a grant from the Natural Sciences and Engineering Research Council of Canada. I thank Jean-Francois Beaumont, Paul Biemer, Mike Brick, Wayne Fuller, Jack Gambino, Graham Kalton, Jae Kim, Frauke Kreuter, Sharon Lohr and Jean Opsomer for some useful comments and suggestions on my paper. I also thank two referees for constructive comments.

Author information

Corresponding author

Correspondence to J. N. K. Rao.

Cite this article

Rao, J.N.K. On Making Valid Inferences by Integrating Data from Surveys and Other Sources. Sankhya B 83, 242–272 (2021). https://doi.org/10.1007/s13571-020-00227-w
