Abstract
Survey samplers have long used probability samples from one or more sources, in conjunction with census and administrative data, to make valid and efficient inferences on finite population parameters. This topic has recently received considerable attention in the context of data from non-probability samples, such as transaction data, web surveys and social media data. In this paper, I first provide a brief overview of probability sampling methods and then discuss some recent methods, based on models for the non-probability samples, that could lead to useful inferences from a non-probability sample by itself or when combined with a probability sample. I also explain how big data may be used as predictors in small area estimation, a topic of current interest because of the growing demand for reliable local area statistics.
Acknowledgement
This research was supported by a grant from the Natural Sciences and Engineering Research Council of Canada. I thank Jean-Francois Beaumont, Paul Biemer, Mike Brick, Wayne Fuller, Jack Gambino, Graham Kalton, Jae Kim, Frauke Kreuter, Sharon Lohr and Jean Opsomer for some useful comments and suggestions on my paper. I also thank two referees for constructive comments.
Cite this article
Rao, J.N.K. On Making Valid Inferences by Integrating Data from Surveys and Other Sources. Sankhya B 83, 242–272 (2021). https://doi.org/10.1007/s13571-020-00227-w
DOI: https://doi.org/10.1007/s13571-020-00227-w