On Making Valid Inferences by Integrating Data from Surveys and Other Sources

Rao, J. N. K.

doi:10.1007/s13571-020-00227-w

Published: 03 April 2020

Sankhya B, Volume 83, pages 242–272 (2021)

J. N. K. Rao¹

Abstract

Survey samplers have long used probability samples from one or more sources, in conjunction with census and administrative data, to make valid and efficient inferences on finite population parameters. This topic has recently received considerable attention in the context of data from non-probability samples such as transaction data, web surveys and social media data. In this paper, I first provide a brief overview of probability sampling methods and then discuss some recent methods, based on models for the non-probability samples, which could lead to useful inferences from a non-probability sample by itself or when combined with a probability sample. I also explain how big data may be used as predictors in small area estimation, a topic of current interest because of the growing demand for reliable local area statistics.

Acknowledgement

This research was supported by a grant from the Natural Sciences and Engineering Research Council of Canada. I thank Jean-Francois Beaumont, Paul Biemer, Mike Brick, Wayne Fuller, Jack Gambino, Graham Kalton, Jae Kim, Frauke Kreuter, Sharon Lohr and Jean Opsomer for some useful comments and suggestions on my paper. I also thank two referees for constructive comments.

Author information

Authors and Affiliations

Carleton University, Ottawa, Canada
J. N. K. Rao

Corresponding author

Correspondence to J. N. K. Rao.

About this article

Cite this article

Rao, J.N.K. On Making Valid Inferences by Integrating Data from Surveys and Other Sources. Sankhya B 83, 242–272 (2021). https://doi.org/10.1007/s13571-020-00227-w

Issue Date: May 2021
