Abstract
Objectives
Similar to researchers in other disciplines, criminologists increasingly are using online crowdsourcing and opt-in panels for sampling, because of their low cost and convenience. However, online non-probability samples’ “fitness for use” will depend on the inference type and outcome variables of interest. Many studies use these samples to analyze relationships between variables. We explain how selection bias—when selection is a collider variable—and effect heterogeneity may undermine, respectively, the internal and external validity of relational inferences from crowdsourced and opt-in samples. We then examine whether such samples yield generalizable inferences about the correlates of criminal justice attitudes specifically.
Methods
We compare multivariate regression results from five online non-probability samples drawn either from Amazon Mechanical Turk or an opt-in panel to those from the General Social Survey (GSS). The online samples include more than 4500 respondents nationally and four outcome variables measuring criminal justice attitudes. We estimate identical models for the online non-probability and GSS samples.
Results
Regression coefficients in the online samples are generally in the same direction as the GSS coefficients, especially when they are statistically significant, but they often differ considerably in magnitude; more than half (54%) fall outside the GSS’s 95% confidence interval.
Conclusions
Online non-probability samples appear useful for estimating the direction but not the magnitude of relationships between variables, at least absent effective model-based adjustments. However, adjusting only for demographics, either through weighting or statistical control, is insufficient. We recommend that researchers conduct both a provisional generalizability check and a model-specification test before using these samples to make relational inferences.
Notes
There may also be fewer errors of observation in online web surveys because of the elimination of interviewer effects, less potential for social desirability bias, and higher quality responding (Chang and Krosnick 2009; Weinberg et al. 2014; Yeager et al. 2011). However, issues such as respondent nonnaïveté may lead to unique types of errors of observation that are especially problematic for these surveys (Chandler et al. 2014).
Although selection on X does not introduce bias, it does reduce efficiency and statistical power (Berk 1983). It also has different consequences for the bivariate correlation (rYX) and regression coefficient (bYX), because both variables are outcomes for the correlation (Sackett and Yang 2000). The Pearson correlation between two variables, X and Y, is simply the geometric mean of the slopes (bYX and bXY) from regressing Y on X and then X on Y.
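The identity in this note can be checked numerically. The sketch below (illustrative only, using simulated data rather than anything from the study) verifies that the Pearson correlation equals the geometric mean of the two OLS slopes, i.e., r² = bYX × bXY:

```python
import numpy as np

# Simulated bivariate data (assumed for illustration).
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)

b_yx = np.polyfit(x, y, 1)[0]  # slope from regressing Y on X
b_xy = np.polyfit(y, x, 1)[0]  # slope from regressing X on Y
r = np.corrcoef(x, y)[0, 1]    # Pearson correlation

# |r| equals the geometric mean of the two slopes.
print(np.isclose(abs(r), np.sqrt(b_yx * b_xy)))  # True
```

Because both slopes share the covariance in their numerator, the identity holds exactly for any bivariate sample, which is why range restriction affects rYX and bYX differently.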
The reason is that typically there are more possible sources of confounded sampling than of endogenous sampling. In an online study of death penalty support, for example, all common causes of SONS and death penalty support (e.g., race, gender, political ideology) would be confounders, whereas the only potential source of endogenous sampling would be death penalty support (or a variable caused by death penalty support).
The response rate for the 2016 GSS sample is not yet available; however, GSS response rates have consistently hovered around 70% since 2000.
For more information about how the questionnaire items are administered, see Appendix Q of the General Social Survey (GSS), retrieved from http://gss.norc.org/DOCUMENTS/CODEBOOK/Q.pdf.
The analytic samples for the models estimated with the SurveyMonkey sample are much smaller than the full sample for two reasons. First, several hundred cases have item-missing data on education. SurveyMonkey measured this variable at the profile stage of panel recruitment and provided it to us; changes in the profiling process before our survey left several hundred panelists without data on this pre-recorded variable. These data appear to be missing at random with respect to both outcomes—neither outcome differs significantly between respondents with and without item-missing data on education. Second, 288 respondents answered “don’t know” to the cappun question, and 101 to the fear question; these responses are treated as missing in the analysis.
There is one small presentational difference in the law enforcement spending question asked in the MTurk17 and GSS samples. In the GSS respondents are asked, “We are faced with many problems in this country, none of which can be solved easily or inexpensively. I’m going to name some of these problems, and for each one I’d like you to tell me whether you think we're spending too much money on it, too little money, or about the right amount. First (READ ITEM A)… are we spending too much, too little, or about the right amount on (ITEM)?” Respondents are then asked to decide their spending preferences on a variety of issues. In the MTurk17 sample, it is a standalone question with the same introduction (i.e., respondents are not asked about spending on other topics).
For presentational purposes, we divided the continuous age variable by 50. This approach, suggested by one reviewer, makes it easier to see the differences across samples by widening the otherwise small confidence intervals.
To weight the GSS data we used the “WTSSALL” variable and adjusted for the geographic clustering of respondents with the “VSTRAT” and “VPSU” variables. We did this in Stata 15 using the following command: svyset [weight = wtssall], strata(vstrat) psu(vpsu) singleunit(scaled).
As Page and Shapiro (1992, p. 422) explain, “the evidence indicates that ‘house effects’ are mostly limited to one specific area: ‘don’t knows’ … Thus it is generally safe to compare identical questions across survey organizations, so long as one excludes ‘don’t knows’”.
Typically, to compare logistic regression coefficients across samples, we would need to use heterogeneous choice models to control for the confounding effects of group differences in residual variation (Williams 2009). But as one reviewer pointed out, the GSS and online samples are assumed to represent the same population, and thus we should not expect differences in residual variation across the samples absent selection bias.
In addition to the figures, tables comparing the weighted GSS and unweighted online estimates can be found in the online supplementary materials.
If we reverse the comparison, and focus instead on the confidence intervals in the online samples, we find that 22 out of 56 (or 39%) excluded the GSS regression coefficient.
We also estimated supplementary models with both the GSS and online samples unweighted (see Tables C1–C7 in the online supplement). The substantive conclusions were unchanged.
References
Ansolabehere S, Rivers D (2013) Cooperative survey research. Annu Rev Polit Sci 16:307–329
Baker R, Blumberg SJ, Brick JM, Couper MP, Courtright M, Dennis JM, Dillman D, Frankel MR, Garland P, Groves RM, Kennedy C, Krosnick JA, Lavrakas PJ, Lee S, Link M, Piekarski L, Rao K, Thomas RK, Zahs D (2010) Research synthesis: AAPOR report on online panels. Public Opin Q 74:711–781
Baker R, Brick JM, Bates NA, Battaglia M, Couper MP, Dever JA, Gile KJ, Tourangeau R (2013) Summary report of the AAPOR task force on non-probability sampling. J Surv Stat Methodol 1:90–143
Berk RA (1983) An introduction to sample selection bias in sociological data. Am Sociol Rev 48:386–398
Berk RA, Ray SC (1982) Selection biases in sociological data. Soc Sci Res 11:352–398
Berryessa CM (2018) The effects of psychiatric and “biological” labels on lay sentencing and punishment decisions. J Exp Criminol 14:241–256
Bhutta C (2012) Not by the book: Facebook as a sampling frame. Sociol Methods Res 41:57–88
Blair J, Czaja RF, Blair EA (2013) Designing surveys: a guide to decisions and procedures. Sage, Thousand Oaks
Bollen KA, Biemer PP, Karr AF, Tueller S, Berzofsky ME (2016) Are survey weights needed? A review of diagnostic tests in regression analysis. Annu Rev Stat Appl 3:375–392
Brandon DM, Long JH, Loraas TM, Mueller-Phillips J, Vansant B (2013) Online instrument delivery and participant recruitment services: emerging opportunities for behavioral accounting research. Behav Res Account 26:1–23
Brown EK, Socia KM (2017) Twenty-first century punitiveness: social sources of punitive American views reconsidered. J Quant Criminol 33:935–959
Bullock JG, Green DP, Ha SE (2010) Yes, but what’s the mechanism? (don’t expect an easy answer). J Pers Soc Psychol 98:550–558
Callegaro M, Villar A, Krosnick J, Yeager D (2014) A critical review of studies investigating the quality of data obtained with online panels. In: Callegaro M, Baker R, Bethlehem J, Goritz A, Krosnick J, Lavrakas P (eds) Online panel research: a data quality perspective. Wiley, New York, pp 23–53
Callegaro M, Manfreda KL, Vehovar V (2015) Web survey methodology. Sage, Thousand Oaks
Casey LS, Chandler J, Levine AS, Proctor A, Strolovitch DZ (2017) Intertemporal differences among MTurk workers: time-based sample variations and implications for online data collection. SAGE Open 7:1–15
Chandler J, Shapiro D (2016) Conducting clinical research using crowdsourced convenience samples. Annu Rev Clin Psychol 12:53–81
Chandler J, Mueller P, Paolacci G (2014) Nonnaïveté among Amazon Mechanical Turk workers: consequences and solutions for behavioral researchers. Behav Res Methods 46:112–130
Chang L, Krosnick JA (2009) National surveys via RDD telephone interviewing versus the internet: comparing sample representativeness and response quality. Public Opin Q 73:641–678
Couper MP (2011) The future of modes of data collection. Public Opin Q 75:889–908
Denver M, Pickett JT, Bushway SD (2017) Criminal records and employment: a survey of experiences and attitudes in the United States. Justice Q 35:584–613
Dum CP, Socia KM, Rydberg J (2017) Public support for emergency shelter housing interventions concerning stigmatized populations. Criminol Public Policy 16:835–877
DuMouchel WH, Duncan GJ (1983) Using sample survey weights in multiple regression analyses of stratified samples. J Am Stat Assoc 78:535–543
Elliott MR, Valliant R (2017) Inference for nonprobability samples. Stat Sci 32:249–264
Elwert F, Winship C (2014) Endogenous selection bias: the problem of conditioning on a collider variable. Annu Rev Sociol 40:31–53
Enns PK, Ramirez M (2018) Privatizing punishment: testing theories of public support for private prison and immigration detention facilities. Criminology 56:546–573
ESOMAR 28: SurveyMonkey Audience (2013) European Society for Opinion and Marketing Research, Amsterdam. https://www.esomar.org/
Gelman A (2007) Struggles with survey weighting and regression modeling. Stat Sci 22:153–164
Gelman A, Carlin JB (2002) Poststratification and weighting adjustments. In: Groves RM, Dillman DA, Eltinge JL, Little RJA (eds) Survey nonresponse. Wiley, New York, pp 289–302
Gottlieb A (2017) The effect of message frames on public attitudes toward criminal justice reform for nonviolent offenses. Crime Delinq 63:636–656
Greenland S (2003) Quantifying biases in causal models: classical confounding vs collider-stratification bias. Epidemiology 14:300–306
Groves RM, Fowler FJ, Couper MP, Lepkowski J, Singer E, Tourangeau R (2009) Survey methodology, 2nd edn. Wiley, Hoboken
Holbert RL, Shah DV, Kwak N (2004) Fear, authority, and justice: crime-related TV viewing and endorsements of capital punishment and gun ownership. Journal Mass Commun Q 81:343–363
Horton JJ, Rand DG, Zeckhauser RJ (2011) The online laboratory: conducting experiments in a real labor market. Exp Econ 14:399–425
Hox JJ, De Leeuw ED, Zijlmans EA (2015) Measurement equivalence in mixed mode surveys. Front Psychol 6:1–10
Johnson D (2009) Anger about crime and support for punitive criminal justice policies. Punishm Soc 11:51–66
Johnson D, Kuhns JB (2009) Striking out: race and support for police use of force. Justice Q 26:592–623
Jones DN, Olderbak SG (2014) The associations among dark personalities and sexual tactics across different scenarios. J Interpers Violence 29:1050–1070
Keeter S, McGeeney K, Mercer A, Hatley N, Patten E, Perrin A (2015) Coverage error in internet surveys. Pew Research Center, Washington. Retrieved from https://www.pewresearch.org/methods/2015/09/22/coverage-error-in-internet-surveys/
King RD, Wheelock D (2007) Group threat and social control: race, perceptions of minorities and the desire to punish. Soc Forces 85:1255–1280
Lageson SE, McElrath S, Palmer KE (2018) Gendered public support for criminalizing “Revenge Porn”. Feminist Criminol 14:560–583
Lehmann PS, Pickett JT (2017) Experience versus expectation: economic insecurity, the Great Recession, and support for the death penalty. Justice Q 34:873–902
Levay KE, Freese J, Druckman JN (2016) The demographic and political composition of Mechanical Turk samples. SAGE Open 6:1–17
Little RJA, Rubin DB (2002) Statistical analysis with missing data. Wiley, New York
Mercer AW, Kreuter F, Keeter S, Stuart EA (2017) Theory and practice in nonprobability surveys: parallels between causal inference and survey inference. Public Opin Q 81:250–271
Mercer A, Lau A, Kennedy C (2018) For weighting online opt-in samples, what matters most?. Pew Research Center, Washington
Morgan SL, Winship C (2015) Counterfactuals and causal inference. Cambridge University Press, Cambridge
Mullinix KJ, Leeper TJ, Druckman JN, Freese J (2015) The generalizability of survey experiments. J Exp Pol Sci 2:109–138
Nicolaas G, Calderwood L, Lynn P, Roberts C (2014) Web surveys for the general population: How, why and when?. National Centre for Research Methods, Southampton. Retrieved from http://eprints.ncrm.ac.uk/3309/3/GenPopWeb.pdf
Open Science Collaboration (2015) Estimating the reproducibility of psychological science. Science 349(6251):aac4716
Page BI, Shapiro RY (1992) The rational public: fifty years of trends in Americans’ policy preferences. University of Chicago Press, Chicago
Pasek J (2016) When will nonprobability surveys mirror probability surveys? Considering types of inference and weighting strategies as criteria for correspondence. Int J Public Opin Res 28:269–291
Pasek J, Krosnick JA (2010) Measuring intent to participate and participation in the 2010 census and their correlates and trends: comparisons of RDD telephone and non–probability sample internet survey data. Statistical Research Division of the US Census Bureau, Washington. Retrieved from https://www.mod.gu.se/digitalAssets/1456/1456661_pasek-krosnick-mode-census.pdf
Peer E, Vosgerau J, Acquisti A (2014) Reputation as a sufficient condition for data quality on Amazon Mechanical Turk. Behav Res Methods 46:1023–1031
Peffley M, Hurwitz J (2007) Persuasion and resistance: race and the death penalty in America. Am J Pol Sci 51:996–1012
Peytchev A (2009) Survey breakoff. Public Opin Q 73:74–97
Peytchev A (2011) Breakoff and unit nonresponse across web surveys. J Off Stat 27:33–47
Pfeffermann D (1993) The role of sampling weights when modeling survey data. Int Stat Rev 61:317–337
Pickett JT (2016) On the social foundations for crimmigration: latino threat and support for expanded police powers. J Quant Criminol 32:103–132
Pickett JT, Mancini C, Mears DP (2013) Vulnerable victims, monstrous offenders, and unmanageable risk: explaining public opinion on the social control of sex crime. Criminology 51:729–759
Pickett JT, Cullen F, Bushway SD, Chiricos T, Alpert G (2018) The response rate test: nonresponse bias and the future of survey research in criminology and criminal justice. Criminologist 43:7–11
Rivers D (2007) Sampling for web surveys. Joint Statistical Meetings, Salt Lake City
Roche SP, Pickett JT, Gertz M (2016) The scary world of online news? Internet news exposure and public attitudes toward crime and justice. J Quant Criminol 32:215–236
Ross J, Irani L, Silberman M, Zaldivar A, Tomlinson B (2010) Who are the crowdworkers? Shifting demographics in Mechanical Turk. In: Edwards K, Rodden T (eds) Proceedings of the ACM conference on human factors in computing systems. ACM, New York
Rubin DB (1974) Estimating causal effects of treatments in randomized and nonrandomized studies. J Educ Psychol 66:688–701
Sackett PR, Yang H (2000) Correction for range restriction: an expanded typology. J Appl Psychol 85:112–118
Shadish W, Cook TD, Campbell DT (2002) Experimental and quasi-experimental designs for generalized causal inference. Houghton Mifflin, Boston
Shadish WR, Clark MH, Steiner PM (2008) Can nonrandomized experiments yield accurate answers? A randomized experiment comparing random and nonrandom assignments. J Am Stat Assoc 103:1334–1344
Sheehan KB, Pittman M (2016) Amazon’s Mechanical Turk for academics: The HIT handbook for social science research. Melvin and Leigh, Irvine
Silver JR, Pickett JT (2015) Toward a better understanding of politicized policing attitudes: conflicted conservatism and support for police use of force. Criminology 53:650–676
Silver JR, Silver E (2017) Why are conservatives more punitive than liberals? A moral foundations approach. Law Human Behav 41:258–272
Simmons AD (2017) Cultivating support for punitive criminal justice policies: news sectors and the moderating effects of audience characteristics. Soc Forces 96:299–328
Simmons AD, Bobo LD (2015) Can non-full-probability internet surveys yield useful data? A comparison with full-probability face-to-face surveys in the domain of race and social inequality attitudes. Sociol Methodol 45:357–387
Solon G, Haider SJ, Wooldridge JM (2015) What are we weighting for? J Hum Resour 50:301–316
Stewart N, Chandler J, Paolacci G (2017) Crowdsourcing samples in cognitive science. Trends Cogn Sci 21:736–748
Tourangeau R, Yan T (2007) Sensitive questions in surveys. Psychol Bull 133:859–883
Tourangeau R, Conrad FG, Couper MP (2013) The science of web surveys. Oxford University Press, Oxford
Unnever JD, Cullen FT (2010) The social sources of Americans’ punitiveness: a test of three competing models. Criminology 48:99–129
Unnever JD, Cullen FT, Jonson CL (2008) Race, racism, and support for capital punishment. Crime Justice 37:45–96
Valliant R, Dever JA (2011) Estimating propensity adjustments for volunteer web surveys. Sociol Methods Res 40:105–137
Vaughan TJ, Holleran LB, Silver J (2019) Applying moral foundations theory to the explanation of capital jurors’ sentencing decisions. Justice Q. https://doi.org/10.1080/07418825.2018.1537400
Wang W, Rothschild D, Goel S, Gelman A (2015) Forecasting elections with non-representative polls. Int J Forecast 31:980–991
Weinberg JD, Freese J, McElhattan D (2014) Comparing data characteristics and results of an online factorial survey between a population-based and a crowdsource-recruited sample. Sociol Sci 1:292–310
Williams R (2009) Using heterogeneous choice models to compare logit and probit coefficients across groups. Sociol Methods Res 37:531–559
Winship C, Radbill L (1994) Sampling weights and regression analysis. Sociol Methods Res 23:230–257
Yeager DS, Krosnick JA, Chang L, Javitz HS, Levendusky MS, Simpser A, Wang R (2011) Comparing the accuracy of RDD telephone surveys and internet surveys conducted with probability and non-probability samples. Public Opin Q 75:709–747
Zhou H, Fishbach A (2016) The pitfall of experimenting on the web: how unattended selective attrition leads to surprising (yet false) research conclusions. J Pers Soc Psychol 111:493–504
Acknowledgements
The authors thank Jasmine Silver, Sean Roche, Luzi Shi, Megan Denver, and Shawn Bushway for their help collecting data.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Cite this article
Thompson, A.J., Pickett, J.T. Are Relational Inferences from Crowdsourced and Opt-in Samples Generalizable? Comparing Criminal Justice Attitudes in the GSS and Five Online Samples. J Quant Criminol 36, 907–932 (2020). https://doi.org/10.1007/s10940-019-09436-7