Abstract
Objective
The increasing availability of large administrative datasets has led to an exciting innovation in criminal justice research—using administrative data to measure experimental outcomes in lieu of costly primary data collection. We demonstrate that this type of randomized experiment can have an unfortunate consequence: the destruction of statistical power. Combining experimental data with administrative records to track outcomes of interest typically requires linking datasets without a common identifier. To minimize mistaken linkages, researchers often use stringent linking rules like “exact matching” to ensure that speculative matches do not lead to errors in an analytic dataset. We show that this seemingly conservative approach leads to underpowered experiments, leaves real treatment effects undetected, and can therefore have profound implications for entire experimental literatures.
Methods
We derive an analytic result for the consequences of linking errors on statistical power and show how the problem varies across combinations of relevant inputs, including linking error rate, outcome density and sample size.
Results
Given that few experiments are overly well-powered, even small amounts of linking error can have a considerable impact on Type II error rates. In contrast to exact matching, machine-learning-based probabilistic matching algorithms allow researchers to recover a considerable share of the statistical power lost under stringent data-linking rules.
Conclusion
Our results demonstrate that probabilistic linking substantially outperforms stringent linking criteria. Failure to implement linking procedures designed to reduce linking errors can have dire consequences for subsequent analyses and, more broadly, for the viability of this type of experimental research.
Notes
For important reviews of the limitations of experimental research especially with respect to external validity, see Berk (2005), Deaton (2010), Heckman and Smith (1995) and Sampson (2010). Also see Nagin and Sampson (2019) for a wonderfully nuanced and equally important discussion of the inherent challenges in identifying a policy-relevant counterfactual in an experimental design. For reviews of the ethical and legal considerations that are attendant in randomized experiments, we refer readers to thoughtful reviews by Boruch et al. (2000) and Weisburd (2003).
Formerly known as the Laura and John Arnold Foundation.
Statistical power is, in large part, a function of the available sample size, but it also depends on the amount of variation in the treatment and outcome variables. As Weisburd et al. (1993) noted some twenty-five years ago, and as Nelson et al. (2015) noted more recently, small-N studies are not necessarily more poorly powered empirically than larger-N studies, though, other things equal, this will be the case.
Concerns over the misuse of researcher degrees of freedom and specification searching have likewise spurred recommendations which include the use of very small \(\alpha\) levels (Benjamin et al. 2018), which increases the probability of Type II errors even more.
A second advantage of administrative data is that it avoids the inherent challenges involved in working with self-reported data (Bertrand and Mullainathan 2001), which is not to minimize the fact that there are certainly trade-offs to using administrative data relative to self-reports. For empirical evaluations of the validity of self-reported data, see Lauritsen (1999), Morris and Slocum (2010) and Roberts and Wells (2010), among others.
From 2017 to present, approximately 16% of the field experiments published in Criminology, Journal of Quantitative Criminology, Journal of Experimental Criminology, Journal of Research in Crime and Delinquency and Justice Quarterly have taken the general form of a “low-cost” RCT.
When a unique identifier is available in all of the datasets that require linking and the data are of sufficient quality, linking can, in some cases, be fairly trivial. These types of cross-system unique identifiers are frequently available in Scandinavian countries (e.g. Black et al. 2005; Dahl et al. 2014; Lyngstad and Skardhamar 2011; Wildeman and Andersen 2017) and occasionally in South America.
A large literature considers the implications that measurement error can have for econometric models but, to our knowledge, there is considerably less formal guidance with respect to how bad data linking can confound randomized experiments. It is also worth noting that when scholars need to link datasets without a common identifier there is no “ground truth” to assess the quality of the match. Likewise, there is often no prior about what the match rate should be, rendering it difficult to diagnose whether the matching procedure employed is sufficient or not.
For narrative clarity, we limit our discussion to the linking of data containing records on persons. This discussion would extend to groups or firms, but the characteristics available for linking might be different.
We acknowledge that biometric data are susceptible to misidentification as well. However, the literature generally considers linking using biometric indicators as more accurate than the text-based demographic identifiers that we discuss below (Watson et al. 2014).
Operationally, however, the end result of most probabilistic linking processes requires the imposition of a deterministic threshold to classify candidate pairs as links, non-links or, in some cases, possible links.
In the Fellegi-Sunter framework, pairs of records are compared across their identifying characteristics (name, date of birth, gender, etc.) and a comparison vector is computed which encodes whether, for example, the names in the two records are the same, the name in one record is missing, the dates of birth are the same, and so on. Extensions to this framework include string distance calculations between names (e.g. Levenshtein, Jaro-Winkler) or phonetic encodings (e.g., Soundex, Double Metaphone).
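As a minimal sketch of such a comparison vector (the field names and the edit-distance tolerance of 1 are illustrative assumptions, not the paper's configuration):

```python
def levenshtein(a, b):
    """Edit distance between two strings (classic dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def comparison_vector(rec1, rec2, fields=("first", "last", "dob")):
    """Encode field-level agreement: 1 = agree, 0 = disagree, None = missing.
    Name fields agree if within edit distance 1, to tolerate minor typos."""
    vec = []
    for f in fields:
        v1, v2 = rec1.get(f), rec2.get(f)
        if not v1 or not v2:
            vec.append(None)
        elif f in ("first", "last"):
            vec.append(1 if levenshtein(v1.lower(), v2.lower()) <= 1 else 0)
        else:
            vec.append(1 if v1 == v2 else 0)
    return vec

a = {"first": "Jon", "last": "Smith", "dob": "1990-01-02"}
b = {"first": "John", "last": "Smith", "dob": "1990-01-02"}
print(comparison_vector(a, b))  # [1, 1, 1]: "Jon"/"John" are one edit apart
```

In the Fellegi-Sunter model, the resulting vectors are what get weighted into match/non-match probabilities.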
In particular, the aim is to estimate the difference in means between treatment and control groups in a randomized controlled trial, possibly conditioned on covariates. In a related paper, Moore et al. (2014) explore the impact of matching errors on the relative risk ratio. Matching errors bias these two quantities in different ways. As we show below, false positive and false negative rates have equal impact on bias in our scenario; Moore et al. show that false positive rates are more influential on the bias in relative risk ratio estimates.
Recent specific examples from throughout the social sciences include program participation in Supplemental Nutritional Assistance Program (Courtemanche et al. 2018), employment prevalence measured through unemployment insurance wage records (Johnston and Mas 2018), injuries measured through hospitalization data (Powell and Seabury 2018), or financial health measured through bankruptcy or liens (Dobkin et al. 2018).
One scenario where this assumption would not hold is if both linking errors and treatment effects vary by one or more subgroups. In the event that treatment effects and false linkage rates both vary by subgroup, the solution is slightly more complex and is explored in "Appendix 5".
If \(TPR = FPR\), then the previous equation is undefined and the observed treatment effect will equal zero, but this situation is unlikely to occur in practice as it implies that matching is no better than random.
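A quick simulation illustrates why \(TPR = FPR\) drives the observed effect to zero: when the two rates coincide, the outcome attached to each experimental record is unrelated to that record's true outcome, which we can mimic by shuffling outcomes across records (all numbers below are hypothetical):

```python
import random

random.seed(0)
n = 20000
# True binary outcomes with a real treatment effect of 0.1 (0.3 vs 0.2).
treat = [i < n // 2 for i in range(n)]
truth = [random.random() < (0.3 if t else 0.2) for t in treat]

# TPR == FPR means the linked outcome is independent of the true record:
# model this worst case by randomly permuting the outcomes.
measured = truth[:]
random.shuffle(measured)

def mean(xs):
    return sum(xs) / len(xs)

true_effect = mean([y for y, t in zip(truth, treat) if t]) - \
              mean([y for y, t in zip(truth, treat) if not t])
obs_effect = mean([y for y, t in zip(measured, treat) if t]) - \
             mean([y for y, t in zip(measured, treat) if not t])
print(round(true_effect, 3), round(obs_effect, 3))
```

The true difference in means is near 0.1, while the difference computed from the randomly linked outcomes is near zero.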
Ioannidis et al. (2017) show that the median statistical power for a large body of studies in economics, most of them observational, is just 18%.
Here, \({\tau _h}\) refers to the candidate treatment effect for which statistical power will be computed. For smaller samples, \({{\Phi }}\) would be replaced by the cumulative distribution function for the t distribution.
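As an illustration of this type of power calculation (assuming a two-sided two-sample z-test on a difference in means; all numerical inputs below are hypothetical), the candidate effect \(\tau_h\) enters through the standardized term \(\tau_h / se\):

```python
from statistics import NormalDist

def power_two_sample(tau, sigma, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test for a
    difference in means tau, with outcome SD sigma in each arm."""
    se = sigma * (2.0 / n_per_arm) ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # P(reject) under the alternative; the opposite-tail term is tiny.
    return (1 - NormalDist().cdf(z - tau / se)) + NormalDist().cdf(-z - tau / se)

# Hypothetical inputs: effect 0.05, outcome SD 0.4, 1000 subjects per arm.
full = power_two_sample(0.05, 0.4, 1000)
# If linking error attenuates the observed effect by 20%, power drops.
attenuated = power_two_sample(0.05 * 0.8, 0.4, 1000)
print(round(full, 2), round(attenuated, 2))
```

A 20% attenuation of the effect here costs roughly twenty percentage points of power, consistent with the note that even modest linking error can matter for Type II error rates.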
The computational details of this exercise are described in "Appendix 1" to this paper.
The parameter r captures the strength of the relationship between X and \(y_{C}^{*}\). Therefore, as r increases in magnitude, statistical power increases, both in the absence and the presence of matching errors. However, the relative gain in statistical power is slightly larger when we do not condition on X. Across the parameterizations we examine, in the absence of a covariate, the average loss of power under matching errors is 8.4%. When r = 0.1, the loss of power is 8.8% when X is conditioned on. When r = 0.3, the average loss of power under matching errors is 11.9% when X is conditioned on. Hence, while a larger r is uniformly power-enhancing, it does mean that controlling for a covariate will be slightly less helpful in maximizing statistical power than it otherwise would be.
The probabilistic matching approaches we deploy in this section also take advantage of recent advances in machine learning, for two primary reasons. First, administrative datasets often span hundreds of thousands, and sometimes millions, of records. Probabilistic techniques involve computing similarity metrics across a number of identifying characteristics such as name and date of birth. Performing these calculations for every potential record pair becomes computationally prohibitive as the administrative dataset grows. Ideally, we would only perform these computations for records with a high prior probability of referring to the same person. Techniques for detecting approximate nearest neighbors (Sadosky et al. 2015) allow for fast detection of likely matches and drastically reduce the number of comparisons that need to be made in the linking process. Second, the ability of machine learning models to learn nonlinear functions, together with the practice of assessing performance on out-of-sample data, leads to predictive accuracy that outperforms linear models such as logistic regression.
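The comparison-count reduction can be seen in a simple blocking sketch (a crude stand-in for the approximate-nearest-neighbor techniques cited above; the blocking key and records are hypothetical):

```python
from collections import defaultdict

def block_key(rec):
    """A cheap blocking key: first letter of last name plus birth year.
    Real systems use phonetic codes or approximate nearest neighbors;
    this only illustrates how blocking prunes the comparison space."""
    return (rec["last"][:1].upper(), rec["dob"][:4])

def candidate_pairs(left, right):
    """Yield only pairs sharing a blocking key, instead of all pairs."""
    buckets = defaultdict(list)
    for r in right:
        buckets[block_key(r)].append(r)
    for l in left:
        for r in buckets.get(block_key(l), []):
            yield (l, r)

left = [{"last": "Smith", "dob": "1990-01-02"},
        {"last": "Jones", "dob": "1985-06-30"}]
right = [{"last": "Smyth", "dob": "1990-01-02"},
         {"last": "Smith", "dob": "1990-01-02"},
         {"last": "Brown", "dob": "1971-11-09"}]
pairs = list(candidate_pairs(left, right))
# All-pairs comparison would require 2 x 3 = 6 similarity computations;
# blocking reduces this to the pairs that share a key.
print(len(pairs))  # 2
```

With millions of administrative records, this pruning is what makes pairwise similarity computation feasible at all.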
There are situations where the two rows in the dataset will match on all relevant variables save for the unique identifier. As it is ambiguous whether these rows refer to different individuals or if there is an error in the unique identifier, we drop these records from the empirical simulation. This reduces the number of records to 2.6 million.
RecordLink works by identifying potential matches across the two data sets and asking for human labels for the pairs about which the algorithm is most uncertain. This information is then incorporated into the learning algorithm to improve predictions. A user providing labels has the option to stop at any point and have dedupe produce predictions based on the current version of the algorithm. To simulate a human providing responses, we modified dedupe's code so that ground truth labels would be provided until either of the following conditions was met: the number of labels provided was equal to 50% of the experimental data set size, or the number of labels that identified true positive links was greater than or equal to 75% of the number of true matches. For 95% of simulations, the number of labels provided was greater than 50% of the experimental data set size, and in 15% of simulations the number of labels provided was greater than 75% of the experimental data set size.
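The labeling loop can be sketched as generic uncertainty sampling (this is not dedupe's actual API; the pair names and scores below are hypothetical):

```python
def uncertainty_sampling(pairs, score, oracle, budget):
    """Generic active-learning loop: repeatedly ask the human 'oracle'
    to label the candidate pair whose match score is closest to 0.5,
    i.e. the pair the model is most uncertain about."""
    labeled = {}
    unlabeled = list(pairs)
    for _ in range(budget):
        if not unlabeled:
            break
        # Most uncertain pair: score nearest the decision boundary.
        pick = min(unlabeled, key=lambda p: abs(score(p) - 0.5))
        unlabeled.remove(pick)
        labeled[pick] = oracle(pick)
        # A real system would retrain the match classifier on `labeled`
        # here before scoring the remaining pairs again.
    return labeled

# Toy demo: fixed scores stand in for a model's match probabilities.
scores = {("a", "x"): 0.95, ("b", "y"): 0.51, ("c", "z"): 0.05}
oracle = lambda pair: scores[pair] > 0.5   # simulated human answers
out = uncertainty_sampling(scores, scores.get, oracle, budget=1)
print(out)  # the most ambiguous pair, ("b", "y"), is queried first
```

Labeling effort is concentrated where the model is least sure, which is why a modest labeling budget can still resolve most ambiguous links.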
Further details of the algorithm will appear in Jelveh and McNeill (2018).
While ground truth data for record linkage is often hard to come by, in the context of lowcost RCTs it may actually be likely that the administrative data set being linked to will meet the conditions needed to deploy a supervised approach. In particular, the conditions that are needed for supervised learning are that the administrative data set contains a unique identifier (such as an agency identifier assigned by a police department, public hospital, or school system) and that a person can appear multiple times in the data set with the same unique identifier but with discrepancies between records in identifying characteristics.
We note that in a very small number of parameterizations, the share of errors overturned is negative, indicating that exact matching leads to fewer linking errors than active learning. Common features of these parameterizations include low exact-matching error rates, low overlap between the experimental and administrative datasets, and/or larger administrative datasets.
It is worthwhile to note that the descriptive consequences of linking error cannot be resolved by increasing sample size.
References
Abramitzky R, Boustan LP, Eriksson K, Feigenbaum JJ, Pérez S (2019, May). Automated linking of historical data. Working paper 25825, National Bureau of Economic Research
Aigner DJ (1973) Regression with a binary independent variable subject to errors of observation. J Econom 1(1):49–59
Angrist JD, Pischke JS (2009) Mostly harmless econometrics. Princeton University Press, Princeton, NJ
Arceneaux K, Gerber AS, Green DP (2010) A cautionary note on the use of matching to estimate causal effects: an empirical example comparing matching estimates to an experimental benchmark. Sociol Methods Res 39(2):256–282
Bailey M, Cole C, Henderson M, Massey C (2017, November) How well do automated methods perform in historical samples? Evidence from new ground truth. Working paper 24019, National Bureau of Economic Research
Banerjee AV, Duflo E (2009) The experimental approach to development economics. Annu Rev Econ 1(1):151–178
Benjamin DJ, Berger JO, Johannesson M, Nosek BA, Wagenmakers EJ, Berk R, Bollen KA, Brembs B, Brown L, Camerer C, Cesarini D, Chambers CD, Clyde M, Cook TD, De Boeck P, Dienes Z, Dreber A, Easwaran K, Efferson C, Fehr E, Fidler F, Field AP, Forster M, George EI, Gonzalez R, Goodman S, Green E, Green DP, Greenwald AG, Hadfield JD, Hedges LV, Held L, Hua Ho T, Hoijtink H, Hruschka DJ, Imai K, Imbens G, Ioannidis JPA, Jeon M, Jones JH, Kirchler M, Laibson D, List J, Little R, Lupia A, Machery E, Maxwell SE, McCarthy M, Moore DA, Morgan SL, Munafó M, Nakagawa S, Nyhan B, Parker TH, Pericchi L, Perugini M, Rouder J, Rousseau J, Savalei V, Schönbrodt FD, Sellke T, Sinclair B, Tingley D, Van Zandt T, Vazire S, Watts DJ, Winship C, Wolpert RL, Xie Y, Young C, Zinman J, Johnson VE (2018) Redefine statistical significance. Nat Hum Behav 2(1):6–10
Berent MK, Krosnick JA, Lupia A (2016) Measuring voter registration and turnout in surveys: Do official government records yield more accurate assessments? Public Opin Quart 80(3):597–621
Berk RA (2005) Randomized experiments as the bronze standard. J Exp Criminol 1(4):417–433
Bertrand M, Mullainathan S (2001) Do people mean what they say? Implications for subjective survey data. Am Econ Rev 91(2):67–72
Bilenko M (2004) Learnable similarity functions and their applications to clustering and record linkage. In: Proceedings of the Ninth AAAI/SIGART doctoral consortium, pp 981–982
Black SE, Devereux PJ, Salvanes KG (2005) Why the apple doesn’t fall far: understanding intergenerational transmission of human capital. Am Econ Rev 95(1):437–449
Boruch RF, Victor T, Cecil JS (2000) Resolving ethical and legal problems in randomized experiments. Crime Delinq 46(3):330–353
Bowers KJ, Johnson SD (2005) Domestic burglary repeats and space-time clusters: the dimensions of risk. Eur J Criminol 2(1):67–92
Braga AA, Apel R (2016) And we wonder why criminology is sometimes considered irrelevant in real-world policy conversations. Criminol Public Policy 15(3):813–829
Braga AA, Bond BJ (2008) Policing crime and disorder hot spots: a randomized controlled trial. Criminology 46(3):577–607
Braga AA, Weisburd DL, Waring EJ, Mazerolle LG, Spelman W, Gajewski F (1999) Problem-oriented policing in violent crime places: a randomized controlled experiment. Criminology 37(3):541–580
Branas CC, South E, Kondo MC, Hohl BC, Bourgois P, Wiebe DJ, MacDonald JM (2018) Citywide cluster randomized trial to restore blighted vacant land and its effects on violence, crime, and fear. Proc Nat Acad Sci 115(12):2946–2951
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Britt CL, Weisburd D (2010) Statistical power. In: Piquero AR, Weisburd D (eds) Handbook of quantitative criminology. Springer, pp 313–332
Broidy LM, Daday JK, Crandall CS, Sklar DP, Jost PF (2006) Exploring demographic, structural, and behavioral overlap among homicide offenders and victims. Homicide Stud 10(3):155–180
Camerer CF, Dreber A, Forsell E, Ho TH, Huber J, Johannesson M, Kirchler M, Almenberg J, Altmejd A, Chan T, Heikensten E, Holzmeister F, Imai T, Isaksson S, Nave G, Pfeiffer T, Razen M, Wu H (2016) Evaluating replicability of laboratory experiments in economics. Science 351(6280):1433–1436
Campbell KM (2009) Impact of recordlinkage methodology on performance indicators and multivariate relationships. J Subst Abuse Treat 36(1):110–117
Campbell KM, Deck D, Krupski A (2008) Record linkage software in the public domain: a comparison of Link Plus, the Link King, and a ’basic’ deterministic algorithm. Health Inform J 14(1):5–15
Cesarini D, Lindqvist E, Östling R, Wallace B (2016) Wealth, health, and child development: evidence from administrative data on Swedish lottery players. Q J Econ 131(2):687–738
Chalfin A, Deza M (2017) The intergenerational effects of education on delinquency. J Econ Behav Org 159:553–571
Christen P (2012) Data matching: concepts and techniques for record linkage, entity resolution, and duplicate detection. Springer, New York
Christen P, Goiser K (2007) Quality and complexity measures for data linkage and deduplication. In: Guillet FJ, Hamilton HJ (eds) Quality measures in data mining. Springer, Berlin, pp 127–151
Clark DE, Hahn DR (1995) Comparison of probabilistic and deterministic record linkage in the development of a statewide trauma registry. Proc Annu Symp Comput Appl Med Care 1995:397–401
Cohen J (1992) Statistical power analysis. Curr Dir Psychol Sci 1(3):98–101
Comfort M, Nurse AM, McKay T, Kramer K (2011) Taking children into account: addressing the intergenerational effects of parental incarceration. Criminol Publ Policy 10(3):839–850
Cook PJ, Kang S, Braga AA, Ludwig J, O’Brien ME (2015) An experimental evaluation of a comprehensive employment-oriented prisoner reentry program. J Quant Criminol 31(3):355–382
Courtemanche CJ, Denteh A, Tchernis R (2018) Estimating the associations between snap and food insecurity, obesity, and food purchases with imperfect administrative measures of participation. Technical report, National Bureau of Economic Research
Curb JD, Ford CE, Pressel S, Palmer M, Babcock C, Hawkins CM (1985) Ascertainment of vital status through the national death index and the social security administration. Am J Epidemiol 121(5):754–766
Dahl GB, Kostøl AR, Mogstad M (2014) Family welfare cultures. Q J Econ 129(4):1711–1752
Davis RC, Taylor BG (1997) A proactive response to family violence: the results of a randomized experiment. Criminology 35(2):307–333
Deaton A (2010) Instruments, randomization, and learning about development. J Econ Lit 48(2):424–55
Dezember A, Stoltz M, Marmolejo L, Kanewske LC, Feingold KD, Wire S, Duhaime L, Maupin C (2020) The lack of experimental research in criminology: evidence from Criminology and Justice Quarterly. J Exp Criminol
DiNardo JE, Pischke JS (1997) The returns to computer use revisited: Have pencils changed the wage structure too? Q J Econ 112(1):291–303
Dobbie W, Goldin J, Yang CS (2018) The effects of pretrial detention on conviction, future crime, and employment: evidence from randomly assigned judges. Am Econ Rev 108(2):201–240
Dobbie W, Grönqvist H, Niknami S, Palme M, Priks M (2018) The intergenerational effects of parental incarceration. Technical report, National Bureau of Economic Research
Dobkin C, Finkelstein A, Kluender R, Notowidigdo MJ (2018) The economic consequences of hospital admissions. Am Econ Rev 108(2):308–352
Doleac JL, Temple C, Pritchard D, Roberts A (2020) Which prisoner reentry programs work? Replicating and extending analyses of three RCTs. Int Rev Law Econ 62:105902
Duflo E, Glennerster R, Kremer M (2007) Using randomization in development economics research: a toolkit. In: Schultz TP, Strauss JA (eds) Handbook of development economics, vol 4. North-Holland, Amsterdam, pp 3895–3962
Dunn PM (1997) James Lind (1716–94) of Edinburgh and the treatment of scurvy. Arch Dis Childhood-Fetal Neonatal Ed 76(1):F64–F65
Duwe G (2012) Evaluating the Minnesota comprehensive offender reentry plan (mcorp): results from a randomized experiment. Justice Q 29(3):347–383
Duwe G (2014) A randomized experiment of a prisoner reentry program: updated results from an evaluation of the Minnesota comprehensive offender reentry plan (MCORP). Crim Justice Stud 27(2):172–190
Enamorado T (2018) Active learning for probabilistic record linkage. Available at SSRN 3257638
Enamorado T, Fifield B, Imai K (2019) Using a probabilistic model to assist merging of largescale administrative records. Am Polit Sci Rev 113(2):353–371
Farabee D, Zhang SX, Wright B (2014) An experimental evaluation of a nationally recognized employment-focused offender reentry program. J Exp Criminol 10(3):309–322
Farrington DP (1983) Randomized experiments on crime and justice. Crime Justice 4:257–308
Farrington DP (2003) A short history of randomized experiments in criminology: a meager feast. Eval Rev 27(3):218–227
Farrington DP (2006) Key longitudinal-experimental studies in criminology. J Exp Criminol 2(2):121–141
Farrington DP, Welsh BC (2006) A half century of randomized experiments on crime and justice. Crime Justice 34(1):55–132
Feigenbaum JJ (2016) Automated census record linking: a machine learning approach. Working paper
Fellegi IP, Sunter AB (1969) A theory for record linkage. J Am Stat Assoc 64(328):1183–1210
Ferrante A (1993) Developing an offender-based tracking system: the Western Australia INOIS project. Aust NZ J Criminol 26(3):232–250
Fischbacher U, Gächter S, Fehr E (2001) Are people conditionally cooperative? Evidence from a public goods experiment. Econ Lett 71(3):397–404
Fisher RA (1936) Design of experiments. Br Med J 1(3923):554–554
Freudenberg N, Wilets I, Greene MB, Richet BE (1998) Linking women in jail to community services: factors associated with rearrest and retention of drug-using women following release from jail. J Am Med Women’s Assoc 53(2):89–93
Geerken MR (1994) Rap sheets in criminological research: considerations and caveats. J Quant Criminol 10(1):3–21
Gelber A, Isen A, Kessler JB (2016) The effects of youth employment: evidence from New York City lotteries. Q J Econ 131(1):423–460
Gerber AS, Malhotra N (2008) Do statistical reporting standards affect what is published? Publication bias in two leading political science journals. Q J Polit Sci 3(3):313–326
Gilbert DT, King G, Pettigrew S, Wilson TD (2016) Comment on “Estimating the reproducibility of psychological science”. Science 351(6277):1037
Gomatam S, Carter R, Ariet M, Mitchell G (2002) An empirical comparison of record linkage procedures. Stat Med 21(10):1485–1496
Gordon BR, Zettelmeyer F, Bhargava N, Chapsky D (2019) A comparison of approaches to advertising measurement: evidence from big field experiments at Facebook. Mark Sci 38(2):193–225
Gottfredson DC, Najaka SS, Kearley BW, Rocha CM (2006) Longterm effects of participation in the Baltimore City drug treatment court: results from an experimental study. J Exp Criminol 2(1):67–98
Hansen B (2015) Punishment and deterrence: evidence from drunk driving. Am Econ Rev 105(4):1581–1617
Hansen B, Waddell GR (2018) Legal access to alcohol and criminality. J Health Econ 57:277–289
Harron K, Dibben C, Boyd J, Hjern A, Azimaee M, Barreto ML, Goldstein H (2017) Challenges in administrative data linkage for research. Big Data Soc 4(2):1–12
Heckman JJ, Smith JA (1995) Assessing the case for social experiments. J Econ Perspect 9(2):85–110
Heller SB (2014) Summer jobs reduce violence among disadvantaged youth. Science 346(6214):1219–1223
Heller SB, Shah AK, Guryan J, Ludwig J, Mullainathan S, Pollack HA (2017) Thinking, fast and slow? Some field experiments to reduce crime and dropout in Chicago. Q J Econ 132(1):1–54
Hill SJ (2017) Changing votes or changing voters? How candidates and election context swing voters and mobilize the base. Electoral Stud 48:131–148
Hjalmarsson R, Lindquist MJ (2012) Like godfather, like son: exploring the intergenerational nature of crime. J Hum Resour 47(2):550–582
Lyngstad TH, Skardhamar T (2011) Nordic register data and their untapped potential for criminological knowledge. Crime Justice 40(1):613–645
Hyatt JM, Andersen SN (2019) On the potential of incorporating administrative register data into randomized experiments. J Exp Criminol 15(3):469–497
Imbens GW (2010) Better LATE than nothing: some comments on Deaton (2009) and Heckman and Urzua (2009). J Econ Lit 48(2):399–423
Ioannidis JP, Stanley TD, Doucouliagos H (2017) The power of bias in economics research. Econ J 127(605):F236–F265
Jelveh Z, McNeill M (2018) Downstream impact of probabilistic matching quality on prediction performance. Manuscript in progress.
Johndrow J, Lum K, Dunson D (2018) Theoretical limits of microclustering for record linkage. Biometrika 105(2):431–446
Johnston A, Mas A (2018) Potential unemployment insurance duration and labor supply: the individual and marketlevel response to a benefit cut. J Polit Econ 126(6):2480–2522
Karr AF, Taylor MT, West SL, Setoguchi S, Kou TD, Gerhard T, Horton DB (2019) Comparing record linkage software programs and algorithms using real-world data. PLoS ONE 14(9):e0221459
Keizer K, Lindenberg S, Steg L (2008) The spreading of disorder. Science 322(5908):1681–1685
Khwaja AI, Mian A (2005) Do lenders favor politically connected firms? Rent provision in an emerging financial market. Q J Econ 120(4):1371–1411
Lahiri P, Larsen MD (2005) Regression analysis with linked data. J Am Stat Assoc 100(469):222–230
LaLonde RJ (1986) Evaluating the econometric evaluations of training programs with experimental data. Am Econ Rev 76(4):604–620
Lane J (2018) Building an infrastructure to support the use of government administrative data for program performance and social science research. Ann Am Acad Polit Soc Sci 675(1):240–252
Lattimore PK, MacKenzie DL, Zajac G, Dawes D, Arsenault E, Tueller S (2016) Outcome findings from the HOPE demonstration field experiment: is swift, certain, and fair an effective supervision strategy? Criminol Public Policy 15(4):1103–1141
Laub JH, Sampson RJ (1988) Unraveling families and delinquency: a reanalysis of the Gluecks’ data. Criminology 26(3):355–380
Lauritsen JL (1999) Limitations in the use of longitudinal selfreport data: a comment. Criminology 37(3):687–694
Liberman AM, Kirk DS, Kim K (2014) Labeling effects of first juvenile arrests: secondary deviance and secondary sanctioning. Criminology 52(3):345–370
Loeffler CE (2013) Does imprisonment alter the life course? Evidence on crime and employment from a natural experiment. Criminology 51(1):137–166
Lum C, Mazerolle L (2014) History of randomized controlled experiments in criminal justice. Encycl Criminol Criminal Justice, pp 2227–2239
Lynch J (2018) Not even our own facts: criminology in the era of big data. Criminology 56(3):437–454
MacDonald JM, Morral AR, Raymond B, Eibner C (2007) The efficacy of the Rio Hondo DUI court: a 2year field experiment. Eval Rev 31(1):4–23
McCord J (2003) Cures that harm: unanticipated outcomes of crime prevention programs. Ann Am Acad Polit Soc Sci 587(1):16–30
Moher D, Dulberg CS, Wells GA (1994) Statistical power, sample size, and their reporting in randomized controlled trials. J Am Med Assoc 272(2):122–124
Moore CL, Amin J, Gidding HF, Law MG (2014) A new method for assessing how sensitivity and specificity of linkage studies affects estimation. PLoS ONE 9(7):e103690
Morris NA, Slocum LA (2010) The validity of selfreported prevalence, frequency, and timing of arrest: an evaluation of data collected using a life event calendar. J Res Crime Delinq 47(2):210–240
Mueller-Smith M (2015) The criminal and labor market impacts of incarceration. Working paper
Nagin DS, Sampson RJ (2019) The real gold standard: measuring counterfactual worlds that matter most to social science and policy. Annu Rev Criminol 2:123–145
Nelson MS, Wooditch A, Dario LM (2015) Sample size, effect size, and statistical power: a replication study of Weisburd’s paradox. J Exp Criminol 11(1):141–163
Neter J, Maynes ES, Ramanathan R (1965) The effect of mismatching on the measurement of response errors. J Am Stat Assoc 60(312):1005–1027
Newcombe HB, Kennedy JM, Axford S, James AP (1959) Automatic linkage of vital records. Science 130(3381):954–959
Orchowsky S, Iwama J (2009) Improving state criminal history records: recidivism of sex offenders released in 2001. Report, Justice Research and Statistics Association
O’Brien DT, Sampson RJ (2015) Public and private spheres of neighborhood disorder: assessing pathways to violence using largescale digital records. J Res Crime Delinq 52(4):486–510
Petrosino A, Turpin-Petrosino C, Buehler J (2003) Scared straight and other juvenile awareness programs for preventing juvenile delinquency: a systematic review of the randomized experimental evidence. Ann Am Acad Polit Soc Sci 589(1):41–62
Powell D, Seabury S (2018) Medical care spending and labor market outcomes: evidence from workers’ compensation reforms. Am Econ Rev 108(10):2995–3027
Powers E, Witmer H (1951) An experiment in the prevention of delinquency; the CambridgeSomerville youth study. Columbia University Press, New York
Price J, Buckles K, Van Leeuwen J, Riley I (2019, September). Combining family history and machine learning to link historical records. Working paper 26227, National Bureau of Economic Research
Pridemore WA, Makel MC, Plucker JA (2018) Replication in criminology and the social sciences. Annu Rev Criminol 1(1):19–38
Ridgeway G, Grogger J, Moyer RA, MacDonald JM (2019) Effect of gang injunctions on crime: a study of Los Angeles from 1988–2014. J Quant Criminol 35(3):517–541
Roberts J, Wells W (2010) The validity of criminal justice contacts reported by inmates: a comparison of selfreported data with official prison records. J Criminal Justice 38(5):1031–1037
Rossman SB, Roman JK, Zweig JM, Rempel M, Lindquist CH (2011) The multisite adult drug court evaluation: executive summary. Urban Institute
Rothstein HR (2008) Publication bias as a threat to the validity of meta-analytic results. J Exp Criminol 4(1):61–81
Rubin DB (2008) Comment: the design and analysis of gold standard randomized experiments. J Am Stat Assoc 103(484):1350–1353
Sadosky P, Shrivastava A, Price M, Steorts RC (2015) Blocking methods applied to casualty records from the Syrian conflict. arXiv preprint arXiv:1510.07714
Sampson RJ (2010) Gold standard myths: observations on the experimental turn in quantitative criminology. J Quant Criminol 26(4):489–500
Sampson RJ, Laub JH (2003) Life-course desisters? Trajectories of crime among delinquent boys followed to age 70. Criminology 41(3):555–592
Sampson RJ, Winter AS (2018) Poisoned development: assessing childhood lead exposure as a cause of crime in a birth cohort followed through adolescence. Criminology 56(2):269–301
Sariyar M, Borg A, Pommerening K (2012) Active learning strategies for the deduplication of electronic patient data using classification trees. J Biomed Inform 45(5):893–900
Scheuren F, Winkler WE (1993) Regression analysis of data files that are computer matched, part i. Surv Methodol 19(1):39–58
Scheuren F, Winkler WE (1997) Regression analysis of data files that are computer matched, part ii. Surv Methodol 23(2):157–165
Sedlmeier P, Gigerenzer G (1989) Do studies of statistical power have an effect on the power of studies? Psychol Bull 105(2):309–316
Sherman LW (2007) The power few: experimental criminology and the reduction of harm. J Exp Criminol 3(4):299–321
Sherman LW, Berk RA (1984) The Minneapolis domestic violence experiment. Police Foundation, Washington
Sherman LW, Weisburd D (1995) General deterrent effects of police patrol in crime “hot spots”: a randomized, controlled trial. Justice Q 12(4):625–648
Sherman LW, Schmidt JD, Rogan DP, Smith DA (1992) The variable effects of arrest on criminal careers: the Milwaukee domestic violence experiment. J Criminal Law Criminol 83(1):137–169
Sherman LW, Gottfredson DC, MacKenzie DL, Eck J, Reuter P, Bushway S (1997) Preventing crime: what works, what doesn’t, what’s promising: a report to the United States Congress. US Department of Justice, Office of Justice Programs, Washington, DC
Smith GJD, Bennett Moses L, Chan J (2017) The challenges of doing criminology in the big data era: towards a digital and data-driven approach. Br J Criminol 57(2):259–274
Smith JA, Todd PE (2001) Reconciling conflicting evidence on the performance of propensity-score matching methods. Am Econ Rev 91(2):112–118
Smith JA, Todd PE (2005) Does matching overcome LaLonde's critique of nonexperimental estimators? J Econom 125(1–2):305–353
Splawa-Neyman J, Dabrowska DM, Speed TP (1990 [1923]) On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Stat Sci 5(4):465–472
Stewart A, Dennison S, Allard T, Thompson C, Broidy L, Chrzanowski A (2015) Administrative data linkage as a tool for developmental and life-course criminology: the Queensland linkage project. Aust NZ J Criminol 48(3):409–428
Taxman FS, Caudy MS (2015) Risk tells us who, but not what or how. Criminol Public Policy 14(1):71–103
Tremblay RE, Vitaro F, Nagin D, Pagani L, Seguin JR (2003) The Montreal longitudinal and experimental study. In: Taking stock of delinquency. Springer, pp 205–254
Tromp M, Ravelli AC, Bonsel GJ, Hasman A, Reitsma JB (2011) Results from simulated data sets: probabilistic record linkage outperforms deterministic record linkage. J Clin Epidemiol 64(5):565–572
Van Schellen M, Apel R, Nieuwbeerta P (2012) The impact of military service on criminal offending over the life course: evidence from a Dutch conviction cohort. J Exp Criminol 8(2):135–164
Vivalt E (2017) The trajectory of specification searching and publication bias across methods and disciplines. Working paper
Watson CI, Fiumara GP, Tabassi E, Salamon WJ, Flanagan PA (2014) Fingerprint vendor technology evaluation. Report, NIST
Weisburd D (2003) Ethical practice and evaluation of interventions in crime and justice: the moral imperative for randomized trials. Eval Rev 27(3):336–354
Weisburd D (2010) Justifying the use of nonexperimental methods and disqualifying the use of randomized controlled trials: challenging folklore in evaluation research in crime and justice. J Exp Criminol 6(2):209–227
Weisburd D, Petrosino A, Mason G (1993) Design sensitivity in criminal justice experiments. Crime Justice 17:337–379
Wildeman C, Andersen SH (2017) Paternal incarceration and children’s risk of being charged by early adulthood: evidence from a Danish policy shock. Criminology 55(1):32–58
Winkler WE (2002) Methods for record linkage and Bayesian networks. U.S. Bureau of the Census, Statistical Research Division
Winkler WE (2006) Overview of record linkage and current research directions. U.S. Bureau of the Census
Yancey WE (2004) Improving EM algorithm estimates for record linkage parameters. U.S. Bureau of the Census
Zingmond DS, Ye Z, Ettner SL, Liu H (2004) Linking hospital discharge and death records: accuracy and sources of bias. J Clin Epidemiol 57(1):21–29
Acknowledgements
We extend our sincere thanks to Melissa McNeill at the University of Chicago Crime Lab for her work in developing the records matching algorithm employed in this paper. We would also like to thank Leslie Kellam, Ryang Hui Kim, Srivatsa Kothapally, Jens Ludwig, Jim Lynch, Mike Mueller-Smith, Aurelie Ouss, Greg Ridgeway, Jesse Rothstein and Greg Stoddard for helpful comments on this project. We thank Arnold Ventures for its generous support of the University of Chicago Crime Lab New York. Points of view or opinions contained within this document are those of the authors. They do not necessarily represent those of Arnold Ventures. Of course, all remaining errors are our own.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1: Computational Details
In this appendix we provide additional details for how statistical power can be computed under two possible states of the world: (1) in the absence of linking errors and (2) in the presence of linking errors. We use the derivations in this appendix to empirically demonstrate the effect of linking errors on statistical power in a hypothetical experiment in “Empirical Example” section of the paper.
For illustrative purposes, we will assume that a roster of individuals involved in a treatment program is being linked to arrest data to measure whether the program reduced the likelihood of arrest. Additionally, we will assume that a record-linkage algorithm was run on the arrest data and that there existed a unique identifier allowing us to measure when predicted links between two records represented true and false matches and when predicted non-links represented true and false non-matches.
We motivate the derivation by introducing a framework, a confusion matrix, that governs the incidence of linking errors in the arrest data. Each row of the confusion matrix represents a predicted class (predicted non-link and predicted link), while each column represents an actual class (true non-match and true match). The matrix thus allows us to understand the extent to which the algorithm successfully classifies whether two records belong to the same person.
In the following confusion matrix, \(y^*\) represents the true state of the world and y represents the observed state of the world after linking. The cells provide counts of the number of true negatives, false negatives, false positives and true positives, respectively in linking the data.
          \(y^*\) = 0    \(y^*\) = 1
y = 0     TN             FN
y = 1     FP             TP
The diagonal entries of the matrix correspond to an alignment of the true and observed states of the world—observations for which \(y^*\) = y = 0 are true negatives and observations for which \(y^*\) = y = 1 are true positives. The off-diagonal entries provide us with the number of linking errors. In particular, the (2,1) element of the matrix gives the number of false positive links: the number of times an observation that is truly \(y^*\) = 0 is mistakenly linked, so that y = 1 is observed. Similarly, the (1,2) element gives the number of false negative links, where an observation that is truly \(y^*\) = 1 fails to be linked, so that y = 0 is observed.
The matrix allows us to compute four different rates capturing the success of a given linking strategy: the true positive and true negative rate and the false positive and false negative rate.
The true positive rate (TPR) is the number of true matches that are correctly linked, TP, divided by the total number of true matches (TP + FN). Likewise, the true negative rate (TNR) is the number of true non-matches that are correctly left unlinked, TN, divided by the total number of true non-matches (TN + FP). The corresponding false negative rate (FNR) and false positive rate (FPR) are obtained by subtracting the TPR and TNR, respectively, from 1. As we show in "Derivation of Estimated Treatment Effects, Standard Errors and Statistical Power" section of the paper, estimated treatment effects will be attenuated under linking errors, and the attenuation will be proportional to 1 − FPR − FNR. So long as FPR + FNR < 1, this is strict attenuation toward zero, but if FPR + FNR exceeds 1 then the bias can flip the sign of the estimated effect.
To appreciate how this works, assume that the arrest dataset contained N = 10,000 records and that after running the matching algorithm, the following confusion matrix was generated:
          \(y^*\) = 0    \(y^*\) = 1
y = 0     TN = 3000      FN = 1000
y = 1     FP = 2000      TP = 4000
The error rates for the matching algorithm can be computed as:

FNR = \(\frac{1000}{4000 + 1000}\) = 0.20

FPR = \(\frac{2000}{3000 + 2000}\) = 0.40
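These rates follow mechanically from the four cell counts; a minimal Python sketch (function and variable names are ours):

```python
def link_error_rates(tn, fn, fp, tp):
    """Linking error rates from the four confusion-matrix counts."""
    tpr = tp / (tp + fn)  # share of true matches correctly linked
    tnr = tn / (tn + fp)  # share of true non-matches correctly left unlinked
    return {"TPR": tpr, "TNR": tnr, "FNR": 1 - tpr, "FPR": 1 - tnr}

# Counts from the N = 10,000 example above.
rates = link_error_rates(tn=3000, fn=1000, fp=2000, tp=4000)
# rates["FNR"] = 0.20, rates["FPR"] = 0.40
```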
Assume that to test the effectiveness of the treatment program, 1500 individuals were randomized, with one-half in the treatment group (p = 0.5), that the control group mean was \(\frac{1}{3}\), and that the treatment effect was \(\tau = \frac{1}{15}\).
In the true state of the world, there are 250 individuals arrested in the control group and 200 in the treatment group, reflecting the fact that \(\tau = \frac{1}{15}\). We next apply the error rates from the matching algorithm to the control and treatment groups respectively to generate the following confusion matrices:
Control group
          \(y^*\) = 0    \(y^*\) = 1
y = 0     TN = 300       FN = 50
y = 1     FP = 200       TP = 200
Treatment group
          \(y^*\) = 0    \(y^*\) = 1
y = 0     TN = 330       FN = 40
y = 1     FP = 220       TP = 160
Let \(y^*_{T=0}\) be the true number of individuals arrested in the control group and \(y_{T=0}\) be the observed number of individuals arrested in the control group. We see that \(y^*_{T=0}=50+200=250\) and \(y_{T=0}=200+200=400\).
Let \(y^*_{T=1}\) be the true number of individuals arrested in the treatment group and \(y_{T=1}\) be the observed number of individuals arrested in the treatment group. We see that \(y^*_{T=1}=40+160=200\) and \(y_{T=1}=220+160=380\).
The observed reduction in the arrest rate can be computed as \(\bar{y}_{T=0} - \bar{y}_{T=1} = \frac{400}{750} - \frac{380}{750} = \frac{2}{75}\), which is equivalent to \(\tau \times (TPR - FPR) = \frac{1}{15} \times (0.8 - 0.4) = \frac{2}{75}\).
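The attenuation above can be verified in a few lines (a sketch; names are ours):

```python
def observed_positives(n, true_pos, fnr, fpr):
    """Observed count with y = 1 after applying linking error rates."""
    tp = true_pos * (1 - fnr)   # true matches still linked
    fp = (n - true_pos) * fpr   # true non-matches mistakenly linked
    return tp + fp

n = 750  # individuals per arm
y_control = observed_positives(n, true_pos=250, fnr=0.20, fpr=0.40)  # ≈ 400
y_treat = observed_positives(n, true_pos=200, fnr=0.20, fpr=0.40)    # ≈ 380
observed_effect = y_control / n - y_treat / n  # 2/75, vs. true tau = 1/15
```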
In order to compute statistical power to detect a given potential treatment effect, we need the variance of the estimated treatment effect, which is computed according to:

\(\sigma ^2_{\hat{\tau }} = \frac{\sigma ^2}{N p (1 - p)}\)

The square root of this quantity is the standard error around the estimated treatment effect. N and p are simply the sample size and the proportion treated, but we will need to compute \(\sigma ^2\), which is the mean square error from a regression of either \(y^*\) or y on the treatment indicator D, depending on which state of the world we are in. We show how to compute \(\sigma ^2\) in the absence and presence of linking errors in "Appendix 2".
We can then compute statistical power for a two-sided test at level \(\alpha\) according to:

\(\text {Power} = \Phi \left( \frac{|\tau |}{\sigma _{\hat{\tau }}} - z_{1 - \alpha /2} \right)\)

where \(\Phi\) is the standard normal distribution function and \(z_{1-\alpha /2}\) is the corresponding critical value.
Carrying through the numerical example from our confusion table, power to detect a treatment effect of 20% (\(\tau = \frac{1}{15}\) against a control group mean of \(\frac{1}{3}\)) in these data is 81% in the true state of the world and just 72% in the state of the world with linking errors. What would have been an adequately well-powered experiment is no longer well-powered in the presence of modest linking errors.
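The true-state calculation can be reproduced with the normal-approximation power formula above (a sketch of our own implementation, taking \(\sigma ^2\) as the pooled within-group outcome variance):

```python
from statistics import NormalDist

def power(tau, sigma2, n, p, alpha=0.05):
    """Power of a two-sided test to detect a treatment effect tau."""
    se = (sigma2 / (n * p * (1 - p))) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(tau) / se - z_crit)

# True state of the world: N = 1500, half treated,
# control mean 1/3, treatment mean 4/15, so tau = 1/15.
sigma2 = 0.5 * (1 / 3) * (2 / 3) + 0.5 * (4 / 15) * (11 / 15)
pw = power(1 / 15, sigma2, n=1500, p=0.5)  # ≈ 0.81
```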
Appendix 2: Deriving Outcome Variance
In this section we show how to compute the residual sum of squares (RSS) with a binary outcome and binary treatment in order to compute \(\sigma ^2 = RSS / N\). Let \(\bar{y}_C\) equal the control group mean, \(\tau\) the treatment effect and \(D_i\) the treatment indicator:

\(RSS = \sum _{i=1}^{N} \left( y_i - \bar{y}_C - \tau D_i \right) ^2\)

We can decompose this sum into four mutually exclusive groups determined by whether an individual is in the treatment or control group, and whether their associated outcome is \(y=0\) or \(y=1\):

\(RSS = N_{C,1} (1 - \bar{y}_C)^2 + N_{C,0} \, \bar{y}_C^2 + N_{T,1} (1 - \bar{y}_C - \tau )^2 + N_{T,0} (\bar{y}_C + \tau )^2\)

where \(N_{C,1}\) and \(N_{C,0}\) (\(N_{T,1}\) and \(N_{T,0}\)) count the control (treatment) group individuals with \(y=1\) and \(y=0\), respectively.
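As a numerical check of this decomposition, a minimal sketch using the Appendix 1 example (group sizes and means are from that example; \(\tau\) enters with its sign, \(-1/15\), since arrests fall under treatment):

```python
def rss(n_c, ybar_c, n_t, tau):
    """RSS via the four-group decomposition for a binary outcome."""
    ybar_t = ybar_c + tau  # fitted mean in the treatment group
    rss_c = (n_c * ybar_c) * (1 - ybar_c) ** 2 + (n_c * (1 - ybar_c)) * ybar_c ** 2
    rss_t = (n_t * ybar_t) * (1 - ybar_t) ** 2 + (n_t * (1 - ybar_t)) * ybar_t ** 2
    return rss_c + rss_t

# 750 per arm, control mean 1/3, treatment mean 4/15.
sigma2 = rss(750, 1 / 3, 750, -1 / 15) / 1500  # ≈ 0.209
```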
Appendix 3: Maximizing RSS
We now show why Eq. 10 is maximized when the control group mean, \(\bar{y}_C\), plus the treatment effect, \(\tau\), equals 0.5. Let \(N_{T,1}\) equal the number of individuals in the treatment group with \(y=1\) and \(N_{T,0}\) equal the number of individuals in the treatment group with \(y=0\). Note that \(N_{T,1} = (\bar{y}_C + \tau ) N_T\) and \(N_{T,0} = N_T (1 - (\bar{y}_C + \tau ))\).
For a given control group mean we take derivatives with respect to \(\tau\), which means we need only consider individuals in the treatment group. Their contribution to the RSS is:

\(N_{T,1} (1 - \bar{y}_C - \tau )^2 + N_{T,0} (\bar{y}_C + \tau )^2 = N_T (1 - \bar{y}_C - \tau ) (\bar{y}_C + \tau )\)
Let \(\kappa = N_T (1 - \bar{y}_C - \tau ) (\bar{y}_C + \tau )\), then taking the derivative with respect to \(\tau\):

\(\frac{\partial \kappa }{\partial \tau } = N_T \left( 1 - 2 (\bar{y}_C + \tau ) \right)\)
Setting the previous equation to zero and solving for \(\tau\) leads to

\(\bar{y}_C + \tau = \frac{1}{2}\)
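A quick grid search confirms the claim (a sketch; \(m = \bar{y}_C + \tau\) and the arm size are illustrative):

```python
# kappa(m) = N_T * m * (1 - m), with m = ybar_c + tau, is the
# treatment-group contribution to the RSS; it peaks at m = 0.5.
def kappa(m, n_t=750):
    return n_t * m * (1 - m)

grid = [i / 1000 for i in range(1001)]
best = max(grid, key=kappa)  # 0.5
```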
Appendix 4: Proof for Power Attenuation
In this section we show that even when the standard error estimated under linking error is smaller than the standard error estimated under no error, statistical power will still be larger in the error-free scenario. Let \(\eta\) be the true positive rate, \(\omega\) the false positive rate, \(\tau ^*\) the true treatment effect, \(\sigma _{\tau ^*}\) the true standard error, \(\hat{\tau }\) the observed treatment effect, and \(\sigma _{\hat{\tau }}\) the observed standard error. Note that \(0 \le \eta \le 1\) and \(0 \le \omega \le 1\). In order to show that

\(\frac{|\hat{\tau }|}{\sigma _{\hat{\tau }}} \le \frac{|\tau ^*|}{\sigma _{\tau ^*}}\)
we use the result from Eq. 7, \(\hat{\tau } = (\eta - \omega ) \tau ^*\), to substitute for the observed treatment effect and get

\(\frac{(\eta - \omega ) |\tau ^*|}{\sigma _{\hat{\tau }}} \le \frac{|\tau ^*|}{\sigma _{\tau ^*}}\)
and then rearrange terms to get the following:

\((\eta - \omega ) \, \sigma _{\tau ^*} \le \sigma _{\hat{\tau }}\)
It is straightforward to show that this is equivalent to

\((\eta - \omega )^2 \, \sigma ^2_{\tau ^*} \le \sigma ^2_{\hat{\tau }}\)
Or that

\((\eta - \omega )^2 \, RSS^* \le \widehat{RSS}\)
where \(\widehat{RSS}\) is the residual sum of squares with linking error and \(RSS^*\) is the residual sum of squares without linking error.
In the following, \(N_{j,1}^{*}\) represents the number of observations in group \(j\) for which the true value is \(y^{*}=1\) and \(N_{j,0}^{*}\) represents the number for which the true value is \(y^{*}=0\). This allows us to write the last inequality above as
All terms in the numerator of the last inequality are greater than zero, satisfying the condition.
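The inequality can also be verified on the Appendix 1 numbers (a sketch; we use the fact that the within-arm RSS of a binary outcome with mean \(\bar{y}\) is \(n \bar{y} (1 - \bar{y})\)):

```python
def rss_binary(n, ybar):
    """Within-arm RSS for a binary outcome: n * ybar * (1 - ybar)."""
    return n * ybar * (1 - ybar)

# True arm means are 1/3 and 4/15; observed means are 400/750 and 380/750.
rss_true = rss_binary(750, 1 / 3) + rss_binary(750, 4 / 15)
rss_obs = rss_binary(750, 400 / 750) + rss_binary(750, 380 / 750)
eta, omega = 0.80, 0.40
holds = (eta - omega) ** 2 * rss_true <= rss_obs  # True: power is attenuated
```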
Appendix 5: Treatment Heterogeneity Correlated With Matching Error
In this section we demonstrate how treatment effect heterogeneity that is correlated with linking error can impact coefficient estimates. Consider a dichotomous covariate G which takes on two values, M and F. We assume that linking error rates within each group are equal across treatment and control, but that \(TPR_{M} \ne TPR_{F}\) and \(FPR_{M} \ne FPR_{F}\). Further, assume that \(\tau _{M} \ne \tau _{F}\). We rewrite Eq. 4 as:
When both \(\tau _M\) and \(\tau _F\) are in the same direction, then linking error will only attenuate the pooled treatment effect in absolute value. However, if the signs of \(\tau _M\) and \(\tau _F\) are different, then the observed treatment effect may be greater than the true average treatment effect in absolute value.
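To illustrate this logic with numbers of our own choosing (not from the paper), the pooled observed effect can be written as a share-weighted sum of group effects, each attenuated by its group-specific factor \(TPR_g - FPR_g\):

```python
# Illustrative sketch: two equal-sized groups M and F with group-specific
# effects tau_g and attenuation factors att_g = TPR_g - FPR_g.
def pooled_observed(tau_m, att_m, tau_f, att_f, share_m=0.5):
    return share_m * tau_m * att_m + (1 - share_m) * tau_f * att_f

# Same-sign effects: the observed pooled effect is attenuated toward zero.
same_sign = pooled_observed(0.10, 0.9, 0.04, 0.3)

# Opposite-sign effects: the observed pooled effect can exceed the true
# average treatment effect in absolute value.
opposite = pooled_observed(0.10, 0.9, -0.08, 0.1)
true_pooled = 0.5 * 0.10 + 0.5 * (-0.08)
```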
Cite this article
Tahamont, S., Jelveh, Z., Chalfin, A. et al. Dude, Where's My Treatment Effect? Errors in Administrative Data Linking and the Destruction of Statistical Power in Randomized Experiments. J Quant Criminol 37, 715–749 (2021). https://doi.org/10.1007/s10940-020-09461-x
Keywords
 Randomized experiments
 Administrative data
 Record linking
 Machine learning