Introduction

Over the last several decades, much has been written about the impact of criminal justice innovation and reform. One of the most recent contributions, Megan Stevenson’s (2023) “Cause, Effect, and the Structure of the Social World,” revisits well-documented issues, including the limited impact of many of these efforts, and the challenges of replicating the results of randomized controlled trials (RCTs). She asserts that her article is “built around a central empirical claim: most reforms and interventions in the criminal legal space are shown to have little lasting impact when evaluated with gold standard methods of causal inference” (p. 2023).Footnote 1 She then goes on to say this reveals the structure of our social world, a structure which is resistant to the type of interventions amenable to evaluation by RCTs. Believing otherwise, she says, amounts to embracing a myth.

Most new claims about the limited impact of RCTs do not generate further discussion among researchers, many of whom are engaged in efforts to address this issue. However, Stevenson’s article generated interest across a wide audience, presumably because she used that claim to assert that the experimental work of her colleagues is predicated on an unacknowledged myth, that their most cherished methods inherently limit the prospects of meaningful social change, and that their careers in experimental research are destined to accomplish little of lasting significance. Some readers may also have taken at face value Stevenson’s claim that these concerns have gone unacknowledged, giving the impression her work breaks new ground. Since publication, her article has generated hundreds of retweets on X (Stevenson, 2024, January 2), hundreds of thousands of views, a series of discussions in scholarly venues, and proclamations that it should serve as a cornerstone of syllabi about criminal justice reform. Motivated by Stevenson’s article, Vital City, a journal of data-driven urban policy, commissioned a series of essays that weighed in on the role and limits of empirical evidence in social change (Glazer et al., 2024).

Here, we examine the methods Stevenson used to derive her central claims. We will argue that, because her article appeared in a law journal, her methods were not subject to the methodological scrutiny normally accorded to analyses that pool empirical evidence to make a global claim about something as profound as the structure of our social world. Under more thorough examination, we believe her methods violate prohibitions on pooling heterogeneous studies for the purposes of meta-analysis, an error that precludes her from drawing global conclusions about criminal justice RCTs. Since Stevenson derives her conclusion about the structure of the social world from that global claim, the method’s failure precludes this second claim as well. As a result, Stevenson’s conclusions speak beyond the data.

We endeavor to do more than challenge the claims of one paper. Stevenson has done the service of revealing gaps in people’s knowledge about both ongoing efforts to ensure experimental research better scales up and translates to other settings, and the real hazards of ceding time and resources to less rigorous ways of pursuing innovation and reform. Implementation science, a comparatively nascent field designed to assist in such scaling and successful translation, is thriving in healthcare and medicine and making inroads into criminal justice, a field that would be well-served by speeding up its use (del Pozo et al., 2024). At the same time, we will argue that sweeping alternatives to experimentally-driven change rely on fast jumps toward idealistic endstates, are often pursued as articles of faith, and are impossible to reliably achieve unless we carefully unpack their constituent steps. This brings us back to the need for rigorous research. In this way, Stevenson lays the groundwork for aligning criminal justice research methods with the progress we are witnessing in public health, medicine, and other fields that share the goal of creating healthy, resilient, and just communities through rigorous science.

RCTs, their limits, and the structure of our world

In “Cause, Effect, and the Structure of the Social World,” Stevenson (2023) begins with what she calls “big claims” (p. 2043) about RCTs in the broadly construed field of criminal justice. She suggests that the interventions amenable to rigorous evaluation by RCTs are by their nature too limited to generate high-impact change, and that ultimately, such changes cannot be “engineered,” since the world is too complex and yet too stable to be susceptible to such engineering. She argues that reforms and interventions in the field of criminal justice demonstrate little progress, and that this lack of meaningful impact reveals the structure of our social world. She then concludes that the belief that social change can arise from the types of interventions testable by RCTs is contingent upon believing a myth about the world we live in (pp. 2038, 2040, et alibi).

The implication of this argument is that we should reject or seriously curtail the criminal justice RCT as a “gold standard” of research. This is a bold claim, given its implications for the focus and legitimacy of the National Institute of Justice and private research philanthropies such as Arnold Ventures, which fund RCT research with the stated goals of “delivering precise, reliable processes capable of generating consistent, repeatable outcomes,” (NIJ) and “correcting system failures through evidence-based solutions… to create change that outlasts our finding” (Arnold) (p. 2038).Footnote 2 The alternatives Stevenson points toward include the more sweeping systemic changes of Marxism, embracing the indeterminism of Hayek, or accepting other, unspecified approaches (p. 2044). Stevenson claims her conclusion is not simply a matter of informed opinion, but meets an evidentiary standard. She states, “this is an evidence-based Article, in that I build my entire argument around evidence derived from RCTs” (p. 2041, emphasis added), adding “my claim is that these studies teach us something broader about the structure of the social world. This is an inductive argument.”Footnote 3 Figure 1 provides its formal expression.

Fig. 1

The formal argument of "Cause, Effect, and the Structure of the Social World" (Stevenson, 2023)

In this argument, the successful induction of the second assertion depends entirely on a finding that criminal justice RCTs do not work in a global sense; the rest of the conclusions then follow. Because Stevenson describes her claim as empirical, the strength of the argument rests on the analytical power of the first conclusion: it must be strong enough for a person to globally assert that (1) RCTs in criminal justice do not contribute to substantial social change in some broad, aggregate way, and (2) this reveals something conclusive about the structure of the world. It is also critical to note the explicit directionality of this reasoning: the structure of the world is revealed by empirical findings about RCTs, and not the other way around. In other words, if an analysis of RCTs cannot yield Stevenson’s conclusions about their ineffectiveness, then her thesis about our social world remains unproven and cannot be used to explain, in the other direction, why RCTs do not work.Footnote 4

Evaluating Stevenson’s claims

We can first observe that Stevenson presumes RCTs are failures when they demonstrate negative results. Many would contest this presumption. The lessons McCord (2003) and others drew from the famous null findings of the Cambridge-Somerville Youth Study greatly improved the design of the Montreal longitudinal-experimental study (Tremblay et al., 1992) and made the case for ceasing certain deleterious and wasteful approaches to delinquency prevention. If they prevent the misallocation of resources, prevent communities from pinning their hopes on something that will not deliver, identify the unforeseen harms of an intervention, or determine how different interventions with shared goals compare, then RCTs with null results should be deemed successful.

There are other reasons to question the conclusions Stevenson reaches about RCTs, such as her use of a highly incomplete convenience sample, a focus on straw-man arguments about “engineers” and people's belief in the immense cascading effects of their favored interventions, and a categorical disregard of the emerging field of implementation science. Some of these issues are considered below. First, however, we will examine whether Stevenson’s methods have the scientific rigor necessary to make her key inductions. We will argue they violate the methodological requirements of meta-analyses by pooling highly heterogeneous studies for the purpose of generalizing about them as a group, thereby precluding validity. This is the critical point upon which the empirical argument of our paper rests.

Narrative reviews and meta-analyses: a methodological primer

First, we need to explain why we are treating Stevenson’s argument as a meta-analysis. One of the challenges of assessing her paper is that she makes a piercing empirical claim about the structure of our world without clearly laying out an approach that can power it. Regarding methods, she says:

My strategy here is threefold. First, I take a wide lens, and discuss findings from a broad survey study of RCTs in a variety of criminal justice topics. Second, I zoom in on several of the most prominent and influential studies of the last few decades, studies in which the effects were so promising that multiple replication studies were attempted. Third, I move through a variety of popular, highly-studied interventions in criminal justice and discuss the evidence associated with each... This three-part strategy certainly falls short of definitive proof; those who arrive skeptical of my claim may not walk away fully convinced. Nonetheless, I hope it is eye opening. (p. 2020)

This renders her approach a type of narrative review.Footnote 5 However, her strategy as written cannot plausibly reveal the structure of our social world or render a given conception of it a myth. Canvassing a selection of individual RCTs piecemeal cannot reveal how the world must be structured to have prevented each from having a meaningful effect. The only induction this method supports is that future RCTs will probably also fail, for any of many possible reasons, and Stevenson’s analytic strategy cannot tell us which reasons hold. As an example: observing that a person never responds to you despite hundreds of calls, emails, and texts allows you to confidently presume they will not respond the next time you try, but it does not reveal why they are not responding. Likewise, an analysis suggesting that individual RCTs will keep failing because they very often have failed does not provide evidence that one possible reason (e.g., a particular structure of our social world) holds decisive explanatory power while other plausible reasons do not. This problem alone would suggest Stevenson’s conclusions about the world are unsupported by her methods.

Our response intends, however, to construe her strategy as one that could plausibly work. It could, in our view, if Stevenson is offering a de facto heuristic meta-analysis conveyed in narrative form,Footnote 6 allowing her to “systematically pool together all relevant research in order to clarify findings and form conclusions based on all currently available information” (Rosenthal & Schisterman, 2010). By this construal, Stevenson asserts that the pooled, or global, treatment effect of the criminal justice RCTs she selected for her study is not significantly different from zero,Footnote 7 so the thing we call a “criminal justice RCT” is ineffective at producing change. This type of global conclusion across a large and powerful body of pooled RCTs would allow her to opine about the causal structure of the world with much greater confidence. In other words, as a plausible method, we argue Stevenson uses a narrative heuristic to do the work of a formal meta-analysis. We will examine it accordingly.

In doing so, we argue that if, for reasons of analytic integrity, we could not use a formal meta-analysis to conclude that criminal justice RCTs do not have a global treatment effect meaningfully distinguishable from the null, then we cannot use a heuristic meta-analysis to draw this conclusion either. If we cannot draw this heuristic conclusion, there is insufficient reason to reject the hypothesis that criminal justice RCTs “work,” and we cannot draw conclusions about the structure of the social world. Again, keep in mind the directionality of Stevenson’s inductive claim: it requires a certain meta-analytic conclusion about criminal justice RCTs to say anything about the social world. It cannot, in reverse, proceed from an assumption about the social world to explain the history of RCT performance. The corresponding formal argument is laid out in Fig. 2.

Fig. 2

The formal argument presented here

As a technique, meta-analysis “combines or integrates the results of several independent clinical trials considered by the analyst to be ‘combinable’” (Huque, 1988, p. 29). The standards for plausible combination have been extensively studied; they require a defensible congruence of targets, methods, and outcome metrics to permit generalizable conclusions about cause and effect. As Thompson (1994) observes, “although meta-analysis is now well established as a method of reviewing evidence, an uncritical use of the technique can be very misleading” (p. 1351), and the risk is presumably more acute for a heuristic review. Our premise is that, as a matter of analytic rigor, we cannot plausibly pool the RCTs presented in Stevenson’s article for any type of meta-analysis, be it formal or heuristic. In other words, while meta-analyses can be conducted under certain narrow conditions of heterogeneity (Higgins & Thompson, 2002; Petitti, 2001), one cannot pool: (1) different types of RCTs to accumulate the power to draw a conclusion about all RCTs, (2) different interventions to draw results about interventions in general, or (3) different outcomes measured using different metrics to accumulate the power to draw a global conclusion about a class of intervention, and then use it to conclude something about the structure of the social world. The public health example below illustrates this concept.
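To make the pooling prohibition concrete, formal meta-analysis quantifies heterogeneity before pooling is allowed to proceed. The sketch below is illustrative only: the effect sizes and variances are hypothetical, not drawn from any study discussed here. It computes a fixed-effect (inverse-variance) pooled estimate, Cochran's Q, and the I² statistic of Higgins and Thompson (2002); a high I² signals that the studies are estimating different underlying effects, so a single pooled estimate is not interpretable as a common effect.

```python
# Illustrative fixed-effect meta-analysis with a heterogeneity check.
# All effect sizes and variances are hypothetical, for demonstration only.

def pooled_effect_and_i2(effects, variances):
    """Return the inverse-variance pooled estimate, Cochran's Q, and I^2 (%)."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, q, i2

# Three hypothetical trials with conflicting standardized effects:
effects = [0.10, -0.05, 0.40]
variances = [0.01, 0.02, 0.01]
pooled, q, i2 = pooled_effect_and_i2(effects, variances)
print(f"pooled = {pooled:.2f}, Q = {q:.2f}, I^2 = {i2:.1f}%")
# By a conventional reading, I^2 in the range above ~75% indicates
# heterogeneity so severe that the pooled estimate should not be
# interpreted as a single common effect.
```

Note that this safeguard only addresses statistical heterogeneity among already-combinable studies; the qualitative heterogeneity at issue in Stevenson's pooling (different designs, interventions, and outcome metrics) rules out combination before any such statistic is even computed.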

RCTs in public health as an example of heterogeneous pooling

Suppose there are three random neighborhood-level assignment trials in which municipal departments of health failed to reduce asthma in Black communities, three individual-level random assignment trials in which they failed at intervening against obesity, and three randomized stepped wedge cluster trials in which they failed at intervening against HIV. Barring trials of truly extraordinary size and scope, when considered as individual categories of intervention, these three sub-groups of results do not individually empower us to say that departments of health are inherently incapable of intervening on asthma, obesity, and HIV, nor do they permit us to say anything about the structure of the social world of Black communities. They have neither the power nor the theoretical basis to do so.

Critical for our analyses here, we also cannot combine these studies to say that the nine studies, in aggregate, show departments of health are inherently incapable of intervening on the Black community’s public health problems, or that this empirical track record allows us to draw conclusions about the world. We cannot pool heterogeneous and independently underpowered trial design sub-types to draw this type of global conclusion.

Finally, the results of heterogeneous interventions on heterogeneous targets with heterogeneous outcome metrics should not be pooled to opine on the global effects of the overarching normative project (i.e., in this case, protecting the health of Black communities). In the above example, we would be pooling the subgroups of obesity, asthma, and HIV outcomes as intervention targets (Table 1). Doing so would imply that pooled infection, weight, and lung-function metrics somehow aggregate in a way that strengthens our ability to generalize about our capacity to exert an effect on any one of them, as well as on all other health problems that are part of the project. That is epistemologically impossible, hence the prohibition on doing so in meta-analyses. Having illustrated Stevenson’s problem, we next provide a criminal justice example.

A criminal justice example of invalid heterogeneous pooling

Table 1 An example of invalid heterogeneous pooling of public health studies for meta-analysis

Assume we have a set of the following randomized trials: neighborhood-level trials of the effect of upgraded street lighting on street crime; stepped wedge cluster trials, randomized by work shift, of training to reduce police use of force as measured by suspect injuries; and individual-level randomizations of cognitive behavioral therapy to prevent probation violations among people with substance use disorders. Now suppose we would like to study the effects of linking people with substance use disorders to buprenorphine treatment rather than charging them with nonviolent misdemeanors, and we have proposed a precinct-level randomized trial to test the effect on subsequent overdose.

Under the methodology used by Stevenson, null results about street lighting, de-escalation training, and psychotherapy (measuring street crime, suspect injuries, and probation violations via neighborhood, stepped wedge, and individual randomizations, respectively) could be aggregated to prospectively conclude that a precinct-level intervention linking people to addiction medications will not reduce overdose, because of what the null results of prior trials reveal about our social world (Table 2). Again, such pooling is invalid whether done formally or heuristically, and it does not permit prospective meta-analytic predictions about the study design or effects of the proposed buprenorphine study.

The irrelevance of time frames in meta-analyses

Table 2 An example of invalid heterogeneous pooling of criminal justice studies for meta-analysis

As a final note on methods, Stevenson frames her analysis as “50-Plus Years of RCT Evidence.” The subtitle is misleading. Timespans have no bearing on the results of meta-analyses, formal or heuristic, but the framing may incorrectly imply that enough time has passed for there to have been enough studies to draw a meaningful meta-analytic conclusion about them. In the public health example above, it makes no difference whether the nine studies were conducted all at once across the nation or over the course of decades. A paucity of related studies over 100 years, which could be presented as “A Century of RCT Evidence,” may provide the same level of analytic insight as the same number of studies conducted over the course of a decade, and less insight than a high volume of studies conducted over the course of two years. Such a title may therefore bias a reader unacquainted with the relevant methods. Moreover, despite this 50-year span, Stevenson overlooked several trials that found replicable, significant results, a concern that will not be fully addressed here.Footnote 8

The vast heterogeneity of criminal justice research

As argued above, the studies Stevenson uses to generate her conclusion cannot be pooled with the meta-analytic rigor reflected in widely accepted protocols such as those of the Campbell Collaboration (Schuerman et al., 2002). The principal source of the RCTs in Stevenson’s study, a review by Farrington and Welsh (2006), takes up 122 RCTs over 50 years in five broad criminal justice fields.Footnote 9 She then supplements these studies with a selection of others. While each category of study is too small to support confident global conclusions, the body of studies is far too heterogeneous across all dimensions to permit pooling them to power generalizable results (Table 3). Even a more modest pooling of heterogeneous results, in a recent meta-analysis evaluating one narrowly defined type of intervention, resulted in the retraction of published findings (May et al., 2018). As a result, there is insufficient evidence to support the conclusions about the structure of the social world drawn in Stevenson’s article.

Table 3 Heterogeneity of studies used for heuristic meta-analysis in Stevenson (2023), as compiled from Farrington and Welsh (2006)

Although the temporal span of a set of studies has no bearing on a meta-analysis of their overall effects, it does allow us to comment on the pace and scale of the research projects they constitute. Farrington and Welsh (2006) cataloged 122 RCTs over 50 years in five broad criminal justice fields; if each trial lasted about two years, this amounts to an average of about one active RCT per field, per year, in the midst of a gargantuan and sprawling criminal justice system. It suggests that RCTs in criminal justice settings have been considerably underutilized, perhaps due to the limited funding opportunities for RCT research at the federal level (National Research Council, 2010).

Discussion

When a heuristic is used to make an empirical claim, the safeguards that promote and protect the validity of meta-analytic methods are evaded. These safeguards concern the inclusion criteria for identifying the universe of criminal justice RCTs, the criteria for excluding irrelevant or inappropriate ones, the steps that will be taken to guard against selection and analytical biases, and the justification for any exception that would permit the pooling of highly heterogeneous studies and results. Review of a protocol that pools every included RCT together to make a claim about the power of RCTs, regardless of their methods or outcome metrics, would have revealed fundamental flaws before the analysis was performed. However, protocol reviews are not typically undertaken by the type of legal journal in which Stevenson’s article was published.

It may be the case that RCTs and other methods of strong causal inference have proven insufficient for improving our social world, especially in terms of criminal justice. That does not mean they are not necessary, or at least helpful (Sampson, 2010). When there is enough clinical equipoise (Freedman, 2017) to wonder whether one intervention is more or less effective than another, there are few better ways to answer that question than an RCT.Footnote 10 To assume that we already know what the comparative effects of different criminal justice interventions will be is, at our stage of human understanding, a view motivated by hubris or ideology, especially given that criminal justice research consists of disciplines widely regarded as immature sciences (Hibbert, 2016). With this in mind, we spend the remainder of our discussion surveying ways we might improve the quality of criminal justice research. We begin by demonstrating that, contrary to Stevenson’s assertions, the challenges of replicability and external validity facing scientists have been widely acknowledged for some time and are hardly limited to criminal justice research.

The challenges of replication are widely known and acknowledged

Stevenson asks, “why aren’t my empirical claims more broadly known?” (p. 2046), and “shouldn’t the academics and policymakers working in this space know better?” Publishing in a law journal, Stevenson takes the time to walk nonspecialists through hypothesis testing and causal inference. More relevant scientific journals, however, assume this knowledge among their readership, and focus on the observation that there is a replication crisis in several fields of science, including many that conduct research in criminal justice settings. These include psychology, information science, business, finance, epidemiology, medicine, and public health (Anvari & Lakens, 2018; Coiera et al., 2018; Dreber & Johannesson, 2019; Hicks, 2023; Jensen et al., 2023; Lash et al., 2018; Oberauer & Lewandowsky, 2019; Pagell, 2021; Ryan & Tipu, 2022; Tackett et al., 2019). The same fields have also conceded that the quest for statistical significance born of publication biases pressures researchers to p-hack (Wooditch et al., 2020), or to HARK, i.e., “hypothesize after the results are known.” The latter term, coined by Kerr (1998), has since been cited in research over 2,200 times, 1,180 of them since 2020. This is why protocols for systematic reviews and meta-analyses are required to explain how they will account for such biases, and why clinical trials must pre-register their protocols; both requirements are examples of how science has responded to these problems.

It is also well known that RCTs often fail to produce significant results, that most produce small effects, that their external validity is limited, and that they have a track record of failing under replication, perhaps an artifact of the very nature of null hypothesis significance testing (Lash, 2017).Footnote 11 Table 4 presents the prevalence of articles containing relevant search terms as indexed by PubMed, the National Library of Medicine’s repository of published health and medical research. The repository indexes all papers published in 5,600 medical and life science journals, as well as every peer-reviewed paper by researchers whose efforts were funded at least in part by the National Institutes of Health. The second part of the table expands the search to Google Scholar, which is more inclusive. The scientific community is fully conversant with the framing theses underlying Stevenson’s argument.

The instructive relevance of medicine and public health

Table 4 Prevalence of publications referencing concerns about research quality and replicability

Readers may be tempted to continue implementing RCTs in criminal justice, relieved that Stevenson’s threat to their work seems to have been neutralized. We believe that would be a mistake. Even if there is no defensible empirical claim to support Stevenson’s conclusions, there is a broadly philosophical one that concerns norms, utility, and the distribution of scarce resources. It is reminiscent of an observation made by Kirsch (2016):

Philosophers are people who, for some reason—Plato called it the sense of wonder—feel compelled to make the obvious strange. When they try to communicate that basic, pervasive strangeness or wonder to other people, they usually find that the other people don’t like it. Sometimes, as with Socrates, they like it so little that they put the philosopher to death. More often, however, they just ignore him.

Readers should not ignore Stevenson’s article because they do not like it and have the privilege of not needing to pay attention. That a scholar has garnered considerable attention for the opinion that RCTs cannot work in criminal justice settings should be a call to consider why it was even facially plausible for Stevenson to make that claim. The answer may be because criminal justice research is still emerging as a true science.

The fact is, RCTs that intervene on human behavior have demonstrated effectiveness. We need only look to medicine and health to see their power, two fields Stevenson brackets off as possibly not susceptible to the same critiques she makes of criminal justice. “I want to reiterate that this is a claim about the nature of the social world and does not extend to physics or biology,” Stevenson says. “Medical research, for instance, has clearly shown that limited scope interventions (e.g., drugs or vaccines) can have large and widely replicable effects. Fields such as public health, which straddle medical and social sciences, may be exempt for similar reasons” (p. 2033). In other words, because these fields test the effects of drugs, surgical or other clinical interventions, or the preventive power of vaccines, they are deemed irrelevant to the conversation: they concern petri dishes or mechanistic effects rather than our complex, messy, and resilient social world.Footnote 12

It is true, medicine has what we call basic science at its core: the study of how bodies biologically respond to vaccines, medications, and other interventions and treatments. As the Association of American Medical Colleges (n.d.) states, basic science

…focuses on determining the causal mechanisms behind the functioning of the human body in health and illness, and utilizes hypothesis-driven experimental designs that can be specifically tested and revised. More recently, “systems biology” has focused on understanding how complex systems arise from elemental processes. Once these fundamental principles of the biologic processes are understood, these discoveries can be applied or translated into direct application to patient care.

But this is only a portion of the research that takes place in medicine.Footnote 13 Note the reference to “application to patient care.” Before the first pills are ingested or the first scalpel pierces skin, a trove of medical research has evaluated the systems that brought the patient to that point and led to the decision to prescribe or operate. These decisions were not made merely by consulting a decision tree or a diagnostic manual; they were the outcomes of sprawling systems of public and private administration that operate on the basis of malleable laws, policies, and decisions. The widespread criticisms of our healthcare system, from its obscene costs and overwhelming billing procedures to its inequitable access to care, are signs that there is more to research in medicine than basic science.Footnote 14

This is even more the case for public health, defined as the pursuit of reductions in morbidity and mortality at the population level, or “at scale,” that is, the level of analysis Stevenson holds up as the ultimate target. Public health’s notable successes are not owed solely to the basic science that underlies many of its strategies and initiatives. The National Institutes of Health’s RePORTER website (www.reporter.nih.gov) catalogs the constituent institutes’ extramural research spending in its entirety, by individual study. Browsing the entries, each of which includes a “Public Health Relevance Statement,” would quickly dispel the misconception that public health is not operating waist- or chest-deep in our social world.Footnote 15

In public health, RCTs have been used to test and then scale behavioral interventions that are inherently messy and human, such as ones to help people stop smoking, disclose stigmatizing conditions such as HIV or mental illness without upending their lives, choose and adhere to pre-exposure drugs that prevent contracting HIV, grow old in healthier ways, and practice safer sex. If we want to understand how and why police officers use their discretionary powers to arrest and charge people when they could divert them to more effective alternatives, or why defendants drop out of diversion programs, then RCTs have a place in seeking answers for the same reasons they have a place in understanding why people do not use condoms or drop out of drug treatment programs.Footnote 16 Since its inception in 1998, the evidence-based policing movement has been a direct intellectual descendant of its counterpart in medicine, which likewise initially proved resistant to evidence-based change (Sherman, 1998), and remains resistant still (Greenhalgh et al., 2014). The Campbell Collaboration standards absent from Stevenson’s methods, which would have foreclosed its heterogeneous pooling, were derived from the highly successful Cochrane Collaboration used in health research (Davies & Boruch, 2001). Research has found that physicians often resist the effective implementation of evidence-based practices as a feature of their professional culture (Pope, 2003), and the time frame for implementing evidence-based medical practices is thought to average 17 years (Morris et al., 2011). Lessening this resistance and shortening that timeframe spawned an entire field concerned solely with the effective and sustainable translation of research into practice, i.e., implementation science. Nothing like it yet exists in criminal justice research. We take that up below.
Rather than a mythical engineer’s view of the world, the idea that there is something about medicine and public health that makes their more established research traditions incommensurateFootnote 17 with criminal justice research may be the central myth Stevenson unwittingly embraces in her search for a path forward.

Implementation science, external validity, and outcome metrics

Stevenson deals almost entirely with the distinction between internal and external validity but leaves unexamined the reasons why an internally valid practice may lose its causal power when transported to other settings with different demographics, systems of governance, finances, and institutional and community cultures, as was the case with perhaps the most famous trial of police responses to domestic violence (Sherman & Berk, 1984; Sherman et al., 1992). Such threats to external validity are hardly limited to criminal justice but abound in medicine and public health as well. In response, these fields have developed implementation science, which concerns “the scientific study of methods to promote the systematic uptake of research findings and other evidence-based practices into routine practice” (Eccles & Mittman, 2006). Although Stevenson makes no mention of it, how to effectively translate evidence-based findings across settings has been the subject of over 9,100 articles in the health literature since the inception of implementation science in about 2006.Footnote 18 Perhaps this omission reflects the fact that, with the exception of a group of cooperative agreements funded by the National Institute on Drug Abuse (Belenko et al., 2022; Knight et al., 2016, 2022), there has been little incorporation of implementation science into the criminal justice field (del Pozo et al., 2024). That seems to be changing, however. With an inaugural 2024 solicitation to fund implementation science demonstration projects and a multiyear Evidence to Action initiative, the present director of the National Institute of Justice has focused on partnerships between researchers and practitioners that emphasize implementation research as a critical component of effective science.

Stevenson’s analysis also does not distinguish between the efficacy and effectiveness of an intervention, a distinction that marks two different veins of intervention research: one intended to establish internal validity, and another that explores how to successfully generalize a valid model across settings with different contexts (Fagan et al., 2019). Implementation science offers a systematic approach to promoting external validity through hybrid trials (Landes et al., 2019). Such trials not only measure the effectiveness of an intervention but also test different approaches to implementation to determine which ones enhance effectiveness by ensuring fidelity or, just as importantly, indicate where adaptations are necessary due to myriad differences between settings.

Finally, there is no mention of what the distal endpoints of justice research ought to be. For example, if they are predominantly health outcomes (that is, we measure how well criminal justice systems decrease morbidity and mortality, protect health, and foster resiliency at the community level), we may see the field migrate toward health research, an adjacent field that has in some ways been more successful (del Pozo et al., 2021; Goulka et al., 2021). A more trenchant observation may therefore be that by adhering to a paradigm that identifies crime rates and recidivism as the principal outcomes, criminal justice sets up its research puzzles in ways that have historically disincentivized interdisciplinary work. As Kuhn observed over 50 years ago, such an approach can

…insulate the [research] community from those socially important problems that are not reducible to puzzle form, because they cannot be stated in terms of the conceptual and instrumental tools the paradigm supplies. Such problems can be a distraction, a lesson brilliantly illustrated by… some of the contemporary social sciences (p. 37).

The reason these omissions matter is not that they provide a means to dismiss Stevenson’s argument, but that they provide a clearer understanding of the problems that concern it, and of feasible ways to make the progress we all presumably desire. What is missing from her analysis is a thorough grasp of the philosophy of science it alludes to at the end (Kuhn, 1970), which has wrestled with all of this before and provided a language for it. Contemporary researchers are tangling with these problems as we speak. There are several ways to describe what has happened in criminal justice without rejecting the possibility that the field can make substantial progress.

As evidence of this, the idea of the meta-analytic narrative review arose from a discussion of Kuhn’s philosophy of science in a paper cited nearly 8,700 times: “Diffusion of innovations in service organizations: systematic review and recommendations” (Greenhalgh et al., 2004). It takes as its premise that even when there is an evidence base for a practice, diffusing it in ways that maintain its external validity is a challenge, and we must be deliberate and rigorous or run the risk of forgoing the effects demonstrated in prior trials. Criminal justice institutions, in their immense scope and fragmentation, are precisely the types of organizations susceptible to these concerns.

The mechanistic fallacy of effective systemic change

Critics of an overly evidence-based approach to criminal justice should also acknowledge the significant obstacles posed by alternatives. One may be an overemphasis on mechanistic end states that portray a utopia. Stevenson clears the way for this point in observing that “…RCT’s tend to focus on questions that aren’t a priori obvious… One does not need an RCT to evaluate whether providing food to the hungry fills bellies… Outcomes that are the direct, mechanical effect of a reform or intervention are generally too obvious to fall within the scope of my claim” (p. 2035). This is akin to reminders that we do not need an RCT to know we benefit from using dental floss (Holmes, 2015), or from using a parachute when we jump out of airplanes (Smith & Pell, 2003). The problem with this type of statement is that the more visionary and sweeping the proposed systemic reform is (i.e., as outlined on p. 2043), the more likely it is to devolve into a call for simple mechanical effects. The preconditions of police and prison abolition provide a suitable example. Andrea Ritchie and Mariame Kaba visualize abolition with this invitation: “Can we just imagine a world and build toward it where everyone has everything for everyone without any kind of policing, surveillance, or punishment?” (McMenamin, 2023). If we take this seriously, it is about bellies full of food, bodies with adequate health care and shelter, and people with enough money to live stable, secure lives. If a society can deliver these mechanistic end states, the thinking goes, people would not have a reason to engage in criminal behavior, and those who do would be suffering from medical and psychological conditions that a properly compassionate and supportive society can ameliorate without policing. There are few visions of reform more sweeping, and we do not need an RCT to understand how these things would benefit people.

The problem is that food only fills bellies after it has entered a person’s mouth, shelter only warms and protects a body after the person has taken refuge in it, and money only provides stability and security after it is in a bank account, wallet, or purse. Getting food into mouths, people into shelter, and money into bank accounts in widespread, consistent, and resilient ways requires a type of social organization that is bedeviled with complexity. Scientific inquiry is an important means by which we learn how to overcome the resulting challenges. Eschewing it is like saying that the reason traveling by commercial airplane is by far the safest way to travel is that we have fully mastered using air to create lift under a wing. If airline safety is the result of exquisite “engineering,” then it includes engineering (universal, federalized, tightly managed and overseen) systems that guide and direct the multitude of complex human behaviors that keep planes in the air (Stolzer et al., 2023), including pilot training, adequate numbers of air traffic controllers, the reduction of human error in airplane manufacturing and maintenance, and airplane crew satisfaction.

Calls for systemic reform unmoored from rigorous research bring to mind the cartoon by Sidney Harris, famed illustrator of science, shown in Fig. 3. Too often, sweeping social reorganization absent rigorous research and planning is just that: the presumption that a miracle will occur somewhere in the middle, be it the miracle of spontaneous social organization, Marxism, abolition, unregulated capitalism, or an ideological belief in any other type of revolutionary change that has yet to demonstrate success.

Fig. 3 “Then a miracle occurs.”

Is criminal justice research a pseudoscience?

In 1783, the philosopher Immanuel Kant wrote Prolegomena to Any Future Metaphysics That Will Be Able to Present Itself as a Science. It remains one of the most important works of Western analytic philosophy. A polemic, it was written in response to David Hume’s incisive critique of the very idea of causality, a critique that had a profound effect on Kant’s scholarship. In a statement revered among analytic philosophers, he said “David Hume… first interrupted my dogmatic slumber and gave my investigations in the field of speculative philosophy a completely different direction.” Over 240 years later, the epistemological bases and procedures of scientific inquiry have yet to be settled, even if the debates have become more arcane.

In tying together its big claims, Stevenson’s final footnote references The Structure of Scientific Revolutions by Thomas Kuhn (1970), the magisterial account of scientific knowledge that brought us the idea of the “paradigm shift.” It is meant to bolster the claim that the engineer’s view of the world is what constitutes the present research paradigm, and we should reject it: “researchers see their project as one of mapping the causal structure of the social world in order to help improve it. In other words, the engineer’s view persists because the engineer’s view forms the basis of the research paradigm” (p. 2047).

This is not what scientists mean when they speak of Kuhnian paradigms. Paradigms are the actual causal maps that describe the world, not the research methods used to derive them per se. They can run into limits when extreme cases (such as speeds approaching that of light) produce anomalous results that do not reconcile with the accepted mapmaking rules, and a new scientific theory emerges to accommodate them (such as relativity theory). That new set of concepts and its corresponding language, which subsumes but is largely incommensurable with the old one, is the paradigm shift. This distinction between uses of the term “paradigm” is important to note because Kuhn’s only applies to mature scientific fields that falter when they take on the final, stubborn cases in their “mopping up operations” (Kuhn, 1970). Behavioral science has yet to coalesce to this level of maturity. To speak in terms of its paradigm is, ironically, premature. In this way, Stevenson’s note that Kuhn shows us that “progress within science can be limited by scope of current scientific paradigm” is technically correct, but it refers to well-developed explanatory paradigms that dominate and guide their fields of inquiry. There is no such thing in behavioral science at present.

Second, Stevenson’s characterization of “paradigmatic” scientific inquiry is generic to a degree that seems to scuttle science itself. What field of earthly science is not endeavoring to understand the relationships that describe the world, and to what extent they are generalizable, with the goal of putting that knowledge to use? In laying out this commitment, Kuhn (1970) observed that in addition to certain paradigmatic beliefs about the building blocks of their field,

…there is another set of commitments without which no [person] is a scientist. The scientist must, for example, be concerned to understand the world and to extend the precision and scope with which it has been ordered. The commitment must, in turn, lead him to scrutinize, either for himself or through colleagues, some aspect of nature in great empirical detail. And, if that scrutiny displays pockets of apparent disorder, then these must challenge him to a new refinement of his observational techniques or to a further articulation of his theories (p. 42).

To reject the engineer is to reject that project. Once you reject it, you may be an activist, reformer, revolutionary, or a daring innovator, but not a scientist. At the core of every scholarly endeavor is explaining how concepts can be used to express relationships between things. There may be another implication, however. Saying we have little reliable knowledge about the causal structure of the social world in justice settings seems to imply that criminal justice is a pseudoscience: what philosophers of science call an endeavor that persists in its ways despite providing little to no generalizable knowledge and yielding no accurate predictions about its subjects (Hasson, 2021). If you follow the logic that criminal justice intervention research has produced little generalizable knowledge, and you believe it simply cannot, then pseudoscience is the label for it. Criminal justice shares many traits with other social sciences, which raises the question of whether Stevenson may be implying that sociology, psychology, public health, political science, and economics are pseudosciences as well. In this vein, it would be interesting to explore whether her call for sweeping reforms without adherence to rigorous research methods tracks theories of epistemological anarchism (Feyerabend, 2010).

Conclusion

If a formal method lacks the power and validity to demonstrate something, then there is no reason to believe an informal heuristic of that method can do so. The heuristic simply becomes an informed opinion or an intuition. Opinions and intuitions are of some importance in social inquiry; they point the way toward further study, and toward the acute need for improvements in the scope, metrics, methods, and goals of criminal justice research, but they cannot reliably reveal the structure of our social world. Other fields offer ample evidence of research programs that include RCTs within a constellation of methods and that have succeeded in modifying the same types of complex human behaviors we seek to modify in criminal justice settings. We should approach those fields with curiosity and humility to see how we can improve rigorous causal research in criminal justice. Perhaps a more pertinent (and tenable) conclusion from Stevenson’s article is that it highlights how few RCTs have been conducted for a field that puts so much at stake for people and is so important for maintaining public safety, societal order, and public health.