Receiving the Joan McCord Award of the Academy of Experimental Criminology has been a great honor for me. Joan inspired much of my own work on criminological topics. She was a role model for me not only in experimental criminology but also in other areas, such as family relationships and juvenile delinquency (McCord 1991), child abuse (McCord 1983), psychopathy (McCord 2001), and resilience (McCord 1994). Joan was always curious, in science as well as in the arts, history, politics, and all aspects of human life. When we once walked through a poor and perhaps dangerous neighborhood in Brazil, she emphasized how important it is to form one's own impression. And she was always critical and precise in her evaluations. When we visited a new museum of modern art and design in Germany, she dryly commented that the building was wonderful, but the exhibition needed better objects. Her realistic and evidence-based attitude was particularly obvious in her work on the Cambridge–Somerville Youth Study (McCord 1992). Joan frankly reported that the long-term outcomes of this landmark prevention project were not positive and warned that programs can harm, in spite of the best intentions (McCord 1978, 2003). This topic and attitude led me to choose the issue of replication as the theme of my Joan McCord Lecture.

The issue of replication in science

Replication of findings is a key issue for any empirical discipline (Popper 1959). It recently became a hot topic when the Reproducibility Project in Psychology published its findings (Open Science Collaboration 2015). This large project, funded by the Laura and John Arnold Foundation, investigated whether the results of empirical studies in psychology are robust when tested in replications. The rationale for the study derived from widespread concerns in the discipline, such as selective data analysis, selective reporting, and insufficient specification of the conditions necessary to obtain a specific result. Numerous international collaborators carried out exact replications of 100 experimental and correlational studies that had been published in 2008 in three prestigious psychological journals. The results were sobering. Fewer than half of the effects in the original studies could be replicated in quantitative terms, and approximately one-quarter of effects went in the opposite direction. The mean effect size dropped from r = 0.40 in the original studies to 0.20 in the replications.

Some variation in psychological findings on a specific topic is normal due to sampling, situational, and other conditions. Although I held a chair of psychology for many years, I was always skeptical about studies that tested general hypotheses on human behavior in small student samples and artificial scenarios. However, the reproducibility issue is not only a problem of psychology. Ioannidis (2005) investigated replications of 49 highly cited studies (n > 1000) in medicine. Forty-five studies reported “effective” results; of these, 44% could be replicated (but often with smaller effects), 16% were contradicted by subsequent studies, 16% had reported stronger effects than later studies found, and 24% remained unchallenged. A survey by Baker (2016), published in Nature, received answers from 1576 scholars in the hard sciences (chemistry, biology, physics, engineering, medicine, earth and environment, and others). Fifty-two percent of respondents said that there is a significant reproducibility crisis, 38% reported a “slight crisis”, and only 7% denied a crisis. Across all disciplines, 62–87% of the respondents said that they could not replicate an experiment of somebody else, and a slightly smaller proportion (51–74%) agreed that they could not replicate their own findings. When asked how much published work in their respective field is reproducible, most answers ranged between 50% and 80%, but more than a quarter assumed lower rates.

The word “crisis” should not be used in an inflationary manner, but the reproducibility issue had been emphasized repeatedly in the social sciences before the recent alerting articles in Nature and Science. For example, Farrington (2000) had already noted that pure replications are too rare in criminological research. Flay et al. (2005), Valentine et al. (2011), Gottfredson et al. (2015), and others addressed standards of evidence that should reduce replication problems in prevention science. The strong need for more replication has been emphasized from a statistical perspective (e.g., Hunter 2001), but many social factors in research form obstacles to a culture of replication. In criminology and other disciplines, the academic world reinforces mass publication (“publish or perish”). Researchers seem to avoid replications because they want to demonstrate their own creativity. Journals require “novelty”, so that pure replications are hard to publish. Scholars assume that replications would receive less academic recognition, although this may not be the case for falsifications of prominent hypotheses. Journal impact factors are often seen as more important than the real content of a paper. Research foundations tend to promote large collaborative projects, but these make replication more difficult. Although policies encourage open data access, scholars often hesitate to offer their laboriously gathered data to others. Randomized experiments play a key role in the establishment of sound knowledge (Boruch et al. 2000), but they are not widely implemented in criminology (Farrington 2003). The Campbell Collaboration aims to provide best evidence by promoting measures of transparency in systematic reviews (Farrington and Petrosino 2001; Petrosino et al. 2001); however, in primary studies, such safeguards are still rare. In studies with many variables, selective reporting and fishing for significance are widespread dangers. In research areas with financial incentives, selective data analysis and reporting can be a serious problem (Eisner et al. 2015), and more neutral, independent evaluations are needed (Petrosino and Soydan 2005). Beyond financial issues, scientific networks may implicitly influence what is analyzed and published. Last but not least, there are time and resource issues that hinder replications of complex field experiments requiring years of follow-up. Joan McCord's Cambridge–Somerville Youth Study is an example of this, but there are many shorter criminological projects that would also be very difficult to replicate.

These and other influences on the reproducibility of research in the social sciences are not new. Rossi (1978) had already formulated the Iron Law of mean zero effects. Although he conceded that there were examples of positive results, he concluded that most social programs, when properly evaluated, are ineffective or accomplish their aims only marginally. Rossi's Iron Law focused on the mean, but the variance was likewise important, because only consistent zero effects would advance the knowledge about what does not work. Crime prevention was a typical example at that time. Large systematic reviews of correctional treatment (Lipton et al. 1975; Sechrest et al. 1979) found many methodologically weak studies and inconsistent results that contributed to the impression of “nothing works”. Later, Rossi (1987) differentiated three “metallic rules” of program evaluation. The Stainless Steel rule meant that the better designed the evaluation of a social program is, the more likely its net impact is to be zero. The Zinc rule denoted that mainly programs that are likely to fail get evaluated. And the Brass rule said that the more social programs are designed to change individuals, the more likely their net impact will be zero.

In connection with the latter rule, the present article will address the replication issue by focusing on person-oriented criminological interventions, in particular developmental prevention and offender rehabilitation/treatment. I selected these two topics because they are important policy areas and part of my own research. My discussion will mainly focus on examples of criminological research in these fields. For more general issues of statistical, internal, construct, and external validity, see Shadish et al. (2002).

Replication in developmental prevention

Since Rossi's critical view of the impact of social programs, there has been progress in the evaluation of criminological and related interventions. In developmental and life-course criminology, early prevention has expanded strongly (Farrington et al. 2016; Farrington and Welsh 2007). Numerous universal or risk-based programs have been implemented in families, kindergartens, (pre)schools, family education centers, child guidance clinics, and other services. Although most programs implemented in practice are not evidence-based (Lösel et al. 2006; Mihalic and Elliott 2015), many sound studies have been carried out and integrated in systematic reviews. An overview of meta-analyses showed that the findings varied widely (Lösel 2012a), for example, between a mean of d = 0.10 in a meta-analysis of school-based programs (Gottfredson et al. 2002) and d = 0.65 in a meta-analysis of parent trainings (Serketich and Dumas 1996), but none of the means was zero, contrary to what Rossi had suggested 30 years earlier. Most recently, Farrington et al. (2017) analyzed 50 systematic reviews of developmental and social programs that investigated outcomes of delinquency, offending, violence, aggression, or school bullying. Twenty-five reviews addressed school-based programs, eleven individually focused programs, nine family-based programs, and five general prevention programs. Mean effect sizes were available from 33 syntheses and, with the exception of four, these were all statistically significant. The mean effects varied widely, that is, from an odds ratio (OR) of 1.08 (d = 0.04) in a meta-analysis of school programs (Wilson et al. 2001) to an OR of 3.19 (d = 0.64) in a meta-analysis of child-focused programs (Robinson et al. 1999). The average effect was significant for all four types of programs and the overall effect was OR = 1.46. According to Cohen (1992), this is a small effect (d = 0.21, r = 0.10), but it is realistic insofar as most correlations between single early risk factors and later delinquency are significant but low to moderate (Hawkins et al. 1998; Lösel 2002; Murray et al. 2010). An OR of 1.46 is also practically relevant: depending on the prevalence of behavior problems in a cohort, it could indicate a reduction from 20% to 15%, that is, by one-quarter (Wilson and Lipsey 2007). Since long criminal careers of young people are very costly (Cohen and Piquero 2009; Piquero et al. 2013), even small effects of prevention programs can be cost-effective (Aos et al. 2004; Welsh and Farrington 2015).
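
For readers who wish to retrace this arithmetic, the following minimal sketch converts the odds ratio of 1.46 to Cohen's d via the common logit transformation and derives the implied prevalence in the program group. The function names are mine, the conversion method (Hasselblad-Hedges) is an assumption, and the reviews cited above may have used other formulas.

```python
import math

def odds_ratio_to_d(or_value: float) -> float:
    """Convert an odds ratio to Cohen's d via the logit method:
    d = ln(OR) * sqrt(3) / pi (Hasselblad-Hedges transformation)."""
    return math.log(or_value) * math.sqrt(3) / math.pi

def program_prevalence(control_prevalence: float, or_value: float) -> float:
    """Prevalence of behavior problems in the program group implied by an
    odds ratio, assuming OR > 1 expresses a benefit of prevention."""
    control_odds = control_prevalence / (1 - control_prevalence)
    program_odds = control_odds / or_value
    return program_odds / (1 + program_odds)

print(f"d = {odds_ratio_to_d(1.46):.2f}")                      # ~0.21
print(f"program group: {program_prevalence(0.20, 1.46):.1%}")  # ~14.6%, i.e., 20% -> ~15%
```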

In contrast to the overall encouraging results, it is less clear how far the above findings are reproducible in daily practice. Valentine et al. (2011) thoroughly analyzed various scenarios of differences in the outcomes of two (or more) implementations of a specific prevention program. They addressed issues of the evaluation design, statistical assessment strategies, investigator independence, and other aspects of inconsistent results. In practice, such factors are often combined and difficult to disentangle. In addition, broader context issues have to be taken into account. For example, the majority of studies on developmental crime prevention stem from North America and often from demonstration projects. Replication within and across different countries cannot simply be taken for granted. Although some research suggests that basic characteristics of interventions can be generalized (Knerr et al. 2013; Koehler et al. 2013), other examples cast doubt on this assumption.

For example, various randomized controlled trials (RCTs) have been carried out on indicated prevention or early treatment with Multisystemic Therapy (MST; Henggeler et al. 2009). Most evaluations came from the United States and, often, the program developers were involved. These evaluations showed desirable and sometimes extremely strong effects (e.g., Borduin et al. 2009). Some independent evaluations in other countries found weaker or no positive effects, for example, Leschied and Cunningham (2002) in Canada and Sundell et al. (2008) in Sweden. Other independent evaluations outside the United States showed desirable effects of MST, for example, Ogden and Amlund Hagen (2006) in Norway and Asscher et al. (2014) in the Netherlands (although the latter not on official delinquency). Sundell et al. (2008) discussed features of the social welfare systems that may have contributed to the inconsistency between the MST evaluations in Norway and Sweden.

Beyond the cultural/social context, evaluation methods and selective reporting seem to be relevant for the differing results on MST: a meta-analysis by the MST group found a substantial mean effect (Curtis et al. 2004); however, a systematic review by Littell (2006) raised concerns about the validity of various MST evaluations, particularly those by the program developers themselves. Littell objected that most positive effects reported in the articles from the Henggeler group stemmed from post-hoc analyses of subgroups and/or secondary outcome criteria. The mean effects were rather small and not statistically significant for a priori analyses of full-sample results on primary outcomes. Henggeler et al. (2006) defended their findings; however, a more recent independent meta-analysis found a mean effect of MST that was lower than that of Curtis et al. (2004), although a little more positive than Littell (2006) reported (van der Stouwe et al. 2014). There was a significant effect on the primary outcome of delinquency, but numerous moderators played a role (e.g., country of origin, efficacy versus effectiveness, study quality, treatment duration, sample, and outcome characteristics). These findings clearly show a substantial amount of variance between single evaluations that may not allow a general conclusion about the effectiveness of MST.

This situation is not rare in developmental prevention. For example, whereas Sanders et al. (2000) reported desirable effects of their Triple-P parenting program in Australia and a meta-analysis by Triple-P researchers showed positive mean outcomes (Nowak and Heinrichs 2008), independent research found no effect in Switzerland (Eisner et al. 2012). Eisner (2014) also questioned the results of a large-scale implementation of Triple-P in the United States, and Sanders (2015) published a paper on how to deal with conflicts of interest. As with MST, the details cannot be discussed here; however, there are again controversial findings on a widely disseminated program.

In addition to replication across different studies, there are questions of generalizability when one takes a closer look at single evaluations. Even most studies using RCTs or sound quasi-experimental designs have rather short follow-up periods and do not address the issue of sustainability (Lösel and Beelmann 2003; Mihalic and Elliott 2015). Only a handful of evaluations worldwide have long follow-ups of about ten years or more (Farrington and Welsh 2013). Thus, it remains unclear whether programs that intend to prevent a criminal development really achieve this aim. There are a few exceptional studies with positive effects from childhood to adulthood (e.g., Schweinhart 2013; for some other studies, see below), but McCord's (2003) study showed the other side of the coin.

Deficits in well-replicated, long-term findings are also reported from the Blueprints for Healthy Youth Development. This important registry established standards for evidence-based prevention, for example, at least two RCTs or sound quasi-experiments with positive results. Taking stock of the Blueprints, Mihalic and Elliott (2015) reported that more than 1300 prevention programs have been analyzed over time, but only 54 could be certified as model programs that fulfilled the criteria of solid evidence. Although the authors noted overall progress, they emphasized that the number of model programs would be less than a handful if independent evaluation were required as a criterion. They also noted that the Blueprints criterion of “sustained impact” requires only 12 months; many programs would not have been certified if a longer period had been demanded. In addition, the quality of model programs often deteriorates in practice (Gandhi et al. 2007), and effectiveness is typically lower than efficacy in demonstration studies (e.g., Weisz et al. 1995). Since evidence-based registries on what works are highly important (Gottfredson 2016), such self-critical comments by pioneers in this field must be taken seriously. One should also be aware that various registries apply different criteria, so there is inconsistency with regard to what works or what is best practice (Fagan and Buchanan 2016; Gandhi et al. 2007).

Although researchers are aware of replication across studies, it is less recognized that there is a similar issue of outcome replication or consistency within single evaluations. This can be illustrated by findings from our own Erlangen-Nuremberg Development and Prevention Study. This project combined a prospective longitudinal study with an experimental study of kindergarten children and their families in Bavaria. In the prevention part, the universal program EFFEKT was evaluated. It contains a program on positive parenting, a child training on social problem solving, and a combination of both. The controlled design showed positive effects on externalizing behavior problems after 2–3 months, 2–3 years, and 4–5 years (Lösel et al. 2009; Lösel and Stemmler 2012). After about 10 years, there were still some significant desirable outcomes, that is, in boys' self-reported property offending (Lösel et al. 2013). We also found various positive effects in shorter evaluations of the program in samples from deprived migrant backgrounds (Runkel et al. 2016) and families with emotional problems (Bühler et al. 2011). Overall, the project showed replicated effects, but the findings varied across follow-up periods, outcome measures, and sub-programs. In some analyses, the child training had significant effects, while in others the parent training, and more often the combined program, had better outcomes. We found desirable effects when the kindergarten nurses or school teachers assessed the child behavior, but not when the mothers were the informants. Some results also varied with the kind of behavior problems. We could provide plausible explanations for these variations, but we are aware of the risks of post-hoc plausibility and fishing for significance.

Perhaps the inconsistency in our findings is partly due to the implementation of a relatively short universal prevention program. However, evaluations of more intensive programs also showed positive effects alongside differing findings across follow-up times, outcome measures, and subgroups (e.g., Asscher et al. 2014; Kellam et al. 2008). Similar observations have been made in some of the most prominent studies of intensive risk-based prevention, for example:

  • The Nurse Family Partnership program (Olds et al. 1998) supports at-risk mothers during pregnancy and the first two years after birth. A sound evaluation after 12 years showed significant desirable effects on partner relationships, health behavior, need for social care, and other outcome measures, but not on alcohol use and arrest (Olds et al. 2010). In a follow-up at age 19, only females had significantly less delinquency than the control group (Eckenrode et al. 2010), although various earlier findings had been significant for males.

  • The Fast Track prevention trial started at child age 6–7 years and lasted over 5 years. The program contained parent training, home visits, child social skills training, parent–child sessions, academic tutoring, peer coaching, and classroom management (Conduct Problems Prevention Research Group, CPPRG 2002). The RCT showed desirable short- and long-term effects on measures of children's problem solving, cognitive skills, and social behavior in various follow-ups, but there were also several nonsignificant and some negative outcomes (CPPRG 2004, 2010). At age 19, the program group had fewer official offenses (particularly in the highest-risk group), but there were no significant effects on self-reported delinquency, which is normally more sensitive to change and had shown positive effects before.

  • The Montréal Prevention Experiment addressed high-risk 7- to 9-year-old boys from families with low socioeconomic status (Tremblay et al. 1995). It lasted about two years and included a program on adequate parenting and child social skills training. The RCT evaluation with two control groups showed no clear short-term effects, but significant effects after three years and later (e.g., less aggression and gang membership). After 15 years, more program participants had completed high school and fewer had a criminal record than in the control group (Boisjoli et al. 2007).

I referred to these three examples because their research quality is beyond any doubt. Many other prevention studies also found significant effects in some variables, at some times, and in some subgroups (but not in others). Sometimes there are decreasing effects over time, but occasionally also increasing effects (“sleeper effects”). Researchers provide sound reasons for the inconsistency in some of their results. However, as in the reproducibility discussion in psychology, these post-hoc interpretations are based more on plausibility than on prior hypotheses. In the philosophy of science, this is known as “exhaustion”, that is, further conditions are added to the deductive-nomological model of explanation (Hempel and Oppenheim 1948). Practitioners may not be interested in the philosophy of science; however, recommending a program without specifying the relevant conditions may lead to disappointment when a model program is re-implemented without success. Specification of relevant conditions is essential in practice, and its lack may be one reason why scholars argue against randomized experiments (Cook 2003).

Outcome variation is normal in social interventions such as developmental crime prevention. Therefore, meta-analyses are important to estimate reproducibility. As mentioned, they show overall positive but very heterogeneous results. The mean effect sizes vary substantially, and this is also the case for moderator variables. The variation may be due to different types of prevention (e.g., universal, selective, indicated), targets (e.g., child, family, school, or neighborhood), selection of primary studies, coding of variables, outcome measures, follow-up periods, methods of effect size calculation, fixed- or random-effects models of integration, and so forth. Meta-analyses revealed a broad range of significant moderators, but these are not identical across syntheses. Some could be replicated more often than others, for example, larger effects in indicated prevention (at-risk groups), multimodal approaches, good program integrity, small samples, short follow-ups, and studies where the evaluators were involved in the program development or implementation (e.g., Lösel 2012a; Lösel and Bender 2012). More specific moderators have been found for programs against school bullying (Farrington and Ttofi 2009; Ttofi and Farrington 2011): effective programs include components such as parent information, school meetings, schoolyard supervision, clear classroom rules, and disciplinary measures. However, in all these meta-analyses, one must bear in mind that the moderators are derived from primary studies whose results are mainly short term and not yet well replicated.
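
To illustrate how such heterogeneity is quantified before moderators are examined, the following minimal sketch computes Cochran's Q, the I² statistic, and a DerSimonian-Laird random-effects mean for a handful of hypothetical log odds ratios. The numbers are invented for illustration and are not data from any of the reviews cited above.

```python
import math

# Hypothetical log odds ratios and their variances from five primary studies.
effects = [0.45, 0.10, 0.60, 0.05, 0.30]
variances = [0.04, 0.02, 0.09, 0.01, 0.05]

weights = [1 / v for v in variances]  # fixed-effect (inverse-variance) weights
fixed_mean = sum(w * y for w, y in zip(weights, effects)) / sum(weights)

# Cochran's Q and I^2 quantify variation beyond what sampling error explains.
q = sum(w * (y - fixed_mean) ** 2 for w, y in zip(weights, effects))
df = len(effects) - 1
i2 = max(0.0, (q - df) / q) * 100

# DerSimonian-Laird between-study variance and the random-effects mean.
c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
tau2 = max(0.0, (q - df) / c)
re_weights = [1 / (v + tau2) for v in variances]
re_mean = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)

print(f"Q = {q:.2f} (df = {df}), I^2 = {i2:.0f}%")
print(f"Fixed-effect mean = {fixed_mean:.2f}, random-effects mean = {re_mean:.2f}")
print(f"Mean OR (random effects) = {math.exp(re_mean):.2f}")
```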

Replication in correctional treatment

In the 1980s, we carried out a first meta-analysis on the treatment of adult offenders in German prisons (Lösel and Köferl 1989). Around the same time, Lipsey (1992a) published a much larger meta-analysis on the treatment of juvenile delinquents in North America. Both meta-analyses found an overall desirable effect, but the mean effect sizes were small (between about d = 0.10 and 0.20, depending on the method of analysis). There was much variation between the outcomes of different primary studies and both reviews showed various moderators of effect size.

Since the 1980s, there has been clear progress in correctional treatment (Bonta and Andrews 2017; Cullen 2013; Lösel 2012b; MacKenzie 2006). More sound evaluations have been carried out, the majority in North America and other English-speaking countries. Systematic reviews and meta-analyses confirmed a mean desirable treatment effect on recidivism (Cullen 2013; Lipsey and Cullen 2007; Lösel 2012b; Wilson 2016). The mean effect sizes in most meta-analyses were positive (Wilson 2016). Compared to a recidivism rate of 50% in the control groups, Wilson (2016) estimated a mean reduction of about 10 percentage points due to treatment. Such moderate effects reduce victimization and can pay off in financial terms (Welsh and Farrington 2000). For the treatment of general and violent offenders, the typical mean effect sizes seem to be relatively homogeneous (around d = 0.20 ± 0.10; Lösel 2012b). In sexual offender treatment, they are more heterogeneous, ranging from d = 0.08 to 0.54 (Lösel and Schmucker 2017), with meta-analyses on the treatment of young offenders at the upper end (Reitzel and Carbonell 2006; Walker et al. 2004). In spite of such encouraging results, there is still controversy about the effectiveness of sex offender treatment. This is due to often poorly controlled studies, small samples, different treatments, heterogeneous offender types, various comorbidities, variation in outcome measurement, handling of dropouts, and a wide range of follow-up periods (Lösel and Schmucker 2017).
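
The correspondence between d ≈ 0.20 and a difference of about 10 percentage points around a 50% base rate can be retraced with Rosenthal and Rubin's binomial effect size display (BESD), as in the minimal sketch below. Whether Wilson (2016) used this particular conversion is my assumption, not a statement from that review; the BESD centers the two rates on 50%, so it yields approximately 55% versus 45% rather than exactly 50% versus 40%.

```python
import math

def besd_rates(d: float) -> tuple[float, float]:
    """Binomial effect size display: convert d to the correlation r,
    then to rates of 0.5 - r/2 and 0.5 + r/2 around a 50% base rate."""
    r = d / math.sqrt(d ** 2 + 4)
    return 0.5 - r / 2, 0.5 + r / 2

treatment, control = besd_rates(0.20)
print(f"Recidivism: control ~{control:.0%}, treatment ~{treatment:.0%}")
# d = 0.20 -> r ~ 0.10 -> rates of ~55% vs. ~45%: a difference of about
# 10 percentage points, consistent with the estimate cited above.
```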

As in developmental prevention, some heterogeneity of findings is normal in correctional treatment. Accordingly, the “what works” literature aims to show what is most effective and what has weak or no effects. The Maryland Report on Crime Prevention required at least two studies with positive findings and designs that reached at least level 3 on its scale of methodological rigor (Sherman et al. 2002). This was a plausible criterion, but the pattern of results is often more complicated. For example, the widely used “Reasoning & Rehabilitation” program showed a desirable effect in several studies, but no effect in various others (Tong and Farrington 2006). Similarly, sound evaluations of cognitive-behavioral programs for sexual offenders revealed positive effects in some studies, but zero effects and even negative tendencies in others (Lösel and Schmucker 2005; Schmucker and Lösel 2015). These and other examples suggest that information about a mean effect alone has very limited value for practice.

To increase effectiveness and reproducibility, Andrews et al. (1990) proposed the risk–need–responsivity (RNR) model of appropriate treatment that became widely used in practice. Treatment showed positive mean effects when all three RNR criteria were fulfilled (Bonta and Andrews 2017). The effect sizes decreased when fewer principles were met and became even slightly negative when no criterion was fulfilled. This pattern has been replicated in meta-analyses on general offender treatment (Bonta and Andrews 2017), sexual offender treatment (Hanson et al. 2009), and young offender treatment (Koehler et al. 2013). In addition to RNR, many recent offending behavior programs integrate research on desistance (Farrall and Calverley 2006; Shapland et al. 2012), natural protective factors (Lösel and Bender 2003; Lösel and Farrington 2012), and the Good Lives Model (Ward and Brown 2004; Ward and Maruna 2007). The impact of such enrichments on reoffending is not yet well evaluated, but they are in accordance with broader RNR models of “what works” (Andrews et al. 2011; Lösel 1995).

Replicated moderators in meta-analyses play a key role in the explanation of heterogeneous treatment outcomes (Lipsey and Cullen 2007; Lösel 2012b). For young offender treatment, effects were larger in programs with a cognitive-behavioral concept, adherence to RNR, fidelity in implementation, community-based (ambulatory) treatment, good descriptive validity, smaller samples, and demonstration projects (Koehler et al. 2013). Although further moderators showed trends, the number of primary studies was too small for an adequate analysis. The same problem appeared in our recent meta-analysis of sexual offender treatment (Schmucker and Lösel 2015). The mean finding of 10.1% sexual recidivism in the program groups versus 13.7% in the control groups was moderated by various factors. Studies with cognitive-behavioral treatment, small samples, medium- or high-risk offenders, more individualized program delivery, and good descriptive validity revealed better effects. In contrast to treatment in the community, prison programs showed no significant mean effect. These findings suggest that general statements about the effect or failure of sex offender treatment are inappropriate. It is plausible that sexual offender treatment in prisons is less effective (as compared to the respective control groups) because there is no reality testing for child molesters or internet offenders in custody. However, this is not a sufficient explanation, because treatment in forensic hospitals had a slightly better and significant effect (Schmucker and Lösel 2015). General criminogenic effects of incarceration (Durlauf and Nagin 2011; Nagin 2013) must also be considered. However, this explanation may not be sufficient either, because drug-addicted offenders seem to benefit from a closed institution (Lösel and Koehler 2014).
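
A small difference in percentage points can still be a meaningful effect size. The short computation below, my own illustration of the arithmetic rather than an analysis taken from the meta-analysis itself, shows the odds ratio and relative reduction implied by 10.1% versus 13.7% sexual recidivism.

```python
def odds_ratio(p_control: float, p_program: float) -> float:
    """Odds ratio comparing control-group and program-group recidivism."""
    return (p_control / (1 - p_control)) / (p_program / (1 - p_program))

or_value = odds_ratio(0.137, 0.101)
relative_reduction = (0.137 - 0.101) / 0.137
print(f"OR = {or_value:.2f}")                            # ~1.41
print(f"Relative reduction = {relative_reduction:.0%}")  # ~26%
# The 3.6-percentage-point difference thus corresponds to roughly a
# one-quarter relative reduction in sexual recidivism.
```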

As in developmental prevention, evaluations of correctional treatment often contain some inconsistency within one and the same study. For example, Lösel and Pomplun (1998) carried out a matched-pairs evaluation of an educational program as an alternative to remand incarceration of young offenders. The findings were mixed: we found nonsignificantly lower rates of any recidivism in the control group, but significantly lower rates of serious recidivism in the treatment group. We felt that this result was plausible, and it now fits current knowledge on larger treatment effects for medium- to high-risk than for low-risk offenders (e.g., Travers et al. 2013). However, did we really have a hypothesis about this differentiated result at the time of our study?

Studies on sex offender treatment also vary in their findings, and this cannot simply be attributed to program or design differences (Lösel and Schmucker 2017; Schmucker and Lösel 2015). As mentioned above, custodial programs had no significant effect on the rate of sexual recidivism, but various studies suggest that there may be an impact on other outcomes, such as a lower rate of nonsexual reoffending, or more delayed or less harmful sexual reoffending (e.g., Olver et al. 2012; Smid et al. 2016). Evaluations of sexual offender treatment often raise more questions than answers (Grady et al. 2015). There are plausible theoretical, statistical, or practical explanations for the mixed pattern of results, but these should not only be provided post-hoc; they should also inform differentiated models of the conditions under which treatment is successful.

Some scholars may argue that inconsistent results in offender treatment studies are due to a lack of theoretical foundation. This may only be partially true. Many programs are based on sound social learning or criminological theories (Bonta and Andrews 2017). Others apply more differentiated, eclectic, and case-oriented approaches to treatment that are supported by general research on psychotherapy (e.g., Beutler et al. 2016). The processes of individual change are more complex than the typical 3–5 group trajectories of correlational studies in developmental criminology (Jennings and Reingle 2012). Research is complicated by low correlations between theoretically meaningful proximal measures of therapeutic impact and their relation to later recidivism (Lipsey 1992b; McDougall et al. 2009; Woessner and Schwedler 2014). There are issues of social desirability and impression management in psychometrics, low base rates of reoffending (e.g., for sexual offenses), poor sensitivity of dichotomous recidivism criteria in official crime data, and other methodological factors. Sometimes, a theoretically meaningful explanation of heterogeneous findings can be as challenging as nailing a pudding to the wall.

Discussion and perspectives

It is the fundamental role (and privilege) of scientists to be neutral and to tell the truth as far as they know it. This includes being self-critical. Following the legacy of Joan McCord, my lecture aimed to raise some problems of reproducibility in criminology. To avoid misunderstanding, it should be stated that criminology has made substantial progress in the fields of developmental prevention and correctional treatment. However, a realistic evaluation suggests that more differentiated and well-replicated findings are necessary. Would any criminologist drive over a bridge after being told that “on average” such bridges are solid, but 10% collapsed within a certain time period? Of course, it is not fair to compare criminology with engineering or the natural sciences, and the introduction above has shown that reproducibility is a problem even in these disciplines. The topics of this article are more similar to medicine, where many cures have limited effects, but no better alternatives are yet available. In the bridge analogy, people would perhaps drive over the risky construction if they had an urgent reason and knew that nearly all collapses happened at times when there were overloaded trucks, heavy storms, or extreme temperatures. This would be an example of asking about the conditions under which a scientific explanation is more or less valid or an intervention is more or less justified.

The sections above have shown that there is much similarity in the findings on developmental crime prevention and offender rehabilitation. Not only the typical mean effect sizes but also the large outcome variations are similar and suggest that the topic of reproducibility is relevant for criminology. Replication problems may be partly due to the complex longitudinal field experiments in both areas. Since there are rather consistent as well as inconsistent findings, I would not speak of a “reproducibility crisis” as it is discussed in psychology. However, there are obviously problems of replication in criminology, and these may not be limited to the two areas addressed in this article.

It would be worthwhile to analyze problems of reproducibility in other fields of criminology, for example, in research on the origins of crime. Research on prominent theories such as self-control theory has shown overall supportive but very heterogeneous results (e.g., Lösel 2017; Pratt and Cullen 2000; Walters 2016). More generally, Weisburd and Piquero (2008) found that the explanatory power of criminological theories is often low and leaves 80–90% of the variance unexplained. More crime-specific theories showed somewhat stronger explanatory power than individual-based models. Accordingly, some research suggests that there may be superior effects of situational crime prevention (Clarke 1997) or place-based hot spots policing (Weisburd et al. 2008). However, situational and police-based crime prevention contain rather different programs. Although there are overall positive effects, systematic reviews vary substantially in their outcomes (Bowers and Johnson 2016; Telep and Weisburd 2016). In principle, situational crime prevention seems to face problems of reproducibility similar to those of person-oriented approaches.

One should not polarize too much between the two types of prevention, which are heterogeneous in themselves. Since a small group of persistent offenders is responsible for about half of all crimes (e.g., Farrington et al. 2006), person-oriented prevention and treatment of criminality are highly important. It should also be taken into account that situational crime prevention often refers to group/population data, whereas most person-oriented approaches use outcomes of single individual acts (e.g., recidivism) instead of more adequate aggregated behavior (Epstein and O'Brien 1985). Individual propensities and situational factors interact (Wikström et al. 2012) and, often, situation-oriented prevention also requires differentiation. A typical example is prevention through CCTV, where the outcomes differ between countries, crime types, implementation contexts, and combinations of measures (Welsh and Farrington 2009).

These and other examples suggest that the issue of differentiation in replication is not only relevant for developmental prevention and offender treatment. Rossi (1978, 1987) rightly emphasized the importance of methodologically sound evaluations, and criminologists have repeatedly underlined the need for more RCTs (e.g., Farrington 2003; Weisburd 2010). The Academy of Experimental Criminology, the ASC Division of Experimental Criminology, and the Campbell Crime and Justice Collaboration promote this aim. However, although criminology would benefit from more RCTs, the above-mentioned findings on replication in psychology and the natural sciences have shown that more experiments alone will not solve the reproducibility problem. Meta-analyses on person-oriented prevention programs revealed large differences in the outcomes of RCTs on the same or very similar programs (Lösel and Beelmann 2003; Schmucker and Lösel 2015). Randomization enhances internal validity, but in contrast to other fields of criminology (Weisburd et al. 2001), it is not consistently correlated with effect sizes in treatment evaluations (Lipsey and Cullen 2007). RCTs are also vulnerable in studies with small samples, selective dropout, experimental rivalry, program diffusion, weak outcome measurement, and other threats to validity (Lösel 2007; Shadish et al. 2002).

Beyond the overall design quality, there are numerous influences on the outcome of program evaluations. In the field of correctional treatment, Lösel (2012b) integrated characteristics of programs, offenders, contexts, and evaluation methods in a model of influences on the effects. A slightly modified version is shown in Fig. 1.

Fig. 1 A model of factors that may influence the effect of offender treatment programs

Most of these factors are empirically supported by meta-analyses or single studies. Very similar influences seem to be relevant for the outcome heterogeneity in developmental prevention (Lösel 2012a). However, not all of these moderators are equally well founded empirically or equally relevant. For example, the context “custody vs. community” is normally not relevant for developmental prevention, whereas personality traits of the target group are more important in correctional treatment.

The model in Fig. 1 contains many empirically relevant factors. Although it is not a theory, it is obviously not in accordance with William of Ockham's (1287–1347) suggestion to keep explanations as parsimonious as possible. If one also takes principles of implementation science into account (Fixsen et al. 2009), there could be more than 30 factors that are relevant to outcomes. Although this would reflect the complexity of intervention, it leads to an information overload for practice, policy, and research designs. Therefore, I propose to select and test only the most relevant moderators in the respective field of intervention. Basic research on the human capacity for information processing found the magical number of seven, plus or minus two (Miller 1956). Perhaps this figure is a proper starting point for differentiations that are robust in replications.

Research on the most important influences on outcome heterogeneity will need fine-tuning. Following Rossi's metallic metaphors, something like a Tin Can Law would be a suitable analogy: it assumes some solid material in empirical findings, but one needs to squeeze it into an adequate shape to explain outcomes and guide future interventions. In areas with a substantial number of primary studies, meta-analyses play a key role in this process. They should be systematically reviewed for the moderators that are best replicated, for example, (1) a multimodal concept, (2) a sound theoretical foundation, (3) integrity in delivery, (4) staff competence, (5) a favorable social context, (6) medium- to high-risk target groups, and (7) a not too large roll-out that allows proper monitoring. These characteristics should then be included and systematically tested in sound primary studies. As in multicenter treatment research in medicine, these primary studies should be designed as a series of replications to test the reproducibility of findings (e.g., in a meta-analysis). The evaluation of a restorative justice program by Sherman et al. (2015) is a good example of this strategy.

Research on differentiated knowledge about reproducible findings needs to be embedded in the general framework of enhancing replication: empirical studies should adhere to the recommendations and guidelines for sound and replicated evidence that have been made in various contexts; see The Steering Group of the Campbell Collaboration (2016), the standards of the Society for Prevention Research (Gottfredson et al. 2015), the recommendations of the Reproducibility Project in Psychology (Open Science Collaboration 2015), and the CONSORT standards of reporting (Hopewell et al. 2008). Only a few issues can be mentioned here. Evaluation studies should be carried out not only by program developers but also by independent researchers. The reason is not that one must assume intentional misconduct of program owners; rather, there are various decisions in the research process that may exert a more or less “unconscious” influence (e.g., subgroup allocation, definition and coding of variables, aggregation of data, significance testing, selective reporting). Empirical studies should be preceded by research protocols that enhance transparency and reduce selective post-hoc reporting of results. Researchers should state their main hypotheses about expected findings in advance. Outcomes should be replicated according to explicit criteria. As far as possible, studies should use multiple indicators of a construct, different informants, several measurement times, sensitivity tests, and other techniques that allow an estimation of generalizability. This is also necessary with regard to the respective target population. Criteria of “sufficiently” replicated evidence should be explicit and harmonized across registries. Findings of multiple evaluations should differentiate between efficacy in demonstration projects and effectiveness in routine practice.

These and other guidelines to promote, or at least estimate, reproducibility are not new and are well grounded in evaluation methodology (Shadish et al. 2002). However, one should be aware that such standards are more easily requested than realized in the daily practice of research. Promoting replication studies is a stepwise process that requires adequate funding and dissemination strategies (Valentine et al. 2011). From a realistic perspective, it should also be taken into account that applied research in criminology often has an exploratory character. Most of these studies are not RCTs; rather, researchers aim to use the best quasi-experiment possible under the given circumstances. Of course, I do not recommend low methodological standards, but too uniform and rigid guidelines may ignore the need for the flexible real-world strategies that Campbell (1969) so well outlined. In any case, the respective research reports should adhere to the above-mentioned standards of reporting; for example, they should not only highlight positive results but also provide information on zero or negative findings. To ensure transparency, any kind of study should report sufficient details not only on the methods but also on institutional issues and potential conflicts of interest.

Meta-analyses play a key role in research on replication. Similar to the method of confirmatory factor analysis, there should also be approaches such as confirmatory meta-analyses to validate post-hoc findings on moderators. These would dig deeper into the conditions of program success or failure (Schmucker and Lösel 2011; Shaffer and Pratt 2009). Such analyses could establish broader principles of “what works” instead of a too narrow focus on isolated programs (e.g., Beelmann 2012; Lösel 2012b). The extended RNR model (Andrews et al. 2011), the (recently modified) criteria of the Correctional Services Accreditation and Advice Panel of England and Wales (Maguire et al. 2010), and the revised standards for prevention programs (Gottfredson et al. 2015) contain such broader issues of moderating conditions. Unfortunately, research on moderators is very challenging (Lipsey 2003). Many moderators are confounded, interaction effects are difficult to replicate, and, often, there are not enough studies for sound (multivariate) analyses. Statistical criteria for outcome heterogeneity can help to avoid artifacts (Hunter and Schmidt 2004), but they cannot replace theoretically meaningful hypotheses.
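
To make the idea of a confirmatory moderator test more tangible, the sketch below fits a simple fixed-effect meta-regression of hypothetical study effects on one pre-registered binary moderator (community versus custodial delivery). The data, the weighting scheme, and the moderator are my own assumptions for illustration, not results from any of the cited syntheses.

```python
import math

# Hypothetical log odds ratios, variances, and a pre-registered binary
# moderator (1 = delivered in the community, 0 = in custody).
effects   = [0.50, 0.42, 0.35, 0.12, 0.05, 0.20]
variances = [0.05, 0.04, 0.06, 0.03, 0.05, 0.04]
community = [1, 1, 1, 0, 0, 0]

# Weighted least squares meta-regression: effect = b0 + b1 * moderator,
# with inverse-variance weights (fixed-effect assumption).
w = [1 / v for v in variances]
sw = sum(w)
swx = sum(wi * x for wi, x in zip(w, community))
swy = sum(wi * y for wi, y in zip(w, effects))
swxy = sum(wi * x * y for wi, x, y in zip(w, community, effects))
swxx = sum(wi * x * x for wi, x in zip(w, community))

b1 = (sw * swxy - swx * swy) / (sw * swxx - swx ** 2)
b0 = (swy - b1 * swx) / sw
se_b1 = math.sqrt(sw / (sw * swxx - swx ** 2))
print(f"b0 = {b0:.2f}, moderator effect b1 = {b1:.2f} (z = {b1 / se_b1:.2f})")
# A confirmatory test states the moderator hypothesis before the analysis
# and checks whether b1 replicates across independent sets of studies.
```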

More replicated research on moderators in program evaluations would make an important contribution to validating differential effects, that is, to answering the question of what works for whom, under what conditions, with regard to what outcomes, and why. There is also a need for more data on the impact of combinations of programs or of specific program elements or modules (e.g., Hawkins et al. 2008; Lipsey 2009). In demonstration projects, programs are typically evaluated in isolation, and this is most suitable for RCTs or sound quasi-experiments. In practice, however, programs may have different components or be combined with other interventions (e.g., cognitive-behavioral therapy, basic education, employment programs). In custodial offender treatment, for example, programs are more effective when they are combined with adequate measures of aftercare (e.g., Maguire and Raynor 2006). Evaluations of program packages are methodologically more difficult than those of isolated interventions. However, clinical pharmacy shows the need for this type of approach: when patients receive various medications, it is important to know the effects of combinations that may potentiate effectiveness or sometimes lead to negative side effects. Criminological program evaluation can also learn from engineering or climate research, where specific factors often have a minor effect in isolation but may show a strong impact in combination. Of course, it is always easier to say what should be done than to carry it out in research practice. However, I hope that I have shown both the challenges and the pathways by which developmental prevention, offender rehabilitation, and related areas can produce more well-replicated and differentiated results that are useful for practice and policy-making.