Introducing EMMIE: an evidence rating scale to encourage mixed-method crime prevention synthesis reviews

This paper describes the need for, and the development of, a coding system to distil the quality and coverage of systematic reviews of the evidence relating to crime prevention interventions. The starting point for the coding system concerns the evidence needs of policymakers and practitioners. The proposed coding scheme (EMMIE) builds on previous scales that have been developed to assess the probity, coverage and utility of evidence both in health and criminal justice. It also draws on the principles of realist synthesis and review. The proposed EMMIE scale identifies five dimensions to which systematic reviews intended to inform crime prevention should speak. These are the Effect of intervention, the identification of the causal Mechanism(s) through which interventions are intended to work, the factors that Moderate their impact, the articulation of practical Implementation issues, and the Economic costs of intervention. Systematic reviews of crime prevention, and the primary studies on which they are based, typically address the question of effect size, but are often silent on the other dimensions of EMMIE. This lacuna of knowledge is unhelpful to practitioners who want to know more than what might work to reduce crime. The EMMIE framework is intended to encourage the collection of primary data regarding these issues and the synthesis of such knowledge in future systematic reviews.


Introduction
The volume of research relating to crime prevention is enormous, but of varying quality. Policymakers and practitioners who want to improve their decisions by drawing on evidence thus face a variety of problems. These include, for example, finding the evidence, assessing its quality, working out which evidence is relevant to their issues, and persuading stakeholders that policy and practice should accord with what the evidence suggests.
Systematic reviews (SRs) have emerged as a method for finding, sifting, sorting and synthesizing the findings of primary evaluations relevant to particular interventions. Methods have been developed for the conduct of SRs, including the process of selecting studies for analysis, and the statistical meta-analytic procedures used to summarize the overall impact(s) of intervention. Despite this, just like the primary evaluations on which they are based, SRs vary in quality and do so in ways that should be considered by those involved in evidence-based policy.
Building on earlier work concerned with primary studies (Perry et al. 2010; Sidebottom and Tilley 2012), this paper focuses on the assessment of the evidence quality of SRs, and provides guidance for the conduct of future ones. Measures of effect size are discussed, but consideration is also given to other dimensions of importance to practitioners, the intended primary consumers of SRs. These include what an intervention actually comprises and the ease with which it can be implemented. While the work reported is primarily focused on SRs, many of the issues are equally germane to primary studies.
In what follows, we first consider existing efforts that have provided the means to assess the quality of evaluation evidence. We draw on research from public health and medicine as well as crime prevention. Next, we consider what practitioners need to know. SRs that most adequately attend to all of the issues of importance will be more valuable to practitioners and so we present a rating scale designed to enable the systematic assessment of the quality of SRs of crime prevention initiatives, and to inform future ones.

Existing scales for assessing the evidence base
Scholars have noted that evaluations and clinical trials vary in quality, and that their reporting is often incomplete (e.g., Adetugbo and Williams 2000; Perry et al. 2010). In response, efforts have been undertaken to produce guidance regarding the conduct of primary studies (e.g., the CONSORT statement: Schulz et al. 2010; STROBE: von Elm et al. 2007), and SRs of them (e.g., AMSTAR, GRADE, PRISMA, RAMESES). In criminology, the Maryland Scale (Sherman et al. 1997) was developed to gauge the strength of individual studies according to their methodological rigor. It represents a 5-level hierarchy of evaluation evidence intended to indicate the extent to which an evaluation is able to rule out forms of bias as alternative explanations to a program effect. That is, it speaks to the issue of internal validity (see Campbell and Stanley 1963). It says little, however, about the level of detail that authors should report about primary evaluations conducted (i.e., their 'descriptive validity', see Gill 2011).
In other disciplines, more effort has been invested in the provision of such guidance. In the case of primary studies, Moher et al. (2010) report the most recent incarnation of the CONSORT instrument. CONSORT 2010 is a 25-item checklist that focuses on the reporting of randomized controlled trials (RCTs). It primarily focuses on the extent to which study conclusions can reasonably be attributed to the treatment investigated (i.e., internal validity). The Cochrane Risk of Bias Scale (Higgins et al. 2011) also considers such issues, paying particular attention to the blinding of treatment providers, recipients and analysts, and problems with placebos. While systematic, these scales are silent on other types of validity (Sidebottom and Tilley 2012).
Apropos SRs, the AMSTAR (Shea et al. 2007), GRADE (Guyatt et al. 2008) and PRISMA (Moher et al. 2009) guidelines were developed to facilitate the assessment of the methodological quality of conducted studies (see also Higgins and Green 2011). Like the checklists for primary studies, however, they tend to focus on issues of internal validity.

Beyond internal validity
In their review of 302 meta-analyses of evaluations of diverse psychological, educational and behavioral treatments, Lipsey and Wilson (1993) concluded that: "The proper agenda for the next generation of treatment effectiveness research, for both primary and meta-analytic studies, is investigation into which treatment variants are most effective, the mediating causal processes through which they work, and the characteristics of recipients, providers, and settings that most influence their results" (Lipsey and Wilson 1993: 1201).
Others have made similar suggestions (e.g., Cartwright and Hardie 2012; Rosenbaum 1988). The process of 'realist evaluation' and review has attempted to address such issues more directly (Pawson 2006; Pawson and Tilley 1997) and speaks to this agenda. In particular, realist studies explicitly focus on the causal 'mechanisms' through which interventions bring about their effects, the 'contexts' or conditions needed for treatments to activate potential causal mechanisms, and the 'outcomes' realized by the activation of causal mechanisms in the conditions in which they are introduced. What are produced in realist evaluations and reviews are Context, Mechanisms, Outcome pattern Configurations (CMOCs). This provides a framework for thinking about things other than effect size and factors that SRs might address.
To illustrate the importance of this, consider that interventions may bring about their effects in various ways. One example is the variation in mechanisms through which CCTV might reduce crime in car parks. These include, for example, the 'caught in the act' mechanism which leads to specific deterrence and incapacitation of the offender; 'you've been framed', where the offender perceives an increased risk; and 'memory jogging', where the presence of cameras reminds users to take precautions (Pawson and Tilley 1997: 55-82).
Crucially, the mechanisms being activated will depend on the particular conditions of the car park. For example, 'memory jogging' can only occur when the cameras are positioned in observable places, and the 'you've been framed' mechanism will only be activated if offenders can see the cameras or are aware of them. SRs and primary evaluations alike can only tease out the possible mechanisms through which interventions work by articulating 'logic models' of how they might do so and collecting the necessary data to test them. SRs will, of course, be limited by what can be found in primary studies, but they should explicitly seek to locate such information, and indicate the absence of information as well as synthesize what is available.
In the case of SRs, the nearest counterpart to realists' mechanisms and contexts are meta-analysts' 'mediators' and 'moderators'. Mediators describe the chains of events (or intermediate outcomes) that occur between a treatment and the ultimate outcomes produced. In our CCTV example, mediator variables that might be used to test for activation of the 'caught in the act' mechanism include the volume of offenders identified on CCTV footage, and the number subsequently prosecuted. In the absence of evidence that offenders had at least been identified on CCTV footage, this mechanism would not represent a plausible explanation for any impact observed. Such data should not be difficult to obtain in primary evaluations, and systematic reviewers should have no difficulty in determining whether chains of causality have been explored in primary studies.
While the checklists discussed above are silent on these issues, the SQUIRE guidelines (Ogrinc et al. 2008), developed to inform primary studies of quality improvement in healthcare, are not. The authors draw on the realist approach (see also RAMESES: Wong et al. 2013) suggesting (for example) that primary studies should "describe the mechanisms by which intervention components were expected to cause changes, and plans for testing whether those mechanisms were effective" (p. 65). The SQUIRE guidelines thus represent a useful complement to those that focus on issues of internal validity. However, such guidance has yet to be incorporated into advice for the conduct or rating of SRs, which are the focus of this paper.
Moderators are equally important. They refer to variables that may explain variation in outcomes across different studies. They can include circumstances associated with differences in the efficacy of the intervention, such as the type of location. For example, CCTV may work more effectively in contained environments (e.g., car parks) than in open spaces (e.g., town centers). They can also include the study methods employed. For example, weaker effect sizes may be reported for RCTs than quasi-experimental studies (Weisburd 2010). While SRs typically consider the latter type of moderator, more attention could arguably be given to the former.
As suggested by Lipsey and Wilson (see also Cartwright and Hardie 2012; Weisburd et al. 2015), to better inform policy, the evidence base needs to speak to how interventions work and where and when they might do so most effectively. Consequently, when assessing the quality of the available evidence, in addition to considering the extent to which evaluations manage to rule out biases that might distort estimates of effect size, we also need to gauge the extent to which they contribute to understanding of the contexts/moderators relevant to the activation of the mechanisms/mediators that produce variations in outcome across differing sub-groups.
Despite their focus on internal validity, the CONSORT and SQUIRE guidelines for primary studies include items on the implementation of interventions, asking whether they provide sufficient detail to allow replication elsewhere or to determine whether they will be suited to particular situations. In a clinical trial, this would include the dose of drug, and how and when it was administered. This is encouraging as implementation is rarely straightforward, but we suggest more is required.
Finally, because practitioners have limited budgets, resourcing one intervention means that something else must be forgone. Moreover, the most effective intervention tested will be of little practical value if it is prohibitively expensive to implement or maintain. Thus, to make good decisions, policymakers and practitioners need information on the overall costs and benefits of particular interventions and their alternatives. Current guidelines are typically silent on these issues.

The EMMIE framework
The preceding discussion suggests that the adequately evidence-equipped policymaker and practitioner need to know the following about interventions they might want to implement:
E: the overall effect direction and size (alongside major unintended effects) of an intervention and the confidence that should be placed on that estimate
M: the mechanisms/mediators activated by the policy, practice or program in question
M: the moderators/contexts relevant to the production/non-production of intended and major unintended effects of different sizes
I: the key sources of success and failure in implementing the policy, practice or program
E: the economic costs (and benefits) associated with the policy, practice or program.
Both primary evaluations and SRs may attend to each of these more or less adequately. In assessing the evidence, it is thus important to differentiate between what the evidence suggests (e.g., an estimate of effect size) and the quality of that evidence (e.g., the methodological adequacy of the studies on which the estimate is based). With respect to assessing evidence quality, a key question concerns how meticulous the reviewers were in attending to each dimension. In the next sections, we discuss each in turn. As noted, we focus on SRs. We do so as their intended purpose is to synthesize evidence on treatments, an exercise which can provide practitioners with a good starting point in selecting interventions.

E -Effects: overall effect direction and size
The importance of producing unbiased estimates of mean effect sizes in SRs has been discussed elsewhere. For brevity, Table 1 summarizes the features of SRs that should be attended to in high-quality studies. To these we add (in the final row of Table 1) the assessment of unanticipated outcomes (e.g., quantification of crime displacement or a diffusion of crime control benefit, see Johnson et al. 2014). Table 2 lists the types of evidence (referred to as 'EMMIE-E') that should be included in an SR to inform understanding of an intervention and on which assessments of quality should be based. In terms of assessing the quality of an SR on effect size, we suggest that the issues identified should inform a five-point scale as shown in the third column of Table 2 ('EMMIE-Q'). Table 3 lists the individual items that inform the EMMIE-Q summary rating.

Table 1
[...] (2001)
Sufficient assessment of the risk of bias (at least two necessary for sufficient consideration)*
- Assessment of potential publication bias: Hedges and Vevea (1996)
- Consideration of inter-rater reliability: Stock et al. (1982)
- Consideration of the influence of statistical outliers: Huffcutt and Arthur (1995)
Attention to the validity of the constructs, with only comparable outcomes combined and/or exploration of the implications of combining outcome constructs*: Petticrew and Roberts (2006)
Assessment of the influence of study design (e.g., separate overall effect sizes for experimental and quasi-experimental design): Bryant and Wortman (1984)
Assessment of the influence of unanticipated outcomes or spin-offs on the size of the effect (e.g., quantification of displacement or diffusion of benefit)
Items highlighted with an (*) symbol are particularly important for the EMMIE-Q rating (see Table 2)

Table 2
Effect
[...] Table 1)
EMMIE-Q (quality rating):
1: Sufficient consideration of one *element of validity
2: Sufficient consideration of two *elements of validity
3: Sufficient consideration of three or four *elements of validity
4: Sufficient consideration of five or six elements of validity (including all of those marked with an '*')

Mechanism/mediator
EMMIE-E (types of evidence):
- Map of possible mechanisms/logic maps
- A priori mediator or mechanism-based moderator analysis
- Post hoc mediator or mechanism-based moderator analysis
- Assessment/statements of most likely mechanisms and any contextual conditions (these can be narratives)
EMMIE-Q (quality rating):
0: No reference to theory; simple black box
1: Broad statement of assumed program theory stated (mechanisms and/or processes)
2: Detailed articulation of theory, based on interrogation of relevant literature and/or elicited from practice
3: Formalization of theory and derivation of precise predictions from it
4: Test, corroboration, falsification and refinement of theories, using data assembled for the purpose

Moderator/context
EMMIE-E (types of evidence):
- A priori context-based moderator analysis/subgroup analysis (analysis testing the differences that context makes to outcome; theoretically driven)
- Post hoc context-based moderator analysis/subgroup analysis (analysis testing the difference context makes to outcome; conducted due to data availability/not theoretically driven/not mentioned prior to analysis)
- Statements qualifying contextual variations (these can be narratives)
EMMIE-Q (quality rating):
0: No reference to condition contexts or moderators that may be significant for activation of mediators or mechanisms
1: Ad hoc description of possible relevant moderators or contexts
2: Tests of the effects of moderators or mechanisms defined post hoc using variables that are at hand
3: Theory-based pre-specification of expected moderators and mediators relevant to the activation of mediators or mechanisms
4: Collection and analysis of relevant data relating to the pre-specified expected moderators and contexts

Implementation
EMMIE-E (types of evidence):
- A list/statement of key components necessary for implementation of reviewed interventions
- A list/statement of key components deemed necessary for replication elsewhere
EMMIE-Q (quality rating):
0: No account of implementation or implementation challenges
1: Ad hoc comments on implementation
2: Systematic efforts to document implementation issues
3: Detailed evidence-based account of expected levels of fidelity to program, policy or treatment plans
4: Complete evidence-based account of expected levels of fidelity to program, expected obstacles and specification of elements necessary for replication elsewhere

M -Mechanisms/mediators: how the policy, practice or program produces its effects
In pharmaceutical medicine, prior to clinical trials, much laboratory work is undertaken to test and refine understanding of the chemical and physiological processes through which a drug produces its effects. Such background work is rarely undertaken in crime prevention, and hence the mechanism(s) through which an intervention might impact upon crime are often poorly understood prior to implementation. Moreover, social interventions are generally complex. What is delivered may differ from one site and time to another and there can be long causal chains between the intervention implemented and effects realized. Working out what it is about an intervention that brings about its intended (and unintended) outcomes is thus of practical importance. A strong primary evaluation will explicate the underlying theory or theories of an intervention, and assemble the relevant data to test it. A strong SR will summarize these theories, and synthesize the available evidence to test them.
To do this, authors of an SR may need to engage with a wider literature than is necessary to estimate the effect size of an intervention. Such studies might more explicitly articulate the mechanism(s) through which an intervention is expected to work, or provide a test of this.
An example of such a review is provided by Weisburd et al. (2015), who conducted an SR of broken windows policing (Wilson and Kelling 1982). To test for evidence of the broken windows mechanism (that intervention reduces residents' fear of crime, and this in turn increases their willingness to act collectively to deter crime: p. 6), the authors searched not for studies that examined the impact of intervention on crime but for those that examined the impact on fear of crime and/or collective efficacy. They found no evidence to support this mechanism, but also concluded that "[t]here have simply been too few studies of the mechanisms underlying crime control in the broken windows policing model" (p. 11). We agree, and suggest that this is a more general issue in primary evaluations and SRs of them.
Table 2 lists the types of evidence that could be included in an SR that seeks to explain how an intervention works. As with the rating of effect size, we propose a 5-point scale for assessing the quality of an SR on this dimension.

M -Moderators/contexts: conditions for the activation of the mediator or mechanism
Interventions rarely work unconditionally or equally effectively each time they are applied. The location (geographic or otherwise) and time they are implemented can affect the outcomes observed, as can the characteristics of those who receive or implement them. In deciding if, when, where and on whom to target a specific intervention, policymakers need evidence on which settings and subgroups are most likely to benefit from the intervention, which will most likely be unaffected, and which may have possibly negative outcomes. For this, estimates of mean effect sizes will be insufficient.
Most SRs include statistical moderator analyses to examine effect size variation across subgroups. However, the selection of subgroups varies, as does the rationale for choosing them. Subgroups should not be chosen using standard variables of convenience. Instead, the moderators selected should ideally be those for which the theory (mechanisms and mediators) suggests variations are to be expected. Of course, in the case of SRs, a moderator analysis requires a statistical approach, and so the study authors will be constrained by the data available in the primary studies. However, where the relevant data are unavailable this should be explicitly stated in the review with a view to mobilizing its collection in subsequent primary studies. Table 2 lists the types of evidence that should be included in an SR to document the contexts in which an intervention works, and how the quality of the evidence and the thoroughness with which it was sought out may be assessed.
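To make the idea of a context-based subgroup analysis concrete, the following is a minimal sketch in Python. It assumes a simple inverse-variance (fixed-effect) pooling of effect sizes within each context; the study data, the context labels and the function names are hypothetical and are not drawn from any review discussed here. A real SR would normally also consider random-effects models and a formal test of between-group differences.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical study-level data (NOT from any real review): each tuple is
# (context, effect size, variance of the effect size).
studies = [
    ("car park", 0.40, 0.04),
    ("car park", 0.30, 0.02),
    ("town center", 0.05, 0.01),
    ("town center", 0.10, 0.04),
]


def pooled_effect(effects_and_variances):
    """Inverse-variance (fixed-effect) weighted mean and its standard error."""
    weights = [1.0 / variance for _, variance in effects_and_variances]
    total_weight = sum(weights)
    mean = sum(
        weight * effect
        for weight, (effect, _) in zip(weights, effects_and_variances)
    ) / total_weight
    return mean, sqrt(1.0 / total_weight)


def subgroup_effects(rows):
    """Pool effect sizes separately for each context (the moderator)."""
    groups = defaultdict(list)
    for context, effect, variance in rows:
        groups[context].append((effect, variance))
    return {context: pooled_effect(values) for context, values in groups.items()}


results = subgroup_effects(studies)
for context, (mean, standard_error) in results.items():
    print(f"{context}: pooled effect = {mean:.3f} (SE {standard_error:.3f})")
```

In keeping with the argument above, the grouping variable here would ideally be chosen because theory predicts that the mechanism operates differently across contexts, not because the variable happens to be available.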

I -Implementation: how the policy, practice, treatment or intervention is applied
For both successful and unsuccessful initiatives, it is important for the practitioner to know what was done, what was crucial to the intervention and what difficulties might be experienced if it were to be replicated elsewhere. For example, SRs of hot spots policing (e.g., Braga and Weisburd 2012) suggest that this approach to crime reduction is successful. However, those intending to replicate previous efforts need to know more than this. They need to know what to do. This would include an indication of how a spatial hot spot is defined: what density of crime defines a hot spot, or what should inform the selection of such an area. In the case of police patrols, they need to know how many officers might need to be deployed, how frequently and for how long. They need to know whether the effect on crime depends on patrol dosage (e.g., the number of officer patrol hours per day per unit area). And so on. Where such detail is missing, this is problematic for crime prevention because attempts at replication may vary considerably in terms of what is actually done (e.g., Tilley 1996). It is also problematic for evidence synthesis, as evaluations of interventions that prima facie appear to be the same thing might actually be rather different, and in some cases it may be that nothing was implemented at all. In such cases, the primary evaluator and the systematic reviewer should take account of this.
Finally, even simple interventions can be fraught with difficulties (e.g., Johnson and Loxley 2001; Knutsson and Tilley 2009). Thus, practitioners need to know if particular interventions are easy or difficult to implement, if successful implementation is contingent upon particular conditions, and what is liable to impede or facilitate the process. We suggest that a strong review will focus on the issues listed in Table 2.

E -Economic analysis: the cost-effectiveness of the policy, practice, program, treatment or intervention
In policy terms, it is necessary but not sufficient that a given measure is capable of producing an intended outcome. In addition to the issues already discussed, the cost of intervention will ideally be known.
Estimating costs is complex. Comprehensive costing will include not only costs incurred by those responsible for the policy but also those falling on any third parties implicated in the delivery of interventions, the program participants themselves and those bearing any negative side effects ('indirect costs'). As programs expand, there are often diminishing marginal costs on those delivering interventions, as set-up and capital costs ('fixed costs') are spread over an increasing volume of activity, and so only those variable costs that are explicitly associated with increased output (e.g., police time) will increase.
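The way fixed costs are spread over a growing volume of activity can be illustrated with a short sketch. It assumes the simplest possible cost structure (one fixed set-up cost plus a constant variable cost per unit of activity); the function name and all figures are hypothetical and purely illustrative.

```python
def average_cost_per_unit(fixed_cost, variable_cost_per_unit, units):
    """Average cost per unit of activity under a fixed-plus-variable cost model.

    Assumes a single fixed (set-up/capital) cost and a constant variable
    cost per unit; real programs will rarely be this simple.
    """
    if units <= 0:
        raise ValueError("units must be positive")
    return fixed_cost / units + variable_cost_per_unit


# Hypothetical figures: 50,000 in set-up costs and 200 per unit of activity
# (e.g., police time). The fixed share per unit shrinks as volume grows.
costs = {n: average_cost_per_unit(50_000, 200, n) for n in (10, 100, 1000)}
for units, cost in costs.items():
    print(f"{units} units: average cost per unit = {cost:.2f}")
```

Under these assumptions the average cost per unit falls towards the variable cost as activity expands, which is one reason why costs reported for a small pilot may not transfer directly to a scaled-up program.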
Various forms of economic analysis exist, two of which will be briefly discussed. Cost effectiveness is relatively straightforward. It can speak either to the unit of output (e.g., cost of treatment per day per offender imprisoned) or the unit of outcome (e.g., cost per crime prevented). Such analysis helps to inform practitioners of what it may cost to deliver a given level of intervention, or crime reduction, and enables comparisons across interventions.
Cost-benefit calculations are more difficult as they require monetization of both the costs of intervention and (say) crimes prevented. This is particularly complicated as the range of those implicated expands, as unintended side effects are incorporated and as emotional as well as direct financial costs and benefits are swept into the calculations (see Farrell et al. 2004).
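The difference between the two forms of analysis can be sketched as follows. The example assumes that a total program cost, a count of crimes prevented and a monetized value per crime prevented are all available; the function names and figures are hypothetical, and the monetized value in particular is an illustrative assumption and not an estimate from the literature.

```python
def cost_per_crime_prevented(total_cost, crimes_prevented):
    """Cost-effectiveness expressed per unit of outcome (needs no monetization)."""
    if crimes_prevented <= 0:
        raise ValueError("crimes_prevented must be positive")
    return total_cost / crimes_prevented


def benefit_cost_ratio(total_cost, crimes_prevented, value_per_crime):
    """Cost-benefit ratio: monetized benefits divided by total costs."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return (crimes_prevented * value_per_crime) / total_cost


# Hypothetical inputs for a single intervention.
total_cost = 120_000
crimes_prevented = 80
value_per_crime = 2_500  # assumed monetized cost of one crime

cost_effectiveness = cost_per_crime_prevented(total_cost, crimes_prevented)
ratio = benefit_cost_ratio(total_cost, crimes_prevented, value_per_crime)
print(f"Cost per crime prevented: {cost_effectiveness:.2f}")
print(f"Benefit-cost ratio: {ratio:.2f}")
```

Only the second calculation depends on the monetized value per crime, which is where the additional difficulty and uncertainty described above enter.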
We will not discuss these forms of analysis further (but refer the interested reader to Farrell et al. 2004; McDougal et al. 2008), except to emphasize the fact that the estimation of costs should ideally enumerate the complete portfolio of costs that are necessary to implement an intervention. McDougal et al. (2008) suggest a rating scale to assess the methodological adequacy of SRs that includes a cost-benefit analysis, but this has no provision for rating SRs that include only a cost-effectiveness analysis. Since the latter are helpful to practitioners, Table 2 shows the forms of evidence that could be reported in a review and our proposed quality rating scale for this dimension of EMMIE.

Using 'EMMIE'
We have proposed five dimensions for rating the quality of SRs, described by the acronym EMMIE. Each dimension speaks to a different element of an SR, and may inform the decision-making or activity of different practitioners, or different stages of the policy-making process. Consequently, when rating reviews, we suggest that an EMMIE profile be produced rather than a single overall score. While our focus here has been on the rating of SRs, as noted above, with slight adaptation EMMIE scores can and should also be awarded to primary studies.
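One way to picture an EMMIE profile, as opposed to a single overall score, is as a small record with one rating per dimension. The sketch below is illustrative only: the class and field names are hypothetical, and it assumes that every dimension is rated on an integer scale from 0 to 4, in line with the five-point scales described for the dimensions above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EmmieProfile:
    """Hypothetical container for one review's EMMIE-Q ratings."""

    effect: int
    mechanism: int
    moderator: int
    implementation: int
    economic: int

    def __post_init__(self):
        # Assumption: each dimension is rated on an integer scale from 0 to 4.
        for field in fields(self):
            value = getattr(self, field.name)
            if not isinstance(value, int) or not 0 <= value <= 4:
                raise ValueError(
                    f"{field.name} must be an integer from 0 to 4, got {value!r}"
                )

    def as_text(self):
        """Report the profile dimension by dimension; no overall score is computed."""
        return ", ".join(
            f"{field.name}={getattr(self, field.name)}" for field in fields(self)
        )


# Hypothetical ratings for a single review.
profile = EmmieProfile(effect=3, mechanism=1, moderator=2, implementation=0, economic=0)
print(profile.as_text())
```

The sketch deliberately omits any method that sums or averages the ratings, reflecting the suggestion that the dimensions speak to different needs and should be reported separately.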
The use of EMMIE to rate existing SRs can help practitioners to assess the confidence they should place on the conclusions of a review. Applying the framework on an ad hoc basis will be helpful, but efforts by a consortium of universities led by UCL, in collaboration with the UK College of Policing, are also underway to systematically rate existing SRs using EMMIE (see Bowers et al. 2014; note that future publications will discuss the practicalities of operationalizing the approach and provide empirical examples). The ultimate aim of the exercise is to provide practitioners with an online tool (hosted by the UK College of Policing) to assist their engagement with, and understanding of, the available evidence.
As well as being used to rate existing studies, it is our hope that the EMMIE framework will inform the conduct of future primary studies and SRs. At this point in time, we expect (and have started to find that) existing SRs achieve relatively lower ratings on the MMIE dimensions than they do for effect size (E). However, by encouraging researchers to explicitly focus on these issues in future primary studies and reviews of them, we hope that this will soon change.
With this in mind, three points are worthy of discussion. First, the research methods required to score high on each dimension are liable to differ, some depending heavily on quantitative methods, others on more qualitative approaches, such as realist synthesis (e.g., Pawson 2002). Thus, as is hinted in the title of this article, we encourage the use of mixed-method SRs. Second, to score high on all dimensions of EMMIE, future SRs will ideally employ broader inclusion criteria during the search stage of the review than is traditional, searching for research that addresses dimensions of EMMIE other than effect size. SRs are, of course, time-consuming to conduct and hence some pragmatism will be required. Where an extended search proves to be impractical, we suggest that the review authors note this and synthesize what evidence is uncovered as it speaks to each dimension of EMMIE. Moreover, to set an agenda for primary studies, one role of future SRs will be to explicitly note the absence of evidence for each dimension of EMMIE (see also Gill 2011; Perry et al. 2010).
It is unlikely that any single primary study will or could score full marks on all dimensions of EMMIE. One reason for synthesizing diverse studies is to draw together what is known across all dimensions. Confining attention to the methodological adequacy with which effect sizes are estimated can establish with some certainty what has worked and hence what can work. Limiting attention in this way, however, is less useful in working out what will work, particularly in new conditions, and what needs to be present and what needs to be done to make something work as efficiently and as effectively as possible. Yet the latter are crucial for policy decisions. Consequently, the EMMIE framework is intended to catalyze both primary and secondary research that speaks to this agenda.