Background

The recognition that experimental efficacy studies alone are insufficient to improve public health [1] has led to the rapid expansion of the fields of implementation and improvement sciences [2,3,4,5]. However, studies that aim to identify strategies that facilitate adoption, sustainability, and scalability of evidence may not translate well within traditional efficacy and effectiveness research paradigms [6].

The need for new tools to aid investigators and research stakeholders in implementation science became clear during evaluation of grant submissions to the Evans Center for Implementation and Improvement Sciences (CIIS) at Boston University. CIIS was established in 2016 to promote scientific rigor in new and ongoing projects aimed at increasing the use of evidence and improving patient outcomes within an urban, academic, safety net medical center. As part of its goal to foster rigorous implementation and improvement methods, CIIS established a call for pilot grant applications for implementation and improvement sciences [7]. Proposals were peer-reviewed using traditional National Institutes of Health (NIH) scoring criteria [8]. Through two cycles of grant applications, proposal reviewers identified a need for improved evaluation criteria that could identify specific strengths and weaknesses and rate the potential impact of implementation and/or improvement study designs.

We describe the development and evaluation of the ImplemeNtation and Improvement Science Proposal Evaluation CriTeria (INSPECT): a tool for the standardized evaluation of implementation and improvement research proposals. INSPECT operationalizes the “key ingredients” that Proctor et al. propose as constituting a well-crafted implementation science proposal, and it is designed to operate within the NIH proposal scoring framework [6].

Methods

Assessment of need

CIIS released requests for pilot grant applications focused on implementation and improvement sciences in April 2016 and April 2017 [7]. The request for applications described an opportunity for investigators to receive up to $15,000 for innovative implementation and improvement sciences research on any topic related to improving the processes and outcomes of health care delivery in safety net settings. CIIS funds pilot grants with the goal of providing investigators with the opportunity to obtain preliminary data for further research. Proposals were required to include a specific aims page and a three-page research plan structured within the traditional NIH framework, with subheadings for significance, innovation, approach, environment, and research team. This structure was required because it mirrors the proposal format investigators must follow when applying to the NIH. A study budget and justification, as well as research team biographical sketches, were required with no page limit restrictions. CIIS received 30 pilot grant applications covering a broad array of content areas, such as smoking cessation, hepatitis C, diabetes, cancer, and neonatal abstinence syndrome.

Six researchers with experience in implementation and improvement sciences served as grant reviewers, and four reviewers scored each proposal. Reviewers evaluated the quality of pilot study proposals, assigning numerical scores from 1 to 9 (1 = exceptional, 9 = poor) for each of the NIH criteria (significance, innovation, investigators, approach, environment, overall impact) [8]. CIIS elected to use the NIH criteria because they are the criteria used by the NIH peer review system to evaluate the scientific and technical merit of grant proposals. The CIIS grant review team held a “study section” to review and discuss the proposals. During that meeting, however, reviewers reported that the NIH evaluation criteria, rooted in the traditional efficacy and effectiveness research paradigm, did not offer sufficient guidance for evaluating implementation and improvement science proposals, nor did they provide enough specificity for proposal writers who are less experienced in implementation research. Grant reviewers requested new proposal evaluation criteria that would better inform scoring decisions and feedback to proposal writers on specific aspects of implementation science, including the strength of the implementation study design, strategy, feasibility, and relevance.

Despite the challenges of using the traditional NIH evaluation criteria, the review panel used those criteria to score all of the grants received during the first 2 years of proposal requests. CIIS pilot grant funding was awarded to applications that received the lowest (best) scores under the NIH criteria and received positive feedback from the review panel.

The request for more explicit implementation science evaluation criteria prompted the CIIS research team to conduct a qualitative needs assessment of all 30 pilot study applications in order to determine how the proposals described study designs, implementation strategies, and other aspects of proposed implementation and improvement research. Three members of the CIIS research team (MLD, AJW, DB) independently open-coded pilot proposals to identify properties related to core implementation science concepts or efficacy and effectiveness research [9]. The team identified common themes in the proposals, including an emphasis on efficacy hypotheses, descriptions of untested interventions, and the absence of implementation strategies and conceptual frameworks. The consistent lack of features identified as important aspects of implementation science reinforced the need for criteria that specifically addressed implementation science approaches to guide both proposal preparation and evaluation.

Operationalizing scoring criteria

We identified Proctor et al.’s “ten key ingredients” for writing implementation research proposals [6] as an appropriate framework to guide and evaluate proposals, and we operationalized the ingredients into a scoring system. To construct the scoring system, we created a four-point scale (0–3) for each element. In general, a score of 3 was given if all criteria for the element were fully met; a score of 2 was given if the criteria were somewhat, but not fully, addressed; a score of 1 was given if the ingredient was mentioned but not operationalized in the proposal or linked to the rest of the study; and a score of 0 was given if the element was not addressed at all. Table 1 illustrates the INSPECT scoring system, in which proposals receive one score for each of the 10 ingredients, yielding a cumulative score between 0 and 30.

Table 1 Implementation and Improvement Science Proposal Evaluation Criteria
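To make the scoring arithmetic concrete, the following R sketch shows how ten item-level scores, each from 0 to 3, combine into a single cumulative INSPECT score between 0 and 30. The item labels paraphrase Proctor et al.’s ingredients and the example scores are hypothetical; the sketch is an illustration, not part of the published tool.

```r
# Hypothetical illustration of INSPECT scoring (not the published tool).
# Item labels paraphrase Proctor et al.'s "ten key ingredients".
inspect_items <- c(
  "care_or_quality_gap", "evidence_based_treatment", "conceptual_model",
  "stakeholder_priorities", "setting_readiness", "implementation_strategy",
  "team_experience", "feasibility", "measurement_and_analysis",
  "policy_funding_environment"
)

# Each ingredient receives a score from 0 (not addressed) to 3 (fully met);
# the cumulative INSPECT score is the sum across the ten ingredients.
score_proposal <- function(item_scores) {
  stopifnot(length(item_scores) == length(inspect_items),
            all(item_scores %in% 0:3))
  sum(item_scores)
}

example_scores <- c(3, 0, 0, 2, 1, 0, 2, 1, 0, 1)  # hypothetical reviewer scores
score_proposal(example_scores)                      # returns 10 (out of 30)
```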

Testing INSPECT

We used the pilot study proposals submitted to CIIS to develop and evaluate the utility and reliability of the INSPECT scoring system. Initially, two research team members (ELC, DB) independently applied the 10-element criteria to 7 of the 30 pilot grant proposals. Four team members (MLD, AJW, ELC, DB) then met to discuss these initial results and achieve consensus on the scoring criteria. Two team members (ELC, DB) then independently scored the remaining 23 pilot study applications using the revised scoring system. Both reviewers recorded brief justifications for each of the ten scores assigned to individual study proposals. The two coders (ELC, DB) then met to compare scores, share scoring justifications, and determine the final item-specific scores for each proposal using group consensus.

Inter-coder reliability with the scoring protocol was measured using Krippendorff’s alpha to assess observed and expected disagreement between the two coders’ initial individual item scores [10, 11]. An alpha coefficient of 0.70 was deemed a priori as the lowest acceptable level of agreement to establish reliability of the new scoring protocol [10, 11]. Frequency analyses were conducted to determine the distribution of final element-specific scores (0–3) across all proposals. We calculated a correlation coefficient to assess the association between proposal scores assigned using the NIH framework and scores assigned using INSPECT. All calculations were performed in R version 3.3.2 [12].
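To make the analysis plan concrete, the R sketch below illustrates one way these calculations could be carried out. The kripp.alpha() function from the irr package is one available implementation of Krippendorff’s alpha; the package choice, the ordinal treatment of the 0–3 scores, and the simulated values are assumptions for illustration, not the study’s code or data.

```r
# Illustrative sketch only: simulated ratings stand in for study data.
library(irr)  # provides kripp.alpha()

set.seed(1)
coder1  <- sample(0:3, 30, replace = TRUE)                   # hypothetical item scores, coder 1
coder2  <- pmin(3, pmax(0, coder1 + sample(-1:1, 30, TRUE))) # hypothetical item scores, coder 2
ratings <- rbind(coder1, coder2)                             # kripp.alpha() expects raters in rows

# Krippendorff's alpha, here treating the 0-3 scores as ordinal (an assumption);
# compare the estimate against the a priori 0.70 acceptability threshold.
alpha <- kripp.alpha(ratings, method = "ordinal")
alpha$value >= 0.70

# Frequency distribution of element-specific scores.
table(coder1)

# Correlation between NIH-framework scores and cumulative INSPECT scores
# (hypothetical values; in the study data the observed correlation was negative
# because lower NIH scores indicate stronger proposals).
nih_scores     <- runif(30, min = 1, max = 9)
inspect_scores <- runif(30, min = 0, max = 30)
cor.test(nih_scores, inspect_scores)
```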

Results

Iterative review of the 30 research proposals using Proctor et al.’s “ten key ingredients” resulted in the development and testing of the INSPECT system for assessing implementation and improvement science proposals.

Figure 1 displays the right-skewed distribution of cumulative proposal scores, with most proposals receiving low overall scores. Out of a possible cumulative score of 30, proposals had a median score of 7 (IQR 3.3–11.8).

Fig. 1 Distribution of cumulative proposal scores assigned using ImplemeNtation and Improvement Science Proposal Evaluation CriTeria (INSPECT)

Table 2 presents the distribution of cumulative and item-specific scores assigned to proposals using the INSPECT criteria. Across individual elements, proposals scored highest on the criterion describing care or quality gaps in health services. Thirty-six percent of proposals received the maximum score of 3 for meeting all requirements of the care or quality gap element: using local setting data to support the existence of a gap, explicitly describing the potential for improvement, and linking the proposed research to funding priorities (i.e., the safety net setting).

Table 2 Distribution of ImplemeNtation and Improvement Science Proposal Evaluation CriTeria (INSPECT) Scores

Proposals generally scored poorly for other criteria. As shown in Table 2, most study proposals received scores of 0 in the categories of evidence-based treatment to be implemented (50%), conceptual model and theoretical justification (70%), setting’s readiness to adopt new services/treatments/programs (53%), implementation strategy/process (67%), and measurement and analysis (70%). For example, reviewers gave scores of 0 for the “evidence-based treatment to be implemented” element when the proposed intervention was not evidence-based and the project sought to establish efficacy rather than to examine uptake of an established evidence-based practice. Similarly, proposals that sought only to study effectiveness and did not assess any implementation outcomes [13] (e.g., adoption, fidelity) received scores of 0 for “measurement and analysis.” None of the study proposals primarily aiming to assess effectiveness outcomes expressed the dual research intent of a hybrid design. Scores of 0 for other categories were given when applications lacked any description relevant to the category, such as no conceptual model, no implementation strategy, or no research team skills relevant to implementation or improvement science.

Table 2 also displays the rates of inter-coder reliability observed when applying INSPECT to the 30 pilot study proposals. The overall alpha coefficient between the two coders was 0.88. Item-level alpha coefficients ranged from 0.77 to 0.99, all above the 0.70 reliability threshold.

Additionally, we observed a moderate inverse correlation (r = −0.62, p < 0.01) between the proposal scores initially assigned using the NIH framework and the scores assigned using INSPECT.

Discussion

We developed a reliable proposal scoring system that operationalizes Proctor et al.’s “ten key ingredients” for writing an implementation research grant [6]. Previous research analyzing peer-review grant processes has highlighted a need to improve scoring agreement between peer reviewers [14]. High levels of disagreement in assessors’ interpretation of grant scoring criteria result in unreliable peer-review processes and funding decisions based more on chance than scientific merit [14]. Measuring inter-rater reliability is a standard approach for evaluating the utility of existing proposal scoring criteria and assessing efforts to improve them [15, 16]. Application of the INSPECT system demonstrated high inter-rater reliability overall and within each of the 10 items. The high degree of reliability measured for INSPECT may be related to the specificity of its design as scoring criteria for implementation and improvement science. A review of scoring rubrics reported in the scientific literature suggests that topic-focused criteria contribute to increased scoring reliability [17]. Additionally, the moderate correlation between scores assigned using the NIH framework and scores assigned using INSPECT suggests the validity of the INSPECT criteria in evaluating proposal quality. Proctor et al.’s “ten key ingredients” for grant writers were developed to map onto the existing NIH criteria. Our operationalized version of the ingredients as scoring criteria demonstrated that proposals that scored poorly under the NIH criteria also scored poorly under INSPECT.

Applying the INSPECT system to proposed implementation and improvement science research at an academic medical center improved proposal reviewers’ ability to identify specific strengths and weaknesses in the implementation approach. Overall, proposals received high scores only for identifying the care gap or quality gap. Since efficacy research and implementation or improvement research may use similar techniques to establish the significance of the study questions [18], proposals may score well on describing the quality gap even if they go on to describe efficacy hypotheses that receive low overall scores under the INSPECT system. Further studies should explore techniques for describing care and quality gaps that highlight implementation or improvement research questions.

Consistently low scores in four areas—defining the evidence-based treatment to be implemented, conceptual model and theoretical justification, setting’s readiness to adopt new programs, and measurement and analysis—suggest that many investigators seeking to conduct implementation research may have misconceptions about the fundamental goals of this field. One misconception may relate to a sole focus on evaluating an intervention’s effectiveness rather than studying the processes and outcomes of implementation strategies. The majority of study proposals evaluated using INSPECT neither aimed to improve uptake of any evidence-based practice nor included any implementation measures such as acceptability, adoption, feasibility, fidelity, penetration, or sustainability [19]. Inadequate and inconsistent descriptions of implementation strategies and outcomes represent major challenges to overall implementation study success [20]. In addition to guidance provided by the INSPECT criteria, recent efforts to develop implementation study reporting standards [21] may assist proposal writers in describing planned research.

Several proposals addressed treatments or practices with limited evidence of their potential to improve healthcare. Although hybrid studies, which examine both effectiveness and implementation outcomes, are practical approaches to establishing the effectiveness of evidence-informed practices while measuring implementation efforts [18], none of the study proposals expressed this dual research intent or were conceived as hybrid designs.

Our findings also suggest low familiarity with, and use of, resources for evaluating the strength of evidence (such as the Grading Quality of Evidence and Strength of Recommendations system [22] and the Strength of Recommendation Taxonomy grading scale [23]) in implementation science research. A more systematic evaluation of the strength of evidence [24,25,26,27] necessary to warrant implementation efforts may help to differentiate implementation science from efficacy or effectiveness research and improve understanding of the utility that hybrid studies offer [28].

Expanding access to implementation science training in universities as part of the core health services research curriculum, and enhancing access to professional development opportunities that focus on conceptual and methodological implementation skills in a content-agnostic way, would aid in building capacity for the next generation of implementation science researchers. Additionally, training programs offer an opportunity to provide guidance on both writing and evaluating the quality of implementation science grant applications.

Strengths of our results include the application of INSPECT to study proposals submitted by investigators with a wide range of implementation and improvement science experience and covering a variety of content areas. However, our results are limited in that they characterize one academic institution’s familiarity with implementation and improvement science research, and the INSPECT system requires validation in other settings and over a broader range of proposal ratings. Additionally, we measured a high degree of inter-rater reliability for INSPECT when it was applied to a sample of low-scoring proposals. INSPECT’s inter-rater reliability may decrease when it is applied to a sample of higher-quality proposals in which reviewers are required to discriminate between gradations of quality (i.e., scores of 1–3) rather than mostly scoring the absence of key items (i.e., scores of 0). Future research should test the validity of INSPECT by comparing INSPECT-assigned scores to ratings assigned to approved proposals by the NIH Dissemination and Implementation Research in Health study section. Future research should also assess the relationship between INSPECT score assignments and successful study completion to determine the utility of INSPECT as a mechanism for ensuring the quality and impact of funded research. To aid in these prospective research efforts, forthcoming proposal calls from CIIS will specifically use INSPECT as the proposal evaluation criteria.

Although multiple tools exist to aid researchers in writing implementation science proposals [6, 29, 30], few resources exist to support grant reviewers. This study extended the functionality of Proctor et al.’s “ten key ingredients,” originally a guide for proposal writers, by developing them into a detailed checklist for proposal reviewers. The current research makes a substantive contribution to implementation and improvement sciences by demonstrating the utility and reliability of a new tool designed to aid grant reviewers in identifying high-quality research.

Conclusion

In conclusion, we operationalized an implementation and improvement research-specific scoring system to provide guidance for proposal writers and grant reviewers. We demonstrated the utility and reliability of the new INSPECT scoring system in evaluating the quality of implementation and improvement sciences research proposed at one academic medical center. The prevalence of low scores across the majority of INSPECT criteria suggests a need to promote education about the goals of implementation and improvement science, including their conceptual and methodological distinctions from efficacy and effectiveness research.