Background

In the field of implementation science, a considerable number of theories and frameworks are being used to better understand implementation processes and to guide the development of strategies to improve the implementation of health innovations [1–3]. Many of these theories and frameworks, however, have not been tested empirically. As such, examining the utility of theories and frameworks has been recognised as critical to advancing the field of implementation science [4].

The assessment of implementation theories and frameworks necessitates robust measures of their theoretical constructs. Psychometric properties important for measures used in implementation research have been proposed [5] and include the following: reliability (internal consistency and test-retest); validity (construct and criterion); broad application (validated in different settings and cultures); and sensitivity to change (responsiveness). Tools which are acceptable and feasible, and which display face and content validity, are also particularly useful for researchers in real-world settings [5]. Furthermore, the psychometric characteristics of measures that assess a comprehensive range of implementation constructs have been highlighted as a particular priority area of research [4].

A number of reviews of implementation measures exist [6–13]. Such reviews indicate that the quality of existing measures of implementation constructs is limited. A review by Brennan and colleagues, for example, identified 41 instruments designed to assess factors hypothesised to influence quality improvement in primary care [6]. The review found that while most studies reported the internal consistency of instruments, very few assessed the construct validity of the measures using factor analysis [6]. Similarly, in a review of the psychometric properties of research utilisation measures used in health care, Squires and colleagues found that, of the 97 identified studies (60 unique measures), only 31 reported internal consistency and only 3 reported test-retest reliability [13]. Twenty percent of the included measures had not undergone any type of validity testing, and no studies reported on measure acceptability [13].

There are a number of limitations of previous reviews. Most do not provide comprehensive details of the psychometric properties of included measures [7, 8, 12] or address only a small number of constructs or outcomes relevant to implementation science [8, 10]. Additionally, the majority of these reviews focus primarily on measures developed for use in healthcare settings [6, 9, 11, 13]. Evidence from the field of psychometric research suggests that a measure's reliability and validity can change when a measure developed in one setting is applied in another setting with different characteristics, even when administered to similar population groups [14, 15].

Currently, a comprehensive review of measures of implementation constructs is being conducted by the Society for Implementation Research Collaboration (SIRC) Instrument Review Project [16, 17]. The SIRC review addresses some of the limitations of past reviews by extracting a range of psychometric properties from identified measures and assessing a more comprehensive range of outcomes [18] and constructs relevant to implementation science [19]. The outcomes of interest in the SIRC review are taken from Proctor and colleagues’ Implementation Outcomes Framework (IOF) and focus on the appropriateness, acceptability, feasibility, adoption, penetration, cost, fidelity, and sustainability of the intervention itself [18]. The constructs of interest for the review are drawn from the Consolidated Framework for Implementation Research (CFIR), which outlines factors or conditions deemed important to support the successful implementation of an intervention [19]. The constructs are grouped under five domains which describe the following: (1) Intervention characteristics (details of the intervention itself); (2) Outer setting (factors of influence which are external to an organisation); (3) Inner setting (internal characteristics of an organisation such as culture and learning climate); (4) Characteristics of individuals (actions and behaviours of individuals within the organisation); and (5) Process (systems and pathways within an organisation) [19].

To date, the SIRC review has uncovered 420 instruments related to 34 of the CFIR constructs and 104 instruments related to Proctor and colleagues’ IOF [16, 17]. At present, the data are available for the measures relevant to the inner setting domain of the CFIR and the IOF [20]. However, while comprehensive, the SIRC review only pertains to measures primarily applied to healthcare or mental health care settings, where the individuals responsible for implementing health-related interventions are most likely to be healthcare professionals [16, 17]. In the field of public health, the implementation of health-related interventions often occurs in non-clinical settings, with non-healthcare professionals responsible for implementing these changes. Therefore, there is a need to identify measures which have been developed specifically to measure constructs important for the implementation of health-related interventions in community settings, where the primary role of the organisations and individuals is not healthcare delivery.

To our knowledge, no previous reviews of measures of implementation constructs have focussed on instruments designed for use in a broad range of community settings. Such measures are of particular interest to public health researchers who are utilising implementation theories or frameworks to support evidence-based practice in these settings. As such, the aim of this study was to (1) systematically review the literature to identify measures of implementation constructs which have been developed in community settings; (2) describe each measure’s psychometric properties; and (3) describe how the domains of each measure align with the five domains and 37 constructs of the CFIR.

Methods

Scope of this review

The focus of this review was to identify, from the peer-reviewed literature, measures which have been developed for use in community-based (non-clinical) settings and which measure constructs aligned to the CFIR. These measures were then examined to determine their psychometric properties and to identify which of the CFIR constructs they captured. In this review, ‘measures’ are defined as surveys, questionnaires, instruments, tools, or scales which contain individual items that are answered or scored using predefined response options. ‘Constructs’ are defined as the broad attributes or characteristics which these items (usually grouped into domains) are attempting to capture. The constructs of interest were chosen to align with the CFIR, as this framework is the most comprehensive, drawing together numerous theories which have been developed to guide the planning and evaluation of implementation research and combining them into one uniform framework with overarching domains [19].

Design

A systematic search and review was conducted to address the broad question of ‘what psychometrically robust measures are currently available to assess implementation research in public health and community settings?’. A comprehensive search of peer-reviewed publications was conducted using four electronic databases, and the quality of identified measures was assessed using well-established, pre-defined psychometric criteria.

Eligibility

Publications were included if they (1) were peer-reviewed journal articles reporting original research results; (2) reported research from non-clinical settings; (3) reported details regarding the development of a measure; (4) described a measure which assessed at least one of the 37 CFIR constructs; (5) described a measure which was being applied to a specific innovation or intervention; and (6) used statistical methods to assess the measures’ factor structure.

In this review, clinical settings included the following: hospitals, general practices, allied health facilities such as physiotherapy or dental practices, rehabilitation centres, psychiatric facilities, and any other settings where the delivery of health or mental health care was the primary focus. Non-clinical settings included schools, universities, private businesses, childcare centres, correctional facilities, and any other settings where the delivery of health or mental health care was not the primary focus. Given that an aim of the study was to map the domains of included measures against constructs within the CFIR, it was important that measures displayed a minimum level of construct validity via exploratory or confirmatory factor analysis.

Duplicate abstracts were excluded from the review, as were abstracts describing reviews, editorials, commentaries, protocols, conference abstracts, and dissertations. Publications which reported on measures developed using qualitative methods only were also ineligible.

Search strategy

A search of MEDLINE, PsycINFO, EMBASE, and CINAHL databases was conducted to identify publications describing the development of measures to assess factors relevant to the implementation of innovations. These four databases were selected as they index journals from the field of implementation science and provide extensive coverage of research across a range of public health and community settings, such as schools, pharmacies, businesses, nursing homes, sporting clubs, and childcare facilities.

Prior to the database searches being conducted, four authors met to ensure that the chosen keywords accurately captured the constructs of interest and that keywords were combined using the correct Boolean operators [21]. The core search terms comprised keywords relating to measurement, the psychometric properties of instruments, the levels at which measurement could occur (e.g. organisational or individual), and the goals of research implementation. These keywords were as follows: [questionnaire or measure or scale or tool] AND [psychometric or reliability or validity or acceptability] AND [organisation* or institut* or service or staff or personnel] AND [implement* or change or adopt* or sustain*].

Similar to the strategy used in the SIRC review [16, 17], the core search terms were combined with five more keyword searches designed to capture the constructs within each of the five CFIR domains: (1) Intervention Characteristics [strength or quality or advantage or adapt* or complex* or pack* or cost]; (2) Outer Setting [needs or barrier* or facilitate* or resource* or network or external or peer or compet* or poli* or regulation* or guideline* or incentive*]; (3) Inner Setting [structur* or communication or cultur* or value* or climate or tension or risk* or reward* or goal* or feedback or commitment or leadership or knowledge*]; (4) Characteristics of Individuals [belief* or attitude* or self-efficacy or skill* or identi* or trait* or ability* or motivat*]; or (5) Process [plan* or market or train or manager or team or champion or execut* or evaluat*].
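For illustration, the sketch below shows how the core search terms were combined with one of the five domain keyword blocks. This is a minimal Python example of our own construction; the variable names and composition logic are illustrative, and the actual syntax for field codes and truncation varied by database platform.

```python
# Illustrative sketch: composing the core Boolean search with the
# Intervention Characteristics keyword block. Names are hypothetical;
# each database required its own field codes and truncation syntax.

core_blocks = [
    "questionnaire OR measure OR scale OR tool",
    "psychometric OR reliability OR validity OR acceptability",
    "organisation* OR institut* OR service OR staff OR personnel",
    "implement* OR change OR adopt* OR sustain*",
]

intervention_characteristics = (
    "strength OR quality OR advantage OR adapt* OR complex* OR pack* OR cost"
)

def build_query(domain_block: str) -> str:
    """Join the four core blocks and one domain block with AND."""
    blocks = core_blocks + [domain_block]
    return " AND ".join(f"({block})" for block in blocks)

print(build_query(intervention_characteristics))
```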

The keyword search terms were repeated for all four databases. Keyword searches were limited to the English language; however, no limit was placed on the year of publication, as measurement tools often evolve over many years. Medical Subject Headings (MeSH) were not used in the literature search, as keyword searches have been found to have higher sensitivity, being more successful than subject searching in identifying relevant publications [22].

Identification of eligible publications

One author coded all abstracts according to the inclusion and exclusion criteria. A second author cross-checked 10 % of the abstracts to confirm they had been correctly classified. Full-text versions of publications were obtained for included abstracts. To ensure that no relevant tools had been missed, previous systematic reviews [7, 8, 10] were also screened for relevant measures, as were tools included on the SIRC Instrument Review Project website [20]. Copies of publications for any additional measures that met the inclusion criteria were obtained. Full-text versions of all eligible publications were then obtained and screened to identify the names and acronyms of all relevant measures they described. The reference lists of all eligible publications were also screened for any additional measures, and Google Scholar was used to conduct cited-reference searches. A final literature search was conducted by ‘measure name’ and ‘author names’ using Google Scholar. This search strategy aimed to capture as many publications as possible relating to the psychometric development, validation, revalidation, and cross-cultural adaptation of the identified measures.

Extraction of data from eligible publications

The properties of each measure were extracted from all full-text publications relating to the development of the measure using data reported in the manuscript text, tables, or figures. Extracted data included: (1) the research setting, sample, and characteristics of the intervention or innovation being assessed; (2) psychometric properties including face and content validity, construct and criterion validity, internal consistency, test-retest reliability, responsiveness, acceptability, and feasibility; and (3) whether the measure had undergone a process of revalidation or cross-cultural adaptation.

The psychometric properties of each measure were independently assessed by two authors using the same criteria described in previous systematic reviews [23, 24] and according to the guidelines for the development and use of tests, including the Standards for Educational and Psychological Testing [5, 25, 26]. The Standards provides a frame of reference to ensure all relevant issues are addressed when developing a measure and allows the quality of measures to be evaluated by those who wish to use them [25]. Following the assessment of psychometric properties, two authors then independently coded each publication to determine which measure domains corresponded with which CFIR constructs. When discrepancies emerged, a third author assisted in reaching consensus.

Psychometric coding

Setting, sample, and characteristics of the innovation being assessed

Details regarding the country and setting where the measure was developed, characteristics of the innovation or intervention being assessed, response rate, sample size, and demographic characteristics of the sample (gender and profession) who completed the measure were described.

Face and content validity

An instrument is said to have face validity if both the administrators and those who complete it agree that it measures what it was designed to measure [27]. To have content validity, the description of the measure’s development needed to include: (1) the process by which items were selected; (2) who assessed the measure’s content; and (3) what aspects of the measure were revised [14, 28]. Information regarding any theories or frameworks that the measure was developed to test, as well as whether items were adapted from previously validated measures, was also extracted.

Construct and criterion validity

A measure was classified as having good internal structure (construct validity) if exploratory factor analysis (EFA) was performed with eigenvalues set at >1 [14, 29] and >50 % of the variance was explained [30], or confirmatory factor analysis (CFA) was performed with a root mean square error of approximation (RMSEA) of <0.06 and a comparative fit index (CFI) of >0.95 [31, 32]. The number of items and domains in the measure following factor analysis was recorded. Additional construct validity was determined by assessing whether the measure had convergent validity (correlations (r) >0.40) with similar instruments or divergent validity (correlations (r) <0.30) with dissimilar instruments [33]. Criterion validity was determined by assessing whether the measure was able to obtain different scores for sub-populations with known differences (known-groups validity) [34].
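As a worked sketch of the EFA criterion above, the following Python/numpy function applies the eigenvalue >1 rule to the item correlation matrix and reports the variance explained by retained factors. This is our own simplified, principal-components-style approximation; the included studies estimated full factor models in dedicated statistical software.

```python
import numpy as np

def kaiser_variance_check(item_scores):
    """Apply the eigenvalue > 1 retention rule to a respondents-by-items
    score matrix and report the variance explained by retained factors.
    A simplified, principal-components approximation of the EFA criterion."""
    corr = np.corrcoef(item_scores, rowvar=False)  # item correlation matrix
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]   # sorted descending
    retained = eigenvalues[eigenvalues > 1]
    # Eigenvalues of a correlation matrix sum to the number of items, so
    # this ratio is the proportion of total variance explained.
    variance_explained = retained.sum() / eigenvalues.sum()
    return len(retained), variance_explained

# Example with random (structureless) data:
rng = np.random.default_rng(0)
n_factors, var = kaiser_variance_check(rng.normal(size=(200, 10)))
# Criteria used in this review: eigenvalues > 1 and > 50 % variance explained.
```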

Internal consistency and test-retest reliability

To meet the criteria for internal consistency, correlations for a measure’s subscales and total scale needed to have a Cronbach’s alpha (α) of >0.70, or a Kuder-Richardson 20 (KR-20) of >0.70 for dichotomous response scales [28]. For test-retest reliability, the measure needed to have undergone a repeated administration with the same sample within 2–14 days [35]. Agreement between scores from the two administrations needed to be calculated, with item, subscale, and total scale correlations having (1) a Cohen’s kappa coefficient (κ) of >0.60 for nominal or ordinal response scales [14]; (2) a Pearson correlation coefficient (r) of >0.70 for interval response scales [14, 28]; or (3) an intraclass correlation coefficient (ICC) of >0.70 for interval response scales [14, 28].
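For reference, Cronbach’s alpha can be computed directly from an item score matrix. A minimal sketch follows (Python/numpy, our own illustration); for dichotomous 0/1 items the same formula yields the KR-20 coefficient.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a respondents-by-items score matrix.
    Applied to dichotomous (0/1) items, this reduces to KR-20."""
    x = np.asarray(item_scores, dtype=float)
    k = x.shape[1]                         # number of items
    item_vars = x.var(axis=0, ddof=1)      # variance of each item
    total_var = x.sum(axis=1).var(ddof=1)  # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Criterion used in this review: alpha (or KR-20) > 0.70.
```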

Responsiveness, acceptability, feasibility, revalidation, and cross-cultural adaptation

A measure’s potential to detect change over time was confirmed if it could show a moderate effect size (>0.5) for a given change [14, 28, 36], and if it had minimal floor and ceiling effects (less than 5 % of the sample achieved the highest or lowest scores) [37]. To determine acceptability and feasibility (burden associated with using the measure), data on the following were extracted: proportion of missing items, time needed to complete, and time needed to interpret and score [28]. Data from publications reporting the revalidation of a measure with additional samples, or in different languages or cultures, were also extracted [28].
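These two responsiveness checks can be operationalised as in the sketch below (our own assumptions; the standardised response mean is used here as one common effect-size formulation, though the included studies may have computed effect sizes differently).

```python
import numpy as np

def standardised_response_mean(pre, post):
    """Effect size for change: mean change divided by the SD of change.
    The review's responsiveness criterion was an effect size > 0.5."""
    diff = np.asarray(post, dtype=float) - np.asarray(pre, dtype=float)
    return diff.mean() / diff.std(ddof=1)

def floor_ceiling_effects(total_scores, scale_min, scale_max):
    """Proportion of respondents at the lowest/highest possible scores;
    each should be below 0.05 (5 %) to satisfy the criterion above."""
    scores = np.asarray(total_scores, dtype=float)
    return np.mean(scores == scale_min), np.mean(scores == scale_max)
```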

CFIR coding

The domains of each included measure were assessed to determine whether the factors they measured corresponded with one or more of the 37 CFIR constructs [19]. A brief summary of each of the CFIR constructs is presented in Additional file 1. The mapping process was domain-focused (i.e. mapping the overall measure domains to constructs) rather than item-focused (i.e. mapping individual items to constructs) to ensure that the overall construct was well captured. Within a measure, only one domain needed to be judged by the reviewers as addressing a CFIR construct for that construct to be counted. Therefore, it was possible that a measure with five domains might have only one of its domains mapped to a CFIR construct. Similarly, a measure with three domains might have all three contributing to the same CFIR construct. In the latter scenario, the construct was counted only once.
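To make the counting rule concrete, consider the hypothetical example below (the domain and construct names are illustrative only): when all three domains of a measure map to the same CFIR construct, that construct is counted once for the measure.

```python
# Hypothetical measure with three domains, all judged by reviewers to
# address the same CFIR construct; the construct is counted only once.
domain_to_construct = {
    "Management support":     "Readiness for implementation",
    "Staffing and resources": "Readiness for implementation",
    "Staff commitment":       "Readiness for implementation",
}
constructs_addressed = set(domain_to_construct.values())
print(len(constructs_addressed))  # 1, not 3
```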

Analysis

Descriptive statistics (frequencies and proportions) were used to report the number of domains from the included measures which were mapped to each of the CFIR constructs and CFIR domains. Frequencies and proportions were also used to describe the number of measures which met various psychometric criteria.

Results

Identified measures of implementation constructs

The initial searches of MEDLINE, PsycINFO, EMBASE, and CINAHL identified 8547 potentially relevant publications. Of these, 5195 were duplicates, leaving 3352 publication abstracts to be coded. Of these 3352 publications, 3317 did not meet the inclusion criteria (see Fig. 1 for the PRISMA diagram), leaving 35 eligible publications. The process of identifying measures included in systematic reviews related to the current review [7, 8, 10], and a secondary literature search by measure or author name, led to the inclusion of an additional 30 publications. A total of 65 full-text publications were retained, describing 51 unique measures.

Fig. 1 PRISMA flow diagram of the publication and measure inclusion process

Psychometric properties of measures

Setting, sample, and characteristics of the innovation being assessed

Table 1 outlines the details of the setting, sample, and characteristics of the innovation being assessed by each measure. The majority of measures were developed in the USA (n = 28), with Canada and Australia each contributing three or more measures. Sixteen measures were developed for use in school settings [38–52], six for use in universities or colleges [53–60], three for use in pharmacies [61–63], two for use in police or correctional facilities [64, 65], two for use in nursing homes [66, 67], six for use with whole communities or in multiple settings [68–75], and sixteen for use in workplace settings or other organisations (e.g. utility companies, IT service providers, human services) [76–92]. A broad range of innovations or interventions were assessed, with technology-focussed innovations featuring prominently. Sample sizes in each study ranged from 31 to 1358, and response rates ranged from 15 to 98 %. Sample characteristics (i.e. gender and profession of participants) were inconsistently reported across the studies.

Table 1 Setting, sample, and characteristics of the innovation being assessed

Face and content validity

Almost all measures (n = 47) had undergone a process of face and content validation. The development of 36 measures was guided by an existing theory or framework (Additional file 2). No measures were specifically designed to address all constructs considered important for the implementation of innovations by the CFIR. Twenty-six measures had adapted at least some of their items from pre-existing instruments (Additional file 2).

Construct and criterion validity

The internal structure of 45 instruments was determined via EFA (11 of these also used CFA [42, 49, 52, 54, 55, 59, 65, 67, 77, 78, 82, 91–93]), and six studies used CFA alone [39, 40, 68, 72, 75, 83, 94] (Additional file 3). For studies which conducted EFA, 46 % reported that >50 % of the variance was explained by the final factor model. None of the studies that used CFA alone reported acceptable RMSEA (<0.06) or CFI (>0.95). Across all measures, the number of items ranged from 9 to 149, and the number of factors (domains) ranged from 1 to 20. Eight measures were tested for criterion validity using sub-populations with known differences. These measures demonstrated a capacity to distinguish between groups known to differ on characteristics including teaching experience [47], familiarity with technology [59], age [58], and managerial status (managers vs. non-managers) [77]. Only two measures [41, 82] reported testing for convergent/divergent validity against existing instruments, and only one [82] met the required threshold (significant correlations >0.40 with similar instruments or <0.30 with dissimilar instruments). In this instance, these relationships were reported only for some individual domains rather than for the total scale score.

Internal consistency and test-retest reliability

Fifty of the 51 included measures reported on the internal consistency of either the total scale or the individual domains (Additional file 4). The internal consistency of both the total scale and the domains was reported for four measures [40, 61, 66, 76], the internal consistency of the total scale only was reported for five measures (all alphas >0.70) [47, 49, 51, 75, 83], and the internal consistency of the scale domains only was reported for the remaining 41 measures. Twenty measures achieved a Cronbach’s alpha of >0.70 for all of their domains [38, 40, 41, 48, 50–52, 54, 59, 60, 63, 76, 79, 81, 84, 85, 87, 89, 90, 95, 96], indicating that more than half of the measures did not meet the acceptable threshold for at least one domain. Three measures were examined for test-retest reliability [47, 73, 84]. The administration period was acceptable (2–14 days) for all measures, and adequate test-retest reliability (Pearson’s correlations >0.70) was achieved for all measures, with the exception of one domain (awareness, r = 0.65) in the Stages of Concern Questionnaire [74].

Responsiveness, acceptability, feasibility, revalidation, and cross-cultural adaptation

Seventeen measures reported acceptability and feasibility data, with five studies reporting the time taken to complete the measure (range 10–70 min; M = 34.6 min) [39, 64, 73, 81, 90] and six studies reporting the proportion of missing items observed following administration (range 1.5–5 %) [52, 59, 63, 67, 75, 84] (Additional file 5). Seven studies examined responsiveness in relation to effect sizes [38, 47, 67, 69, 75, 93, 97], and all but one [67] reported an effect size above the threshold criterion of 0.5, indicating that these measures are capable of detecting moderate change (Additional file 5). No studies reported floor or ceiling effects. Thirteen measures were revalidated in new settings and with different populations across a number of additional studies [55, 77, 91, 96, 98–112].

A summary of the psychometric criteria reported by the included measures can be seen in Table 2.

Table 2 Summary of psychometric properties reported for each measure

Mapping of measure domains that align with the 37 constructs of the CFIR

The number of measure domains that mapped onto the CFIR constructs ranged from 1 to 19. Relative advantage, networks and communications, culture, implementation climate, learning climate, readiness for implementation, available resources, and reflecting and evaluating were the constructs most frequently addressed by the included measures. Five of the CFIR constructs were not addressed by any measure (Additional file 6). These five constructs were as follows: intervention source, tension for change, engaging, opinion leaders, and champions.

Discussion

To our knowledge, this is the first systematic review to describe the psychometric properties of measures developed specifically in public health and community settings to assess innovations and implementation constructs. Overall, the psychometric properties of included measures were typically inadequately assessed or not reported. No single measure reported on all key psychometric quality indicators. The majority of studies assessed face, content, and construct validity, and internal consistency. However, criterion validity (known-groups), test-retest reliability, and acceptability and feasibility were rarely reported. Only seven measures had their responsiveness to change assessed. These findings mirror those of previous reviews [7, 13], which found that few measures demonstrated test-retest reliability, acceptability, or criterion validity.

When measures did report psychometric data, the values were often below the widely accepted thresholds defined in this review. Almost half of the measures that reported undertaking EFA found that their final factor model explained <50 % of the variance. Furthermore, none of the measures that used CFA alone reported satisfactory RMSEA (<0.06) or CFI (>0.95) values. This suggests that a notable proportion of the implementation measures currently available for use in non-clinical settings are not particularly robust, or may be based on misspecified factor models. That only eight of the 51 measures explored criterion validity using known groups is also concerning. The lack of attention to known-groups validity limits the confidence we can place in these measures’ ability to detect how groups within community settings (e.g. experienced teachers vs. new teachers) vary with regard to the implementation of an innovation. This is important for identifying which aspects of an intervention or innovation might need to be adjusted to ensure more robust implementation in the future.

Internal consistency was frequently reported, but only 40 % of measures reported that all scale domains had a Cronbach’s alpha >0.70, highlighting a need for further refinement of scale items and revalidation. Only three measures assessed test-retest reliability, another area requiring much greater attention in future studies. Those measures that were assessed for test-retest reliability performed well, meeting the vast majority of the threshold criteria; however, the stability of these types of measures over time remains unclear. Acceptability and feasibility data were reported for just 33 % of the measures. The mean completion time was almost 35 min. Although shorter questionnaires have been shown to improve response rates [113], the optimal survey length for maintaining validity remains unclear. Rates of missing data ranged from 1.5 to <5 %, which is acceptable according to Schafer [114], who suggests that missing data rates of less than 5 % are likely to be inconsequential. Only 25 % of measures had been revalidated or validated in a different culture. This limits the generalisability of the measures and poses a significant barrier to research translation within potentially underserved communities or cultures [115].

Without more comprehensive assessment of the psychometric properties of these instruments, the ability to ascertain the utility of theories or frameworks to support the implementation of innovations in public health and community settings is limited. For example, understanding the responsiveness of measures is essential for evaluating implementation interventions and ensuring that changes in constructs over time can be detected [116, 117]. Having measures which are acceptable and feasible is also important for the conduct of rigorous research, particularly in more pragmatic research studies [5, 18]. Low survey response rates or high rates of attrition due to onerous research methods can introduce bias and compromise a study’s internal and external validity [118, 119].

Alignment of measure domains with constructs of the CFIR

While some of the CFIR constructs were addressed by domains from multiple measures in this study, five constructs were not assessed by any measure. These were intervention source, tension for change, engaging, opinion leaders, and champions. The development of psychometrically robust measures which can assess these constructs in public health and community settings may be a priority area of research for the field.

The most frequently addressed constructs fell within the ‘inner setting’ and ‘characteristics of individuals’ domains, suggesting that the focus of measures to date has been on understanding only the immediate environment where the innovation or intervention will be implemented. Measures addressing ‘outer setting’ or ‘process’ constructs were observed less frequently than those addressing other domains. The development of future measures should target these domains of the CFIR to ensure a greater breadth and depth of understanding of all factors which may influence the implementation of evidence into practice in public health and community settings.

Comparison of the current review with the SIRC Instrument Review Project

Despite the similarity in the review methodologies utilised by the current review and that undertaken by SIRC [16], few measures were reported by both reviews. This is not surprising: although the SIRC review captured some measures developed in education or workplace settings, other public health and community settings were not addressed. Furthermore, the SIRC review used much broader inclusion criteria with regard to measures of CFIR constructs. For example, for the construct of ‘self-efficacy’, the SIRC review includes all measures of self-efficacy, regardless of the context in which self-efficacy is being examined. In contrast, the current review only includes measures which assess self-efficacy in the context of an individual’s perceived ability to implement the target innovation.

Despite these differences, the use of a common framework (CFIR) for examining constructs captured by different measures in the current review promotes consistency and complements the findings of the SIRC review.

Limitations

It is possible that not all existing implementation measures in public health and community settings were captured by this review. The keywords used to identify measures were limited to ‘questionnaire’, ‘measure’, ‘scale’, or ‘tool’; other possible terms such as ‘instrument’ and ‘test’ were not used. These terms were excluded due to the likelihood of identifying non-relevant publications related to clinical practice (e.g. surgical instruments, immunologic tests). However, their exclusion may have meant that some relevant publications were not identified during the database search. Additionally, the review did not assess measures published in the grey literature, and only studies published in English were included. Nevertheless, it is likely that the measures which were identified represent the best available evidence, given their publication in peer-reviewed journals and indexing in four scientific databases. The psychometric properties chosen for extraction from publications about each measure may have also limited the findings. For example, for studies that utilised CFA, only data pertaining to the RMSEA and CFI were recorded, based on recommendations by Schmitt [32]. Included publications may have reported additional CFA fit indices (such as the goodness-of-fit index (GFI) or the normed fit index (NFI)); however, these were not extracted in this review.

Despite these limitations, the findings from this review are likely to be of value to public health researchers who are looking to identify measures with robust psychometric properties that can be used to assess implementation constructs. There are, however, a small number of constructs for which no measure could be identified. Developing measures which can assess these five remaining constructs will be an important consideration for future research.

Conclusion

Existing measures of implementation constructs for use in public health and community settings require additional testing to enhance their reliability and validity. Further research is also needed to revalidate these measures in different settings and populations. At present, no single measure, or combination of measures, can be used to assess all constructs of the CFIR in public health and community settings. The development of new measures which can assess the broader range of implementation constructs across all of the CFIR domains should continue to be a priority for the field.