Evidence-based practices (EBPs) require reliable and valid instruments to assess fidelity (Bond et al. 2011; Martinez et al. 2014; McHugo et al. 2007). Fidelity to interventions, defined as the degree to which an implementer follows the intervention as specified (Cross and West 2011), is one critical implementation outcome (Proctor et al. 2011).

Illness Management and Recovery (IMR) is a standardized psychosocial intervention designed to help people with serious mental illnesses manage their illness and achieve personal recovery goals (Mueser et al. 2006). Five strategies form the basis of the IMR program: psychoeducation to improve knowledge of mental illness, relapse prevention to reduce relapses and hospitalizations, behavioural training to improve medication adherence, coping skills training to reduce the severity and distress of persistent symptoms, and social training to strengthen social support. The practitioners teach these strategies using a combination of educational, motivational, and cognitive-behavioural techniques, following an accompanying workbook with educational handouts in weekly sessions over 10–12 months, either individually or in groups. The IMR program has spread world-wide (Egeland et al. 2017; Garber-Epstein et al. 2013; Pratt et al. 2011; Roosenschoon et al. 2016), including strong endorsement in Sweden (The National Board of Health and Welfare 2017). A 2014 review concluded that IMR had superior outcomes to treatment as usual, according to observer ratings of psychiatric symptoms, as well as patient and practitioner ratings (McGuire et al. 2014).

The Illness Management and Recovery Fidelity Scale (IMR fidelity) (McHugo et al. 2007) assesses the implementation of specific strategies within the IMR program together with structural and curriculum-based elements, with each item rated on a behaviorally anchored continuum from 1 = no fidelity to 5 = excellent fidelity. A summed and averaged fidelity score of 4.0 or higher or more defines good fidelity, 3.0–4.0 as fair fidelity, and less than 3 as an absence of fidelity (Bond et al. 2009; McHugo et al. 2007). Psychometric assessment of the scale has been limited. One study demonstrated high inter-rater reliability (ICC = .97) (McHugo et al. 2007), and two studies found sensitivity to change following training and consultation (McHugo et al. 2007; Salyers et al. 2009). Nevertheless, no published study has reported a comprehensive psychometric assessment of the IMR Fidelity scale.

This study examined the psychometric properties of the Illness Management and Recovery (IMR) Fidelity scale, including item analysis, interrater reliability, interrater item agreement, internal consistency, and sensitivity to change.



As part of a large implementation study (ClinicalTrials NCT03271242), the research team invited mental health clinics providing treatment for psychosis disorders throughout Norway to participate in a study of implementing evidence-based practices. Eleven sites from six of the 19 health trusts in Norway agreed to implement IMR and received intensive technical assistance in implementing IMR. The current paper reports the findings of a secondary data analysis of IMR fidelity assessments at these 11 sites. Prior to the study initiation, none of the sites were providing IMR. All sites committed to adopting IMR and following the program model and practice manual (Gingerich and Mueser 2011). The Regional committees for medical and health research ethics (REK 2015/2169) approved the study, which followed the principles in the Declaration of Helsinki.

Study Sites

Eight of the 11 mental health clinics were community mental health centers, one was a combined inpatient and outpatient clinic for young adults with psychosis and drug abuse problems, one was an outpatient clinic for children and adolescents, and one was an inpatient clinic for adolescents. The latter sites enrolled youth aged 16 years and older in the IMR program. The participating clinics represented both urban and rural areas.


Each clinic received intensive technical assistance in IMR over 12 months. The technical assistance included 4 days of IMR training with a professional trainer, followed by 30-min weekly group supervision by phone for 6 months, and then every other week for another 6 months.

Each site received a fidelity assessment at baseline, and after 6, 12, and 18 months. A pair of fidelity assessors, independent from the clinical staff delivering IMR, conducted each fidelity assessment. The fidelity assessors varied across sites and assessment periods. A group of 17 researchers (psychologists, psychiatrists, nurses and other health professionals) served as the assessors. All received specific training on IMR fidelity assessment. A senior researcher served as a quality control monitor, reviewing all the fidelity assessments.

The assessors conducted full-day site visits, using an integration of four sources of information: (a) semi-structured interviews with the site leader, (b) semi-structured group interviews with the practitioners that facilitated IMR, (c) progress notes on the patients’ goals and steps towards the goals that were filled out by the practitioners prior to the site visit, and (d) handouts and written materials on the patients’ progress. The two assessors rated each program independently and then compared ratings, resolving discrepancies through discussion in order to reach consensus.


The Illness Management and Recovery Fidelity scale (McHugo et al. 2007) assesses the implementation of specific strategies within the IMR program, such as goal setting and follow-up, motivational techniques, educational techniques, cognitive-behavioral techniques, coping skills training, relapse prevention training, and behavioral tailoring for medication. It also assesses structural and curriculum-based elements, including the number of people in a group, the number of sessions held, the content modules covered, provision of educational handouts, and involvement of significant others (see “Appendix” section). The scale consists of 13 items, with each scored on a 5-point scale (from one indicating no implementation and five indicating full implementation).

A Norwegian translation agency translated the IMR Fidelity scale into Norwegian, in conjunction with the translation of the IMR manual (Egeland 2018). Two of the authors (KME and KSH) reviewed the translation in detail, repeatedly comparing it with the original version. A prior implementation project tested the translated version (Egeland et al. 2017).

Data Analyses

We examined agreement between assessors at the item level by percentage of exact agreement between pairs of assessors. We also examined mean agreement at each time period and across items for each of four time periods.

We calculated each assessor’s total fidelity score for each site, defined as the sum of the item ratings divided by the number of items (i.e., 13). To evaluate interrater reliability of the site fidelity ratings, we used the intraclass correlation coefficient (ICC) (McGraw and Wong 1996), based on a one-way random effects analysis of variance model for agreement between the two fidelity assessors on the IMR Fidelity scale. A single ICC was computed, combining paired ratings across all assessment points.

After assessing interrater agreement and reliability, we used consensus ratings in all subsequent analyses. To estimate internal consistency of the IMR scale, we used Cronbach’s alpha, calculating an alpha coefficient for each time period.

We next examined the item distributions at 18 months, examining mean, standard deviations, and distribution of scores across sites for full (rating = 5), adequate (4), and poor (1–3) fidelity. We also examined the distribution of site scores at 18 months.

Finally, we examined the longitudinal pattern of fidelity graphically and statistically. We examined sensitivity for change over time in IMR fidelity using a one-way ANOVA repeated measures design with pairwise post hoc tests with Bonferroni correction for changes between baseline and each of the three follow-up assessments. Change over time was estimated by calculating the standardized mean difference effect size (Cohen’s dz) for within-subjects design (Lakens 2013). All data analyses were done using SPSS for Windows version 25 (


Agreement Between Assessors on Individual Items

Over all items and time periods, exact agreement on items was very high, averaging 94% (see Table 2 in the “Appendix” section). The mean exact agreement declined from 99% at baseline to 90–93% thereafter. (High agreement at baseline was due to lack of IMR implementation and ratings of one.) At the item level, mean agreement on all four fidelity reviews on Item 13 (behavioural tailoring) was 82% and on Item 7 (goal follow-up) was 87%, while mean agreement on all other items exceeded 90%.

Interrater Reliability

Two fidelity assessors rated the IMR Fidelity scales on four occasions at each of the 11 participating sites. We aggregated paired ratings across four time periods to estimate interrater reliability for the 44 assessments (100% completion rate). The intraclass correlation measuring interrater reliability (assuming two assessors) was .99, indicating a very high degree of agreement. In all subsequent analyses, we report the findings based on consensus ratings.

Internal Consistency

After baseline, internal consistency (Cronbach’s alpha) was excellent: undefined (baseline), .91 (6 months), .94 (12 months), and .95 (18 months), suggesting that the 13 items comprising the IMR Fidelity scale are measuring a unitary construct. Internal consistency at baseline could not be calculated because nearly all items were rated 1 at all sites.

Item Analysis

As shown in Table 1, the item means for the 11 sites at 18 months ranged from 4.18 (Item 2: Program Length and Item 5: Involvement of Significant Others) to 4.82 (Item 12: Relapse Prevention Training). Notably, all of the items reached an average score exceeding 4.0, which is the benchmark for good fidelity. By contrast, at baseline, all mean item ratings were 1.60 or less. Thus, fidelity assessors used the entire rating scale from 1 to 5 for all 13 items, with no evidence of restriction of range.

Table 1 Item distributions on the IMR Fidelity scale at 18 months (N = 11 sites)

Changes over Time

We inspected the longitudinal pattern of changes graphically across the 18-month period for the 11 sites, as shown in Fig. 1. The mean improvement was sharp between baseline and 6 months, increasing from 1.01 to 3.61, and reaching good fidelity at 12 months (benchmark = 4.0) at 12 months and continuing to increase at 18 months (4.02 and 4.50). The change in IMR fidelity over time was highly significant, F (1, 10) = 148.69, p= .00. Post-hoc t tests comparing baseline fidelity ratings to 6, 12, and 18 months confirmed sensitivity to change, with t values of 7.45, at 6 months, 8.26 at 12 months, and 12.10 at 18 months all significant at p < .001. The standardized mean difference effect size (Cohen’s dz) was 3.65.

Fig. 1
figure 1

Development of IMR fidelity from baseline to 18 months

We also examined change over time looking at the percentage of sites attaining good fidelity (4.0 or higher) at each time period. At baseline, none of the sites had any IMR services whatsoever (10 sites rated 1.0 and one site rated 1.1). The number and percentage of sites attaining good fidelity were 6 (55%) at 6 months, 8 (73%) at 12 months, and 10 (91%) at 18 months. Moreover, the number and percentage attaining very good fidelity (4.5 or higher) were 2 (18%) at 6 months, 6 (55%) at 12 months, and 8 (73%) at 18 months. In summary, most sites attained good fidelity by 6 months and very good fidelity by 12 months.


Overall, the psychometric properties of the IMR Fidelity scale were excellent, with very strong interrater reliability and a high degree of agreement between assessors and very good internal consistency at all three follow-up assessments. The scale was sensitive to change, and the entire rating scale from 1 to 5 was used.

High agreement in the overall decision between the assessors indicates that the scale items are easy to understand and to agree on. Using the entire rating scale from 1 to 5 for all 13 items indicates no restriction of range and being sensitive to change. Using the scale in clinics to document improvement may reinforce good clinical practice. It can also inform the need for specific training in different areas. Because fidelity monitoring leads to understanding and sustainment of practices in the clinics (Bond et al. 2009), the findings reinforce wide use of the IMR Fidelity scale for clinical purposes as well as in research.

In general, research on the psychometric properties of Fidelity scales is lacking (Martinez et al. 2014). This study therefore responds to a strong need. Other Fidelity scales should receive similar psychometric attention.

Our findings identified two items with lower (still adequate) agreement: item seven (IMR goal follow-up) and item 13 (behavioral tailoring for medication). Improving agreement on these items would require written documentation and interviews with patients (Bond et al. 2009).

Seven months after the completion of the formal study, fidelity assessors completed a survey on their experiences using the IMR Fidelity scale. Overall, assessors reported some challenges in finding the relevant data but few other difficulties. They reported that the scale was easy to score and had clear instructions. The assessors perceived that the interviews with practitioners provided the most useful source of information. Interviews with leaders and progress notes were less helpful. Nevertheless, using multiple sources (triangulation) enhances validity.

Although several studies have used the IMR Fidelity scale to measure fidelity, the current study is the first to examine psychometric properties thoroughly. Nonetheless, some limitations deserve mention. The fidelity assessments neither included interviews with patients nor observation of IMR sessions. We have not yet assessed predictive validity, the strongest evidence for utility of a Fidelity scale. Although some studies have shown that core principles predict outcomes (Bartholomew and Kensler 2010; Hasson-Ohayon et al. 2007; McGuire et al. 2012), no published study has thus far examined the predictive validity of the IMR Fidelity scale. Experts recommend regular fidelity monitoring (Bond et al. 2009), which is always difficult to implement (Bond et al. 2014; Egeland et al. 2017; Rychener et al. 2009). The widespread use of Fidelity scales awaits electronic health records designed to facilitate quality measurement.


The IMR Fidelity scale coheres well, including excellent interrater reliability, internal consistency, sensitivity to change and use of the full scale. Our study supports its use for clinical and research purposes. Other Fidelity scales need similar psychometric evaluations. Widespread use of Fidelity scales will require electronic health records designed to facilitate quality measurement.