Introduction

Mental illness affects 450 million people worldwide and is the leading cause of global ill-health and disability (Whiteford et al. 2013). The estimated lost productivity and global economic impact due to mental illness will be US$16.3 trillion in the next two decades (Patel et al. 2015), and life expectancy for this population is estimated at 10 to 25 years fewer than for people without mental illness (Whiteford et al. 2013; Mokdad et al. 2016; Walker et al. 2015). In the Canadian context, the widespread personal, social, and economic impacts of mental illness have long been documented (Mokdad et al. 2016; MacDonald et al. 2018; Health Canada, 2018; Forchuk et al. 2016; Munson and Jaccard 2018). In the last decade, mental health services in Canada have shifted focus to recovery-oriented and trauma-informed practice (The Mental Health Commission of Canada 2012). In 2016, the Mental Health Commission of Canada published guidelines for recovery-oriented practice, and personal recovery as an outcome soon became a priority for health and social services in Canada. Yet no outcome measure that is fit for purpose to capture the recovery outcomes of Canadians exists. Current mental health assessments, which focus on illness and symptoms, are not aligned with this shift in practice and are therefore inadequate to inform treatment in this new paradigm of care (Wand et al. 2020). As such, service providers and policy-makers face pressure to articulate how their strategic plans, care plans, and outcomes fit within a recovery-oriented framework (Kidd et al. 2014, 2010; Slade et al. 2014).

Historically, psychiatric rehabilitation has focused strongly on output measures (e.g., decreased emergency department visits, hospital days, and cost of care) rather than outcome measures (e.g., recovery, function, and quality of life) (Adepoju et al. 2018; Chiu et al. 2018; Saunders et al. 2018; Urbanoski et al. 2018; Kisely et al. 2015). This is evident in Canada, where nationwide routine care measurement for people with mental illness is not standardized and service use or hospital readmissions are used as indirect outcomes (Kisely et al. 2015; Kisely 2016). In addition, the outcomes of interest in psychiatric rehabilitation are often not directly observable (Barbic et al. 2019). Areas like quality of life, health perception, and wellbeing, which are of concern to both patients and health professionals, are direct, patient-level outcomes and are elicited from the patient. However, across the provinces in Canada, except for Ontario, the patient-reported outcome measures (PROMs) in use focus mainly on symptoms such as depression (Kroenke et al. 2010), anxiety (Spitzer et al. 2006), and general mental health (Canada Institute for Health Information. 2023). At the time of this study, the only recovery-focused tool used by provincial health authorities was the Ontario Common Assessment of Need (OCAN), which is used across community mental health settings in the province of Ontario. The OCAN is a measure of personal needs and is used to track and monitor the client’s recovery (Durbin et al. 2020). The measurement of personal needs is important, but needs are necessary rather than sufficient elements of recovery (Slade et al. 2005). A necessary cause must be present for an event to occur, whereas the presence of a sufficient cause is enough for the event to occur (Fayers et al. 1997).
While other recovery measures are in use, many were designed using Classical Test Theory (CTT) methods and have poor to moderate psychometric properties when tested with modern measurement approaches (Shanks et al. 2013). To date, only the Recovering Quality of Life (ReQoL) measure has been developed using item response theory, but it is a recovery-focused quality of life measure (Keetharuth et al. 2018). In addition, this tool was not developed with a Canadian sample.

Most existing recovery measures were not developed to inform clinical practice and guide care, and evidence supporting their clinical utility is limited (Barbic et al. 2018a; Barbic et al. 2015). Given the centrality of recovery as a desired outcome, conceptual clarity and practical quantification of this construct are of critical importance. One of the most used frameworks of personal recovery in mental health is the CHIME model (Bird et al. 2014). This framework has 13 identified characteristics of the recovery journey and describes five recovery stages (Leamy et al. 2011). This conceptual model depicts personal recovery as a reflective model of measurement, where the construct (personal recovery) is reflected by the items, indicating the effect of the construct (Fayers and Hand 1997; Edwards and Bagozzi 2000). Existing personal recovery measures that are mapped onto the CHIME framework include the Questionnaire about the Process of Recovery (QPR) (Neil et al. 2009) and the Recovery Assessment Scale (RAS). However, the QPR has not been tested using modern measurement approaches, and the RAS did not perform well under Rasch analysis (Barbic et al. 2015; Hancock et al. 2011). Therefore, we aim to develop a psychometrically sound, unidimensional measure that can aid clients in their recovery journey and support healthcare professionals in their treatment and care of people with mental illness.

Anchored in Canada’s Strategy for Patient-Oriented Research (SPOR) (Canadian Institutes for Health Research 2015; Canadian Institutes for Health Research 2016) and applied measurement methods (Pusic et al. 2009) guided by Classical Test Theory (Lord 1952) and Rasch Measurement Theory (Andrich 1988), the overall objective of this study was to develop a method of measuring recovery in community-dwelling Canadians with mental health concerns. The specific objectives were (1) to identify a set of items that best fit the Rasch model and form a hierarchically ordered set representing the recovery experience of people with mental health concerns, and (2) to validate the stability of the item hierarchy of this set of items in a sample of Canadians with mental health concerns.

Methods

This study was conducted in four sequential phases that were informed by guidelines for PROM development described by the Scientific Advisory Committee of the Medical Outcomes Trust (Aaronson et al. 2002), the US Food and Drug Administration (FDA) (Food et al. 2009), and the International Society for Quality of Life Research (Reeve et al. 2013). Our team aimed to develop a measure of personal recovery for adults aged 18 years and older who were accessing and receiving community mental health services. Figure 1 outlines the development phases of this recovery outcome measure and the methods used in each phase. In all phases, informed written consent was obtained from all participants. Phases 1, 2, and 3 of this study received ethical approval from [blinded for review] and Phase 4 received ethical approval from [blinded for review].

Fig. 1 Development Phases of the C-PROM

Phase 1: Qualitative Conceptualization of Recovery from the Perspective of People with Lived Experience

The objective of this phase was to generate important domains of recovery from people with mental health disorders.

Sample

Using an inquiry method of analysis, focus groups with community-dwelling individuals with a diagnosis of a mental illness were conducted. We aimed to conduct focus groups until saturation was reached. Purposive sampling was used for age and gender. Inclusion criteria were as follows: (1) aged 18 to 70; (2) diagnosed with a mental health condition; (3) having used or currently using community mental health services and outreach services; and (4) able to communicate in English.

Procedures

A content mapping exercise was conducted to guide the focus group processes (Stone et al. 1999). In keeping with the theoretical understanding of recovery as a continuum, focus group leaders gave participants a 30 cm ruler labelled with the term “recovery.” Participants were asked individually to “walk up the ruler” and to write down statements or sentences that described what it meant to go from low to high recovery. Participants were asked (1) individually, (2) in small groups, and (3) collectively to conceptualize a hierarchy of concepts that would cover the range of the “recovery” ruler. Example prompts were: What does the worst possible recovery look like? What does the best possible recovery look like? What does a person look like as they move from low to high recovery? All interviews were audio recorded and transcribed verbatim. One team member coded the individual, small group, and large group data line by line into domains, themes, and subthemes found to capture the range of the recovery ruler. A second team member validated the codes. Finally, peer debriefing and triangulation with participants were done to verify the results of the analysis.

Phase 2: Item and Scale Development

The objectives of this phase were to select items and the corresponding appropriate response scale for each domain of recovery.

Sample

We aimed to conduct a focus group interview for this phase. Using the same inclusion criteria and sampling method as Phase 1, new participants were recruited.

Procedures

From the results of the first phase, an item pool was generated by the research team. Participants were asked to pick from this item pool the items that matched the domains covering the range of the recovery ruler. We also proposed several response scale options to the participants (i.e., strongly agree to strongly disagree; most important to not important; % time). Once the preferred category response scale was selected, we asked participants to reword the items. The rationale for rewording items was based on past experiences with PROMs whose items often behaved aberrantly in mental health settings due to reverse scoring (Chang and Chan 1995; Hancock et al. 2015). Once the items were developed, we asked participants to work together to develop an a priori hypothesis about the ordering of the items that captured recovery (from easiest to hardest). These items formed the “candidate items” for the measure. The ordering of the items made up the “candidate measurement model”.

Phase 3: Prototype Measure Development

The objective of this phase was to assess the measurement structure of the prototype measure.

Sample

Participants were recruited from five community mental health centres in [blinded for review]. Individuals were eligible if they were receiving mental health services in the community, were 18 years or older, were able to speak English, and consented to this phase of the study. Advertisements were placed at recruitment sites, outlining dates that our research team would be present to provide information about the study. Interested individuals met with a trained researcher to review study objectives and provide written consent. We aimed to recruit 200 participants in the first step of this phase and 10 participants for the second step.

Procedures

There were two steps in this phase. The first step was to test the prototype measure in a representative population. The second step was to conduct cognitive interviews. For the first step, item hierarchy was examined using the Spearman rank test (Stone et al. 1999). Classical Test Theory (Lord 1952) and Rasch Measurement Theory (RMT) methods (Rasch 1961, 1980; Masters 1982) guided our analysis. Rasch Unidimensional Measurement Model software (Andrich et al. 2007; Rasch unidimensional measurement models software. 2007) was used to examine data for targeting, item fit, internal validity (i.e., model fit, item threshold ordering), reliability, dependency, and raw score to linearized measurement (Tennant et al. 2004). The detailed methods of this measurement application and approach have been described elsewhere (Barbic et al. 2014; Hobart et al. 2013; Klassen et al. 2016).
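For readers less familiar with RMT, the polytomous Rasch model can be sketched as follows. This is a generic textbook form in the partial credit parameterization (Masters 1982), offered as an illustration and not as a statement of the exact parameterization fitted in this study. The symbols are assumptions introduced here: \theta_n is the location of person n, \delta_{ik} is the k-th threshold of item i, and m_i is the highest response category of item i, all expressed in logits.

```latex
% Probability that person n responds in category x of item i.
% The empty sum (x = 0 or j = 0) is defined as 0.
P(X_{ni} = x) =
  \frac{\exp\left( \sum_{k=1}^{x} (\theta_n - \delta_{ik}) \right)}
       {\sum_{j=0}^{m_i} \exp\left( \sum_{k=1}^{j} (\theta_n - \delta_{ik}) \right)},
  \qquad x = 0, 1, \ldots, m_i .

% Ordered thresholds for item i mean
% \delta_{i1} < \delta_{i2} < \cdots < \delta_{i m_i}.
```

Under this form, a response scale that is working as intended corresponds to ordered thresholds for an item, and the raw total score is a sufficient statistic for the person location, which is what allows a raw score to be converted to a linearized measure.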

For the second step, participants were asked to provide feedback on the psychometric anomalies (item misfit, response options, etc.) found in field-testing and on the measure itself. We also engaged this participant group in a branding exercise to name the set of items for the new measure. We aimed to conduct interviews until no further changes were suggested. Phase 3 resulted in a 30-item measure, which we called the Canadian Personal Recovery Outcome Measure (C-PROM) and took forward to another round of field testing.

Phase 4: Field Testing the 30 Item C-PROM in a Community Sample

The objective of this phase was to contribute evidence towards the feasibility and usability of the C-PROM for community dwelling people with mental health disorders in Canada.

Procedures

We administered the C-PROM items to new participants with mental illness living in British Columbia, Canada. Eligibility criteria were the same as in Phase 3. The target sample was recruited from three community mental health centres in [blinded for review]. Study information was provided at the front desk of each centre and study staff were on site to review the study procedures if patients were interested. Written consent was provided and participants were sent a question package that included a demographic form, the C-PROM items, and six other measures. We examined the psychometric properties of the modified item set and measurement model underpinning the C-PROM using CTT and RMT methods, including evaluating scaling assumptions (legitimacy of summing items), reliability, and validity (Hobart et al. 2007). C-PROM data were also examined for quality (percent missing for each item), scaling assumptions, item fit, scale-to-sample targeting (score means; standard deviation (SD); floor and ceiling effects), internal consistency and reliability (Cronbach’s alpha), and correlation of logit and ordinal scale scores.
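Two of the CTT checks named above, internal consistency and floor and ceiling effects, can be illustrated with a minimal Python sketch. It is not the analysis code used in this study. The function names, the toy data, and the assumption that each item is scored 0 to 4 are hypothetical and are introduced only for illustration.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for complete responses (one list of item scores per person)."""
    k = len(rows[0])  # number of items
    # Sample variance of each item across respondents
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    # Sample variance of the summed total score
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def floor_ceiling(rows, min_score, max_score):
    """Proportion of respondents at the lowest and highest possible total score."""
    totals = [sum(row) for row in rows]
    n = len(totals)
    floor = sum(t == min_score for t in totals) / n
    ceiling = sum(t == max_score for t in totals) / n
    return floor, ceiling


# Hypothetical toy data: 5 respondents, 3 items, each scored 0-4 (assumed scoring)
responses = [
    [0, 1, 1],
    [2, 2, 3],
    [4, 3, 4],
    [1, 1, 0],
    [3, 4, 4],
]

alpha = cronbach_alpha(responses)
floor, ceiling = floor_ceiling(responses, min_score=0, max_score=12)
print(f"alpha = {alpha:.3f}, floor = {floor:.0%}, ceiling = {ceiling:.0%}")
```

The sketch assumes complete data for every respondent; in practice, respondents with missing items would need to be handled before scale scores are computed.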

Results

A total of 26 individuals participated in the focus groups across Phases 1 and 2. The mean age was 47.1 years, 18 were female, and all reported a diagnosis of schizophrenia. Two focus groups were conducted in Phase 1 and one focus group was conducted in Phase 2.

Phase 1: Qualitative Conceptualization of Recovery from the Perspective of People with Lived Experience

A total of 19 adults participated in this phase (n = 10, n = 9). There were nine females and 10 males. Individually, participants identified 120 components of recovery that they believed to be necessary on the ruler. These 120 components were categorized into subthemes and themes. Forty important domains were developed based on the themes and subthemes generated. To preserve the integrity of the ruler concept, participants asked that the subthemes be retained and not collapsed into themes and the different domains at this point. Figure 2 shows the subthemes generated by the focus groups and how they lined up against the ruler.

Fig. 2 Subthemes from the Phase 1 focus groups

Phase 2: Item and Scale Development

Seven participants agreed upon a five-point response scale that ranged from “none of the time” to “all of the time” without any numbers. Using this response scale, 40 candidate items were generated across the sub-themes and an item hierarchy was hypothesized. Table 1 shows the 40-item prototype measure and the hypothesized item hierarchy.

Table 1 The 40-item prototype measure

Phase 3: Mixed Methods Field-Testing of the Candidate Items and Measurement Model

A total of 228 individuals participated in the first step of this phase. As seen in Table 2, the mean age of participants was 45.8 years (SD = 12.5) and 50.7% identified as male. Preliminary results of this step have been previously published (Barbic et al. 2018b).

Table 2 Participant Demographics (n = 228) in Phase 3

Analysis of these items showed moderate overall fit to the Rasch model (χ2 = 480, df = 160, p = 0.001), high reliability (rp = 0.94), an ordered response scale structure for 38/40 items (item response scale working as intended), and no item bias for gender, age, or education. The hypothesized hierarchy held for 32/40 items, with items within ±4 rank placings deemed to be within an acceptable range for this phase of instrument development. As shown in Fig. 2, when examining item locations of fully completed measures (n = 199) and item response dependency, several redundant items were identified, specifically within the range of -0.5 to +0.5 logits (Table 3).
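The comparison between a hypothesized item order and estimated item locations can be illustrated with a minimal Python sketch. It is not the analysis code used in this study. The function names, the toy locations, and the way the rank tolerance is applied are assumptions for illustration; Spearman's rank correlation is computed here as the Pearson correlation of the ranks.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks (lowest value = rank 1), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value is tied with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def within_window(hypothesized_rank, locations, window=4):
    """Flag items whose empirical rank lies within +/- window of the hypothesized rank."""
    empirical = ranks(locations)
    return [abs(h - e) <= window for h, e in zip(hypothesized_rank, empirical)]


# Hypothetical toy data: hypothesized order (1 = easiest) and item locations in logits
hypothesized = [1, 2, 3, 4, 5, 6]
locations = [-1.8, -0.9, -1.1, 0.2, 1.5, 0.7]

rho = spearman(hypothesized, locations)
held = within_window(hypothesized, locations, window=4)
print(f"rho = {rho:.3f}, items within window = {sum(held)}/{len(held)}")
```

With real data, the hypothesized ranks would come from the a priori item hierarchy and the locations from the Rasch analysis output.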

Table 3 The hypothesized order of the 30-item C-PROM and the results of Phase 4

We conducted 10 cognitive debriefing interviews. Participants reported that the measure was too long and that the item response scale (all the time, most of the time, half of the time, some of the time, none of the time) would benefit from a numerical anchor to help participants interpret the meaning of each category. All participants agreed that everyone had 24 h in a day and therefore the use of time to explain each response option was appropriate. To address their comments, the team decided on a shortened 30-item measure with a new category response scale (with both quantitative and qualitative descriptors). The new response scale is shown in Fig. 3, and the hypothesized item hierarchy for Phase 4 is listed in Table 3.

Fig. 3 Final response scale

Phase 4: Field Testing the 30 Item C-PROM in a Community Sample

In this phase, 575 individuals completed the C-PROM, and 75 participants completed it at two timepoints (6-month interval). The mean age of participants was 45.7 years (SD = 10.3) and 60% were male. Missing data across items ranged from < 1% to 1.5%. Scale scores were computable for 95% of respondents (n = 546/575). As shown in Fig. 4, scale scores spanned the range of the scale and were not notably skewed. The correlation between logit scores and ordinal scale scores was high (0.92). We did not observe any ceiling or floor effects. Items showed good overall fit to the Rasch model (χ2 = 163, df = 130, p = 0.05), high reliability (rp = 0.96), an ordered response scale structure for 29/30 items (see Fig. 5), and no item bias for gender, age, or diagnosis. The hypothesized hierarchy held for 30/30 items and mapped back to the a priori hypothesized recovery continuum, with the expected order of item difficulty capturing a measurement distribution of low to high. The item response options for all 30 items were ordered as expected. Figure 5 shows the distribution of participants (pink) along the measurement continuum, ranging from -2.5 to 2.3 and reflecting a broad spread; it also shows the item thresholds, which ranged from -2.5 to +2.1 logits. The overall person targeting of the sample to the items was good. Overall item fit was good (residual < ±2.5 for 29/30 items), and reliability was high (rp = 0.95). For those who completed the measure at two time points, the test–retest reliability was good (rp = 0.82). Two items did not have good fit: one related to “intimate relationships” and one related to “enough money to meet basic needs”. However, both items were viewed as conceptually critical in our focus groups. As a result, we decided to retain both items in the measure and to monitor their behavior closely in future testing.

Fig. 4 Phase 3 results of the Person-item threshold distribution of the 30 item Canadian Personal Recovery Outcome Measure (C-PROM). * The distribution on top (pink) represents the distribution of the sample, moving from approximately -3.5 logits (low recovery) to 5.5 logits (high recovery). Beneath the horizontal axis is the distribution of the items

Fig. 5 Phase 4 results of the Person-item threshold distribution of the 30 item Canadian Personal Recovery Outcome Measure (C-PROM)

Discussion

Personal recovery in mental health, as defined in the literature, is the process of living a meaningful life and performing valued social roles even when symptoms are present (Leendertse et al. 2021; Slade 2009). Hence, recovery-oriented practice in mental health is anchored in person-centeredness and the involvement of the individual. However, the implementation of recovery-oriented approaches poses a significant challenge for health professionals when it is difficult to discern the precise stage of an individual's recovery journey (Waldemar et al. 2016). This calls for a measure of recovery that can directly guide and enhance the provision of care. In this study, we have outlined the steps to develop a PROM that is conceptually driven, psychometrically robust, and fit for purpose to directly inform care. The result is a set of items called the Canadian Personal Recovery Outcome Measure (C-PROM) that can produce a total score that (a) is interpretable, (b) demonstrates the property of invariance, and (c) is fit for purpose for the Canadian community mental health context.

Development of the C-PROM followed best-practice measurement guidelines as outlined by the Scientific Advisory Committee of the Medical Outcomes Trust (Aaronson et al. 2002), the FDA guidance (Food et al. 2009; Patrick et al. 2007), and the International Society for Quality of Life Research (Reeve et al. 2013). Significant efforts were made to actively involve all end-users throughout the development process of the C-PROM. In this study, for the initial generation of domains and response options, three focus groups (n = 26) were conducted and a total of 40 domains emerged. These domains ranged from basic needs (“feeling safe”) to social support (“support from loved ones”) and community involvement (“feeling part of my community”), covering all of the CHIME recovery processes (Leamy et al. 2011).

Testing guided by Classical Test Theory and Rasch Measurement Theory confirmed the presence of a 30-item scale that is reliable, unidimensional, and based on an a priori hierarchical model. The FDA guidance states that total scores from measures with multiple domains should be supported by evidence that the total score represents the concept of interest. The use of RMT and patient engagement in the development of this measure aligns with this guidance. The Rasch analysis did identify “intimate relationships” and “money” as misfitting items, but we decided to retain them based on the results of the cognitive interviews and the focus groups. Forming intimate relationships is not only a basic human need but also a core emotional regulation strategy that is associated with resilience (Bryant 2016). Having financial security is a basic need that was reiterated by our participants. In addition, the correlation between the original logit scale and the ordinal scale was 0.92, indicating that the final scoring scale did represent the concept of interest (logits) and that a simple scoring system can be used. The higher the C-PROM score, the better the recovery. From a patient and clinician perspective, this result ensures that the total score from the C-PROM is meaningful and outlines how a person moves up and down the recovery ladder (variable of interest), a fundamental prerequisite of metrology (Stenner et al. 2013).

Our results also showed that the items were stable across different age and gender groups. Recognizing the complexity of the patient population and the construct of interest under investigation, we expected differential item functioning (DIF) to be present. However, in this preliminary testing, the set of 30 items in the C-PROM was stable across age and gender. Further, the C-PROM showed minimal floor and ceiling effects. This information is critical to support the potential for the C-PROM to be used to evaluate the effectiveness of treatments across services and organizations over time (Massof and Stelmack 2013; Browne et al. 2017; Cohen et al. 2016; McClimans et al. 2017).

As noted earlier in this paper, in the 1960s Georg Rasch demonstrated that the stringent criteria for measurement used in the physical sciences could be applied in education and psychology (Rasch 1961, 1980). This study demonstrates early possibilities for applying this methodology in mental health to measure latent outcomes such as recovery. The application of both CTT and RMT methods in health is not new (Hobart et al. 2007; Cano et al. 2011; Mayhew et al. 2011; Baker et al. 2011; Barbic et al. 2013; Waugh 2003; Amarneh 2003). Through its fundamental properties of invariant comparisons of people and items on one linear scale, this approach provides an ideal infrastructure and metrological framework to guide health researchers in developing and testing scales that are fit for purpose for clinical practice (Rasch 1961, 1980; Wright 1977). Unlike other recovery measures, the C-PROM allows clinicians to monitor a client’s recovery process and recommend strategies for next steps in treatment based on the recovery ladder, and it gives patients a tool to self-monitor their own recovery.

Limitations

Our approach had the following limitations. First, the sample available in our study consisted of people diagnosed with schizophrenia. As such, we did not test for DIF across different diagnoses. In addition, our study did not involve a non-clinical sample for comparison. Test–retest reliability was assessed in 75 participants; although our findings show evidence for invariant measurement, further work is needed on test–retest reliability and with patients in rural communities, inpatient clinical settings, and non-clinical samples to understand the generalizability of these findings. This study also did not include a comparison between the C-PROM and other personal recovery measures. Future research is needed on whether the C-PROM is fit for purpose in other international settings, such as the United States, and on how the C-PROM compares to other existing measures.

Implications and Conclusion

Our study provides preliminary evidence for the C-PROM as an instrument for the quantification of recovery for adults with mental illness living in the community. This study also contributes evidence that, in the context of mental health, recovery is a continuum that starts from basic safety and security needs. Very often, the measurement of recovery and treatment in mental health is focused on symptom alleviation. However, when we recognize that recovery is a continuum, more attention can be paid to the more favorable end of the continuum, and health services can be focused on moving people towards health and quality of life (Breslow 1972). We hope that the use of the C-PROM will assist health professionals in tailoring their approach to match the client’s recovery journey.