Key Points for Decision Makers

Compared with the EQ-5D, the SF-6D can be self-reported to better capture health-related quality-of-life aspects of schizophrenia for the purpose of economic evaluation.

The SQLS can be self-reported to capture complementary information relative to clinician-reported measures.

A more mental health-focused preference-based measure that can capture the negative symptoms of schizophrenia and that can be used for the purpose of economic analysis is still desirable for healthcare-related decision making.

1 Background

Schizophrenia is a mental health disorder characterized by a range of different psychological impacts, including changes in thinking and behavior. The health-related quality-of-life (HRQoL) and social burden of schizophrenia is large and affects both patients and their caregivers, for example, through their social and financial situation [1, 2]. Many outcome measures have been designed to assess the burden of schizophrenia. These measures may have different conceptual perspectives that can be condition specific, such as the Schizophrenia Quality-of-Life Scale (SQLS) [3], designed to capture schizophrenia-specific QoL aspects, or generic, such as the Warwick-Edinburgh Mental Wellbeing Scale (WEMWBS) [4, 5], designed to capture broader outcomes and be applicable to more than one mental health condition. The WEMWBS has been assessed in individuals with schizophrenia but only as part of a mixed-diagnosis group [6], so it is unclear whether it is an appropriate measure for this population. Measures may also be categorized according to whether or not they are preference based; examples of preference-based measures include the three-level EuroQol Five-Dimension (EQ-5D) [7, 8] and the Short-Form Six-Dimension (SF-6D) [9]. These preference-based measures are used to form a profile score that is converted into a preference-based index score (usually based on societal preferences) and thus allow economic evaluation of interventions using cost-utility analysis (CUA) to inform the allocation of resources by healthcare-governing agencies such as the UK National Institute for Health and Care Excellence (NICE) [10]. In CUA, QoL measured on a preference-based scale anchored at 0 (dead) to 1 (full health) is combined with length of life to generate quality-adjusted life-years (QALYs), allowing comparisons between interventions that affect quantity of life and/or QoL.
However, while there can be “some confidence” in the use of generic measures (e.g., the SF-6D and EQ-5D) in patients with mood and anxiety disorders because of their demonstrated psychometric validity and responsiveness [11], this is not the case with schizophrenia. In patients with schizophrenia, the data are less conclusive [11,12,13], and it has been argued that preference-based measures focused on the impact of the mental disorder (rather than generic measures covering both physical and mental health) should instead be considered [11].

The SQLS, WEMWBS, EQ-5D, and SF-6D are all self-reported measures. Evidence suggests that, in people with schizophrenia, self-report should be interpreted with caution as condition-related aspects such as cognitive impairment may impair reliability [14]. Measures completed by observers (e.g., clinicians), such as the Positive and Negative Syndrome Scale (PANSS) [15] and Clinical Global Impression-Schizophrenia Change (CGI-SCH) [16], have been developed to aid assessment of the burden of schizophrenia.

Each measure differs in its features and constructs, and at times there is value in using more than one measure when the choice of measure(s) is based on the purpose of assessment [17, 18]. When using more than one measure, the relationship between the measures should be considered, for instance, including a clinician-reported measure alongside patient-reported measures [19].

The overall aim of this analysis was to perform an exploratory assessment of the construct validity of patient-reported measures in the context of patients with schizophrenia to assess their potential utility when reflecting the burden of schizophrenia alongside clinician-reported measures.

2 Methods

2.1 Data

Data were drawn from a multicenter, non-interventional, cross-sectional survey designed to assess the burden of illness of people with persistent symptoms of schizophrenia [20] (ClinicalTrials.gov identifier: NCT01634542). The dataset included adult patients in the UK who had persistent schizophrenia symptoms despite receiving adequately dosed antipsychotic treatment and who had not had an acute exacerbation in the last 3 months. Respondents and clinicians completed questionnaires at their usual clinic visits.

2.2 Outcome Measures

2.2.1 Patient-Reported Measures

Patients completed the SQLS, a 30-item self-reported measure of schizophrenia-related QoL on a five-point item scale from 0 (best state) to 4 (worst state), with three subscales: (1) Psychosocial (15 items); (2) Motivation (and energy; seven items); and (3) Symptoms (and side effects; eight items). Each subscale has a range from 0 (best possible state) to 100 (worst possible state), whereby the scoring and rescaling for each subscale follows the methods described by Wilkinson et al. [3]: the scale score (SS) equals the total of the raw scores of each item in the scale (RStot) divided by the maximum possible score of all the items in the scale (RSmax), all multiplied by 100; i.e., SS = (RStot/RSmax) × 100. They also completed the WEMWBS, a 14-item scale of positive well-being and psychological functioning [21] with scores ranging from 14 (worst state) to 70 (best state). Two generic HRQoL preference-based measures were included: EQ-5D and SF-6D (the latter derived from the SF-12 version 2 [9]). The EQ-5D has five dimensions with three severity levels, describing 243 health states [7], and a preference-based score ranging from 1 (best state) to −0.594 (i.e., 0 is equivalent to a state of dead, and negative values are equivalent to states valued as worse than dead) [8]. The SF-6D has six dimensions, with between four and six response levels, and a preference-based score ranging from 1 (perfect health) to 0.345 (worst state) [9].
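The rescaling formula SS = (RStot/RSmax) × 100 can be illustrated with a minimal Python sketch. The function name and the example responses are hypothetical and are not study data; the sketch assumes every item is scored 0 to 4 in the same direction and that no items are missing, so any reverse-scoring or missing-data rules in the published scoring method are not handled here.

```python
def scale_score(item_scores, max_item_score=4):
    """Rescale raw item scores to 0 (best) .. 100 (worst): SS = (RStot / RSmax) * 100."""
    if not item_scores:
        raise ValueError("at least one item score is required")
    if any(s < 0 or s > max_item_score for s in item_scores):
        raise ValueError("item scores must lie between 0 and max_item_score")
    rs_tot = sum(item_scores)                      # total of the raw item scores
    rs_max = max_item_score * len(item_scores)     # maximum possible total
    return 100.0 * rs_tot / rs_max


# Hypothetical responses to the seven Motivation items (illustration only).
motivation_items = [2, 1, 3, 0, 2, 4, 2]
print(scale_score(motivation_items))  # 14 / 28 * 100 = 50.0
```

Because each subscale is rescaled by its own maximum, subscales with different numbers of items share the same 0 to 100 range.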

2.2.2 Clinician-Reported Measures

Clinician-reported measures included the PANSS, a 30-item, 7-point schizophrenia syndrome scale from 1 (absent) to 7 (extreme), which can be summed across or within three subscales: positive (seven items, score range 7–49), negative (seven items, score range 7–49), and general psychopathology (16 items, score range 16–112), giving a total score range of 30–210 [15]. The CGI-SCH was used to assess the severity of schizophrenia (five items) in positive, negative, cognition, depression, and overall symptoms [16]. All items are rated on a 7-point scale, from 1 (normal/not ill) to 7 (among the most severely ill), with no overall score produced [16]. Negative symptoms associated with schizophrenia were assessed using the Negative Symptom Assessment (NSA-4) [22] 7-point global rating scale (negative symptoms over the last 7 days). Finally, clinicians also completed the Health of the Nation Outcome Scales for Payment by Results (HoNOS-PbR), designed by UK psychiatrists for use in routine clinical practice as a record of a patient’s progress across 12 items [23]: behavior (three items), function (two items), symptoms (three items), and social (four items); a total score can be derived via summation of the 12 items.

2.3 Exploratory Validity Analysis

This analysis is considered exploratory as no a priori assumptions were made about the relationship between the clinician- and patient-reported measures; rather, the relationships identified will inform a discussion about the level of association between these measures, their uses in measuring the burden of schizophrenia, and areas for future research in a patient population with schizophrenia. Therefore, to explore the relationship of the patient-reported HRQoL measures in patients with schizophrenia versus clinician-reported measures, floor and ceiling effects and construct validity were explored.

Floor and ceiling effects are important to assess as, within cross-sectional studies, they can be used as a proxy of sensitivity in relation to how well a measure can detect changes in HRQoL. For example, if a large proportion of the sample is at the floor (a score representing the highest level of symptoms or poor functioning) or ceiling (a score representing no problems), this impairs the ability (sensitivity) of the measure to pick up decreases or increases in HRQoL, respectively. The presence of floor and ceiling effects and data distribution (histograms are presented in Appendix S1 in the Electronic Supplementary Material [ESM]) were used to select the appropriate statistical tests to assess construct validity; that is, if the data were non-normally distributed and affected by floor and ceiling effects, nonparametric tests were preferred to parametric tests.
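As a minimal sketch of how floor and ceiling effects can be quantified, the Python function below returns the proportion of respondents at the worst and best possible scores of a measure. The function name and the EQ-5D values are hypothetical and are not study data, and no threshold for what counts as a problematic proportion is assumed.

```python
def floor_ceiling_proportions(scores, worst, best):
    """Return (proportion at floor, proportion at ceiling) for one measure.

    Floor = the score representing the worst state; ceiling = the score
    representing no problems, following the definitions in the text.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("at least one score is required")
    at_floor = sum(1 for s in scores if s == worst) / n
    at_ceiling = sum(1 for s in scores if s == best) / n
    return at_floor, at_ceiling


# Hypothetical EQ-5D index values (illustration only).
eq5d = [1.0, 1.0, 0.76, 0.52, 1.0, -0.594, 0.69, 0.85, 1.0, 0.62]
floor, ceiling = floor_ceiling_proportions(eq5d, worst=-0.594, best=1.0)
print(f"floor {floor:.0%}, ceiling {ceiling:.0%}")  # floor 10%, ceiling 40%
```

For measures where a higher score is worse (e.g., the SQLS subscales), `worst` and `best` are simply swapped to 100 and 0.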

Construct validity refers to the extent to which an instrument measures what it is intended to measure (e.g., HRQoL) compared with other indicators. To do this, a “gold standard” is required against which to assess the measures of interest. However, because no gold standard measure covering the full complex nature of “mental health” or schizophrenia exists, we used a range of clinician-reported measures as indicators. Two related types of construct validity were undertaken: convergent and known-group.

Convergent validity was used to assess the relationship strength between measures. This can be done using correlation analysis and locally weighted scatterplot smoothing (LOWESS) techniques. Here, correlation analysis indicates the degree to which the instruments measure related factors at the overall, dimension, or item level. Correlations were considered weak if the absolute coefficient was ≤ 0.3, moderate if > 0.3 and < 0.5, and strong if ≥ 0.5 [24]; statistical significance was defined at the 5% threshold level. Spearman’s rank correlation coefficient was used as a nonparametric test based on the score distribution across the measures (see Sects. 3.1 and 3.2). LOWESS is a form of nonparametric regression that plots a line of central tendency between two variables on a scatterplot, thereby visualizing the relationship across the possible score ranges. LOWESS captures general patterns in the relationship between two measures without making assumptions about their actual relationship.
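The correlation step can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the analysis code used in the study: the function names and the example scores are hypothetical, Spearman’s coefficient is computed as Pearson’s correlation on average ranks (so ties are handled), no p value is calculated, and the strength labels apply the thresholds above to the absolute coefficient.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    return pearson(average_ranks(x), average_ranks(y))


def correlation_strength(rho):
    """Label the absolute coefficient: weak <= 0.3, moderate < 0.5, strong >= 0.5."""
    r = abs(rho)
    if r <= 0.3:
        return "weak"
    if r < 0.5:
        return "moderate"
    return "strong"


# Hypothetical SF-6D index values and PANSS total scores (illustration only).
sf6d = [0.80, 0.65, 0.72, 0.55, 0.90, 0.60]
panss_total = [45, 70, 62, 85, 50, 66]
rho = spearman_rho(sf6d, panss_total)
print(round(rho, 3), correlation_strength(rho))  # -0.886 strong
```

The negative sign in the example reflects the opposite scoring directions of the two measures (higher SF-6D is better, higher PANSS is worse), which is why strength is judged on the absolute value.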

Known-group validity assesses the extent to which scores on an instrument differ across groups where they are expected to differ (e.g., clinical severity indicators). This can be measured by calculating effect sizes between groups, which provide a standardized indicator of the size of the difference; here, each effect size was calculated as the difference in mean scores between two adjacent severity subgroups divided by the standard deviation of scores for the milder of the two subgroups. Cohen’s d was used to calculate standardized effect sizes, and the p value was calculated from the F statistic. Effect sizes ≤ 0.5 were considered small, > 0.5 and < 0.8 moderate, and ≥ 0.8 large [24]; statistical significance was defined at the 5% threshold level. For this assessment, a focused literature search was performed to identify already established clinically meaningful severity cut-off points for the clinical measures. If cut-off points could not be identified, ad-hoc cut-offs were established based on the score format and distribution of the measure to inform this exploratory analysis; that is, the authors assessed the score format (e.g., if based on a continuous or categorical scale) then identified what proportion of patients were distributed across this score range, whereby an ad-hoc cut-off was based on establishing an equal proportion of people between two or more groups such that enough variation existed in each group that a meaningful known-group effect size could be identified if one existed. Ad-hoc cut-offs were used for the PANSS, CGI-SCH, and the NSA-4, whereas established cut-offs from the literature were used for the HoNOS-PbR [25]. It should be noted that established cut-offs for the PANSS were identified from the literature, but were based on a percentage change in scores over time (i.e., requiring data to be collected from at least two time points) and therefore could not be used for this cross-sectional study [26].
These cut-offs are reported in the results, and the implications of using ad-hoc cut-offs are included in the discussion.
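The effect size calculation described for the known-group assessment can be sketched in Python as follows. The function names and the example scores are hypothetical and are not study data; the use of the sample standard deviation (n − 1 denominator) is an assumption, as the text does not specify it, and the p value from the F statistic is not computed here.

```python
from statistics import mean, stdev


def adjacent_group_effect_size(milder, more_severe):
    """Difference in mean scores between two adjacent severity groups,
    divided by the standard deviation of the milder group."""
    if len(milder) < 2 or not more_severe:
        raise ValueError("milder needs at least 2 scores; more_severe at least 1")
    # Assumption: sample SD (n - 1 denominator) of the milder group.
    return (mean(milder) - mean(more_severe)) / stdev(milder)


def effect_size_label(effect_size):
    """Label the absolute effect size: small <= 0.5, moderate < 0.8, large >= 0.8."""
    size = abs(effect_size)
    if size <= 0.5:
        return "small"
    if size < 0.8:
        return "moderate"
    return "large"


# Hypothetical SF-6D scores for two adjacent clinician-rated severity groups.
mild = [0.80, 0.70, 0.75, 0.65, 0.85]
moderate_to_severe = [0.70, 0.60, 0.72, 0.66, 0.67]
es = adjacent_group_effect_size(mild, moderate_to_severe)
print(f"{es:.2f} {effect_size_label(es)}")  # 1.01 large
```

Standardizing by the milder group's standard deviation keeps the unit of comparison fixed when more than two adjacent severity groups are compared in sequence.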

We assumed clinician-reported measures represent the “true state” of patients in the absence of any inherent “gold standard” for this assessment. Therefore, a moderate to strong/large correlation and known-group effect size between the clinician-reported and patient-reported measures suggests construct validity of the patient-reported measures. Convergent validity between different patient-reported measures may be used as an estimate of the coherence and consistency of patient report and thus the potential impact of invalid or random responses (e.g., due to schizophrenia-related cognitive impairment [14]) on the reliability of the overall construct validity assessment of the measures in this population.

Across the head-to-head assessments, the frequency of producing the strongest absolute correlation strength (ACS), largest absolute effect size (AES), and statistically significant results was used to determine the best overall construct validity between patient-reported measures. Evidence to suggest the existence of floor and ceiling effects, or invalid or random responses (i.e., evidence of poor convergent validity between patient-reported measures), was used to inform the suggested reliability of these construct validity results.

3 Results

Overall, 304 patients consented to the study; however, as the WEMWBS was included in the protocol later in the process, it was included for only 297 patients. A summary of patient and condition characteristics is provided in Table 1; measure scores and completion rates are presented in Tables 2 and 3. Based on the clinician-reported scores, these patients were defined as having a “mild” NSA-4 global negative rating score and “mildly” severe CGI-SCH symptoms.

Table 1 Summary of patient and condition characteristics (N = 304)
Table 2 Summary and descriptive statistics of patient-reported outcomes measure scores
Table 3 Summary and descriptive statistics of clinician-reported outcomes measure scores

3.1 Floor and Ceiling Effects

Floor and ceiling effect statistics are presented in Tables 4 and 5. In general, all of the clinician-reported measures showed evidence of ceiling effects; these were less apparent for the PANSS and HoNOS-PbR total scores than for their subscale scores. The EQ-5D had issues with ceiling effects; these were less apparent for the other patient-reported measures.

Table 4 Floor and ceiling effect assessment for patient-reported outcomes measures
Table 5 Floor and ceiling effect assessment for clinician-reported outcome measures

3.2 Convergent Validity

All correlation coefficients are presented in Tables 6 and 7. In summary, amongst the patient-reported measures, the SF-6D most frequently (six times) exhibited the strongest correlations with the clinician-reported measure scores, with the SQLS Psychosocial, WEMWBS, and EQ-5D each exhibiting the strongest correlation three times. The SQLS Symptoms subscale did not strongly correlate with any of the clinician-reported measures.

Table 6 Correlation coefficient matrix between patient and clinician-reported outcomes measures
Table 7 Correlation coefficient matrix between patient-reported outcome measures

Between patient-reported measures, all correlations were of moderate to strong strength and statistically significant (the exception being correlations with the SQLS Motivation subscale, which were of weak strength and one of which was not statistically significant). This provides some evidence of coherent and consistent self-reporting of outcomes across these measures. The scatter plots and LOWESS lines provided further support of convergent validity between clinician-reported and patient-reported measures, the results and figures for which are described and presented in Appendix S2 in the ESM.

3.3 Known-Group Validity

All effect sizes are presented in Tables 8 and 9. Although the majority of effect sizes were small, the SQLS Psychosocial, WEMWBS, EQ-5D, and SF-6D each indicated some medium to large effect sizes across the clinician-reported measures (but not with the PANSS); all medium to large effect sizes were statistically significant. The WEMWBS indicated a medium effect size between the NSA-4 “mild” and “moderate to severe” cut-offs and was the only patient-reported measure to indicate anything but a small effect size between the NSA-4 groups. Across clinician-reported measures, the WEMWBS, EQ-5D, and SF-6D more often indicated a larger effect size than the SQLS Psychosocial, Motivation, or Symptoms subscales.

Table 8 Testing the known-group validity between the patient-reported and PANSS and CGI measures
Table 9 Testing the known-group validity between the patient-reported and NSA-4 and HoNOS-PbR measures

4 Discussion

A total of 304 patients with persistent symptoms of schizophrenia were recruited to a UK-based cross-sectional survey. These exploratory results suggest that, when patient-reported measures (EQ-5D, SF-6D, WEMWBS, SQLS subscales of Psychosocial, Motivation, and Symptoms) were assessed against clinician-reported measures (PANSS, CGI-SCH, NSA-4, HoNOS-PbR), the patient-reported EQ-5D, SF-6D, WEMWBS, and SQLS Psychosocial subscale had moderate construct validity in patients with schizophrenia. There was less support for the construct validity of the SQLS Symptoms subscale and nearly no support for the SQLS Motivation subscale. There was also evidence of consistent reporting of outcomes between the patient-reported measures, which suggests that those with schizophrenia in this patient sample could report their HRQoL consistently across measures.

4.1 Floor and Ceiling Effects

The EQ-5D had some issues with ceiling effects, which is a common finding across different conditions [27]; these ceiling effects were less apparent for the other patient-reported measures. The high ceiling effect suggests that, for a proportion of patients, the dimensions of EQ-5D are not sensitive to their schizophrenia-specific ill health. The five-level EQ-5D-5L was developed in an attempt to improve sensitivity to changes in health and address the ceiling effect issues associated with the EQ-5D [28].

4.2 Construct Validity

Our results suggest that only the condition-specific SQLS Psychosocial subscale in some cases indicated better construct validity than the generic measures, depending on the clinician-reported measure used for analysis; however, the SQLS Motivation and Symptoms subscales always indicated weak construct validity, and the SF-6D more often indicated better construct validity than any other patient-reported measure. For the EQ-5D and SF-6D, these results were similar to those of previous studies in this patient population [29]. The identified statistically significant correlations between patient-reported measures suggested reasonably consistent reporting of HRQoL across self-reported measures, providing some reliability to the exploratory construct validity assessment.

A consistent result across these patient-reported measures was a lack of construct validity with schizophrenia-related negative symptoms (i.e., PANSS and CGI-SCH negative subscales), suggesting these measures may not be appropriate for assessing negative aspects of schizophrenia, a result echoing those of a previous study [17]. The WEMWBS was relatively more useful in assessing negative symptoms (i.e., CGI-SCH negative subscale and NSA-4), although this was not unexpected given it focuses on psychological wellbeing.

4.3 Implications for Clinical Research and Economic Evaluations

Although this study provides some exploratory evidence to suggest moderate construct validity for the generic measures used as part of the analysis, there is limited evidence of strong construct validity and so a new measure focused around mental health is still desirable to improve on the validity performance of these current measures. In particular, patient-reported measures compared with clinician-reported measures appear less able to capture the impact of negative symptoms.

Despite the use of a schizophrenia-specific measure (i.e., SQLS) to capture condition-specific aspects of schizophrenia, the construct validity results suggested the SQLS provided little benefit over the generic HRQoL measures included in the analysis (particularly the SF-6D). This would therefore suggest a need to develop patient-reported measures that better elucidate and quantify aspects of schizophrenia, such as negative symptoms [29]. However, the assumption that patient-reported measures are valid only insofar as they approximate clinician-reported measures, although convenient and often used in statistical validation exercises, is flawed and highlights a rather patriarchal approach. Clinician- and self-reported measures are two different perspectives on an individual’s experience and can offer complementary (i.e., asking the clinician will produce different but complementary information to that produced by asking the patient) rather than substitutive or equivalent information (i.e., asking the clinician should produce similar information to that produced by asking the patient). It would be unusual for patient-reported and clinician-reported measure scores to show perfect agreement. This difference between patient- and clinician-reported outcomes has been attributed to aspects such as schizophrenia-related cognitive impairment, which impairs the patient’s ability to comprehend and report on their own condition [14]. However, individual experiences are not rendered more accurate through some ill-defined hermeneutic process of interpretation, and the subjective–objective discrepancy in measurement should not necessarily be taken as evidence of a failure of the subjective measure. The practical utility of the SQLS Motivation and Symptoms subscales, given the poor construct validity results, should be interpreted with this in mind.
Classifying and quantifying the complexities of an individual’s experience is a significant challenge, especially with a poorly understood, ill-defined mental health disorder such as schizophrenia. This is further compounded by myriad internal and external factors that influence assessment, attribution, and communication by the patient, not least of which is the person’s “mood” at the time of responding to subjective measurement tools, which should be interpreted in this light [18].

4.4 Limitations

Patients in the sample mainly had mild symptoms, which limits the generalizability of these results to outcomes measurement in a more severe population. The lack of clinically meaningful cut-offs identified in the literature meant we relied on ad-hoc cut-offs to assess known-group validity. Therefore, these results may not be generalizable to other similar studies, and the cut-offs may have had an effect on the interpretation of known-group validity. However, for this exploratory analysis, the use of ad-hoc cut-offs provided informative results about the ability of the patient-reported measures to detect statistically significant effect sizes between groups, which can be compared with the results of future studies when and if clinically meaningful cut-offs are established for the PANSS, CGI-SCH, and NSA-4. Ceiling effects within particular measures (e.g., EQ-5D) may have affected correlation analysis, but the known-group validity was used to confirm results. The data were also cross-sectional, so change over time could not be assessed, which is an important aspect to assess in the context of economic evaluation.

As stated in the introduction, evidence suggests that, in people with schizophrenia, self-report should be interpreted with caution, as condition-related aspects such as cognitive impairment may impair reliability [14]. This is an inherent concern when using self-reported data for outcomes assessment in patients with mental health conditions that could impair their ability to provide reliable responses; however, we explored the convergent validity between the different patient-reported measures as an estimate of the coherence and consistency of patient self-report and thus the potential impact of invalid or random responses (e.g., due to schizophrenia-related cognitive impairment [14]) on the reliability of the overall construct validity assessment of the measures in this population. This approach offered a practical solution to assessing the consistency of reporting between similar patient-reported measures when it was not possible to use other practical methods of assessing reliability in reporting within measures, such as assessing test–retest reliability, which assesses intra-observer reliability within measures by asking the person to complete the same measure twice at different (but chronologically close together) time periods [19]. Our analysis suggested that people in our study cohort had reasonably consistent reporting of outcomes between patient-reported measures, as represented by the moderate to strong correlation strength and statistically significant results; however, as we could not specifically control for cognitive impairment (e.g., using regression analysis controlling for a measure of cognitive impairment, such as the Mini Mental State Examination [30]), some aspects of cognitive impairment could have affected our overall assessment, which should be noted when interpreting our exploratory results.

Because of the large number of outcome measures collected, multiple statistical tests were performed, which can increase the likelihood of erroneous inferences. Statistical methods exist to control for issues associated with multiple testing (e.g., the Bonferroni correction [31]); however, because of the stricter statistical significance threshold associated with these methods, a large sample size is required to identify a statistically significant result. Trials tend not to be powered for multiple testing (as most trials are powered based on a single primary outcome). For this exploratory analysis using data from a cross-sectional survey study, a sample of approximately 300 people was considered adequate, and corrections for multiple testing were not applied, which is a limitation. Therefore, future studies may want to recruit larger sample sizes and apply statistical methods to control for multiple tests to confirm or refute the results identified within the current study.
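To illustrate why such corrections demand more statistical power, the following Python sketch applies the Bonferroni correction, which divides the significance threshold by the number of tests performed. This correction was not applied in the present analysis; the function name and the p values are hypothetical and are not study results.

```python
def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag which p values remain significant at the Bonferroni-adjusted
    threshold alpha / m, where m is the number of tests performed."""
    if not p_values:
        raise ValueError("at least one p value is required")
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical p values from four tests (illustration only).
p_values = [0.001, 0.02, 0.04, 0.30]
print(significant_after_bonferroni(p_values))
# Threshold is 0.05 / 4 = 0.0125, so only the first test remains significant:
# [True, False, False, False]
```

In this example, three tests fall below the unadjusted 5% level but only one survives the correction, showing how a larger number of tests raises the bar for each individual result.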

4.5 Considerations for Future Research

Although this study provides exploratory evidence to suggest moderate construct validity for the generic measures used as part of this analysis, there is limited evidence of strong convergent validity, so a new measure focused around mental health is still desirable to improve on the validity performance of these current measures. Related to this point, preference-based measures that can be used for the purpose of clinical assessment and economic evaluation in the field of mental health research are also lacking. Two new measures (which were unavailable at the time of the study that produced the data for this analysis) have undergone initial validation and have been developed to assess QoL in people with different mental health conditions [32]: Recovering Quality of Life (ReQoL) measure with 10 (ReQoL-10; there are plans to make this version preference-based) or 20 (ReQoL-20) items. The intention is that these measures will “plug a gap” in capturing aspects of QoL important to people with mental health conditions (including those with schizophrenia) for clinical assessment and economic evaluations. Given that the results from this and previous studies have provided mixed evidence for the appropriateness of using the EQ-5D and SF-6D for economic evaluations [11,12,13, 29], using an alternative preference-based measure in patients with schizophrenia (such as the ReQoL-10) should be explored as part of future research. Understanding how the ReQoL measures compare with existing non-preference-based measures in this patient population (e.g., the SQLS and WEMWBS) will inform the measure’s use for clinical outcome assessment.

5 Conclusion

The exploratory results from this study suggest that the patient-reported measures showed moderate construct validity when assessed against clinician-reported measures for some aspects of schizophrenia severity but showed weak construct validity for the negative symptoms of the condition. In particular, the SF-6D had the best overall construct validity but showed a weak relationship with clinician-rated measures for negative symptoms. Compared with the EQ-5D, the SF-6D may better capture HRQoL aspects of schizophrenia for the purpose of economic evaluation. However, a new measure to assess the burden of schizophrenia is still desirable to improve on the psychometric properties of existing measures for this patient population. There was evidence of consistent reporting of outcomes between patient-reported measures, which provides exploratory evidence that patients with schizophrenia can self-report their HRQoL. There is a suggestion that the SQLS can be self-reported to capture complementary information relative to clinician-reported measures, which is desirable when quantifying the wider burden of schizophrenia.