In chronic medical conditions, small changes in clinical status must be readily identifiable so that patients' progress can be monitored and treatment strategies modified, if necessary [1]. In heart failure, for example, numerous clinical indicators are employed to monitor patients' health status over time, including physician assessments (e.g., the New York Heart Association (NYHA) classification system), exercise capacity (e.g., the six-minute walk test), fluctuations in body weight, and biomarkers [2]. Often, however, changes in patients' own perceptions of their health status may not be readily apparent to the clinician or may not be manifested in a manner that easily lends itself to these assessments. As a result, self-reported health-related quality of life (HRQL) measures are increasingly being used to provide complementary insight into a patient's health status [3–6].

HRQL measures have been commonly used in the clinical trial setting in the evaluation of new treatment strategies, and more recently have served as the primary outcome assessment of clinical trials [7, 8]. When used as the primary outcome, identification and quantification of subtle changes in a patient's health status is critical, since the success or failure of the trial depends entirely on the HRQL measure. It is therefore essential that the HRQL measure be sensitive to small, but important, changes in order to determine whether the treatment under study is effective or potentially harmful to the patient.

Several disease-specific and generic measures of HRQL have been used in both clinical practice and in clinical trials of patients with heart failure. To accurately capture changes in health status over time, a HRQL measure must have evidence of longitudinal validity or 'responsiveness' [9]. Responsiveness refers to the ability of a HRQL measure to capture true underlying change in patients' health status over time [9]. Two approaches are commonly used to assess the responsiveness of HRQL measures. The first, the distributional approach, imparts meaning to the HRQL score by evaluating changes in HRQL scores and their associated variability (i.e., standard deviation). Distribution-based methods often establish the responsiveness of a HRQL measure by the degree of 'statistical significance' associated with the change score. Importantly, however, the interpretability of such results is completely dependent on the variability of the data, and a 'statistically significant' change may not necessarily constitute a clinically important change (or vice versa), limiting the ability of these methods to evaluate responsiveness. The second, the anchor-based approach, compares changes in HRQL scores with other clinically meaningful markers, or anchors. Anchor-based approaches are often easier for clinical audiences to interpret than distribution-based approaches. Importantly, however, the external anchor chosen must itself be a valid measure of clinical change.

Several factors may therefore influence the responsiveness of a HRQL measure, including, but not limited to, the content of the measure (i.e., disease-specific versus generic), the validity of the measure, the error associated with the HRQL scores, the indices used to determine responsiveness (e.g., T-statistics, effect sizes, etc.), and the external criterion or 'gold standard' used to identify subjects as changed or unchanged. When considering evidence of responsiveness, it is important to consider the extent to which these factors affect reported estimates of responsiveness [9].

To provide empirical evidence to support this issue, and to assist in the selection and interpretation of health status measures for heart failure, we analyzed longitudinal HRQL data in patients with heart failure to evaluate the relative responsiveness of selected disease-specific and generic HRQL measures. We explicitly compared their relative performance as measured by common responsiveness indices and by different external clinical criteria for change.



Patients with heart failure were recruited through the Cardiovascular Outcomes Research Consortium across 14 medical center outpatient departments in the United States and Canada [2]. All subjects were 30 years of age or older with documented left ventricular systolic dysfunction (left ventricular ejection fraction < 0.40). There were no other exclusion criteria; in particular, no upper age limit was applied. The subjects included in the study were typical of heart failure patients in an outpatient setting. This was a cohort study aimed at evaluating naturally occurring changes in heart failure patients in the outpatient setting; no specific intervention was studied during the follow-up period.

HRQL and clinical measures

Patients completed several HRQL questionnaires at baseline, including the RAND12, EQ-5D, and the Kansas City Cardiomyopathy Questionnaire (KCCQ).

The RAND12 is a short, well-validated, generic measure of health status [10–12]. Patients' overall physical and mental health status was evaluated using the United States (US) population-standardized Physical Component Summary (PCS) and Mental Component Summary (MCS) scores [10, 11].

The EQ-5D is a 5-item self-administered utility measure [13]. In addition to the 5 health state items, the EQ-5D also contains a visual analog scale (EQ-VAS). Utility scores are generated using the time trade-off approach, in which responses to the five items are valued using general population valuation scores. Health state valuations are available for the United Kingdom (UK) [13] and, more recently, the US [14, 15].

The KCCQ is a 23-item heart failure-specific questionnaire, with domains covering physical limitations, heart failure-specific symptoms (e.g., swelling, shortness of breath, fatigue), quality of life, the social impact of the disease, and patients' assessment of their disease knowledge or self-efficacy [16]. The psychometric properties of the measure have been previously established [2, 16]. In addition to domain scores, the KCCQ generates two summary measures, the KCCQ Clinical Summary Score (capturing patients' physical function and symptoms) and the KCCQ Overall Summary Score (including the physical and social function, symptom, and quality of life domains). These summary scores were used for all analyses involving the KCCQ.

In addition to the HRQL questionnaires, several clinical assessments were completed at baseline by the residing cardiologist. Specifically, the patients' baseline New York Heart Association (NYHA) classification was assessed, and patients completed a six-minute walk (6 MW) test according to a standardized protocol [17]. All patients returned to the medical center after approximately 6 weeks for repeat assessment by the cardiologist, at which all measures, including the HRQL questionnaires (i.e., RAND12, EQ-5D, and KCCQ), NYHA classification, and 6 MW test, were re-evaluated.

Criteria for clinical change

There is no universally accepted gold standard for the identification of clinical change in patients with heart failure. As a result, change in the patients' clinical status was assessed using several criteria. First, the cardiologist's assessment of the patients' NYHA classification at baseline and at the 6-week follow-up was used. Subjects were classified as having improved two NYHA classes (e.g., from NYHA class IV to class II), improved one NYHA class, not changed, or deteriorated one NYHA class from baseline to the 6-week follow-up. No subjects deteriorated by two NYHA classes during the 6-week period.

Second, the resident cardiologist, blinded to the subjects' self-reported HRQL and the 6 MW test results, completed a previously validated global rating of change assessment covering the baseline to 6-week follow-up period [2, 18]. The assessment uses a 15-point Likert scale ranging from extremely worse (-7), through no change (0), to extremely better (+7). Subjects were classified into 5 mutually exclusive change categories: substantially improved (+7, +6, +5), moderately improved (+4, +3, +2), no change (+1, 0, -1), moderately deteriorated (-2, -3, -4), and substantially deteriorated (-5, -6, -7).

Finally, the difference from baseline to 6 weeks in distance traveled on the 6 MW test was recorded. This difference was categorized into 7 mutually exclusive categories of clinical change according to previous research [2]: substantial improvement (≥ +100 meters), moderate improvement (+50 to +99 meters), small improvement (+25 to +49 meters), no change (-24 to +24 meters), small deterioration (-25 to -99 meters), moderate deterioration (-100 to -199 meters), and substantial deterioration (≤ -200 meters).
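The 6 MW categorization above amounts to mapping each subject's change in walk distance onto an ordered set of cut points. A minimal sketch of that mapping is shown below; the function name and category labels are illustrative assumptions, not the study's actual code, and the cut points are taken directly from the text.

```python
def classify_6mw_change(delta_meters):
    """Map the change in six-minute-walk distance (follow-up minus
    baseline, in meters) onto the seven mutually exclusive categories
    of clinical change described in the text. Cut points follow the
    stated ranges; labels are illustrative."""
    if delta_meters >= 100:
        return "substantial improvement"
    if delta_meters >= 50:
        return "moderate improvement"
    if delta_meters >= 25:
        return "small improvement"
    if delta_meters >= -24:
        return "no change"
    if delta_meters >= -99:
        return "small deterioration"
    if delta_meters >= -199:
        return "moderate deterioration"
    return "substantial deterioration"
```

Ordering the checks from the largest improvement downward keeps each range implicit in the preceding condition, so the cut points need only be stated once.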


Mean change scores for the HRQL measures were calculated by subtracting the baseline score from the 6-week follow-up score. Four responsiveness indices [9] were calculated for each HRQL measure (i.e., the RAND12 MCS and PCS scores; the EQ-5D–US, EQ-5D–UK, and EQ-VAS scores; and the KCCQ Overall Summary and Clinical Summary scores): the T-statistic (mean change divided by its standard error for the total group), the effect size (ES; mean change divided by the standard deviation of the baseline scores), Guyatt's responsiveness statistic (GRS; mean change divided by the standard deviation of the change scores in subjects who remained unchanged), and the standardized response mean (SRM; mean change divided by the standard deviation of the change scores). The responsiveness indices were calculated for each HRQL measure according to the degree of change identified by each of the three external indicators of change in heart failure status.
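The four indices differ only in which standard deviation appears in the denominator. A minimal sketch of the calculations, under the definitions given above, is shown below; the function and variable names are illustrative assumptions, and the T-statistic is computed in its usual paired form (mean change divided by the standard error of the change scores).

```python
import statistics

def responsiveness_indices(baseline, follow_up, stable_mask):
    """Compute four common responsiveness indices for paired HRQL
    scores. `stable_mask[i]` is True when subject i was classified as
    unchanged on the external criterion (needed for Guyatt's statistic).
    Illustrative sketch, not the study's actual code."""
    change = [f - b for b, f in zip(baseline, follow_up)]
    mean_change = statistics.mean(change)
    sd_baseline = statistics.stdev(baseline)   # denominator for ES
    sd_change = statistics.stdev(change)       # denominator for SRM
    sd_stable = statistics.stdev(              # denominator for GRS
        [c for c, s in zip(change, stable_mask) if s]
    )
    n = len(change)
    return {
        "t": mean_change / (sd_change / n ** 0.5),  # mean change / SE
        "es": mean_change / sd_baseline,
        "grs": mean_change / sd_stable,
        "srm": mean_change / sd_change,
    }
```

Note that the GRS and SRM share the same numerator and differ only in whether the change-score standard deviation is taken from the clinically stable subgroup or from all subjects, which is why the two can diverge when the stable subgroup is unusually variable (or unusually quiet).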

To facilitate comparison, the median rank of each HRQL measure was determined across the four responsiveness indices [19]. For a HRQL measure to be a valid measure of clinical change, it must be capable of capturing both improvement and deterioration in clinical status. Accordingly, the categories for improvement and deterioration were combined to provide a single overall median rank, depicting the overall relative responsiveness of the HRQL measure to changes in heart failure clinical status. Of note, the HRQL scores of subjects who remained stable (i.e., the 'no change' categories) were not included in the calculation of the overall median rank of responsiveness.
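The pooling step above can be sketched as follows: ranks from the improvement and deterioration categories are combined, the 'no change' categories are excluded, and the median of the pooled ranks is taken. The data layout and names here are illustrative assumptions.

```python
import statistics

def overall_median_rank(ranks_by_category):
    """Combine one HRQL measure's ranks (one rank per responsiveness
    index, grouped by external-criterion change category) into a single
    overall median rank, excluding 'no change' categories as described
    in the text. Illustrative sketch only."""
    pooled = [
        rank
        for category, ranks in ranks_by_category.items()
        if category != "no change"
        for rank in ranks
    ]
    return statistics.median(pooled)
```

For example, a measure ranked {1, 2, 1, 2} across the four indices in the improvement categories and {2, 3, 1, 2} in the deterioration categories would receive an overall median rank of 2, regardless of its ranks among stable subjects.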


A total of 476 subjects were enrolled in the study and provided baseline and 6-week follow-up data [2]. Of these, 298 subjects had complete data and were included in the analysis. Subjects included in the analysis did not differ from the total cohort in age, sex, body mass index, comorbidities, heart failure symptoms, or baseline health status (p > 0.05 for all comparisons). Subjects were mainly elderly, male, and overweight, with a significant history of other comorbidities (Table 1). Most subjects were in NYHA class II or III heart failure, indicating moderate to severe heart failure symptoms. Overall, baseline scores indicate that subjects reported substantial deficits in their HRQL, consistent with the clinical indicators of their condition (Table 2).

Table 1 Baseline Characteristics
Table 2 Baseline Health Status and Clinical Measures

Classification according to clinical criteria

Upon follow-up (average 6 ± 2 weeks), 52 (17%) subjects were classified as improved according to the NYHA class (Table 3), 60 (20%) subjects improved according to the global rating of change (Table 4), and 101 (34%) subjects improved according to the 6 MW test (Table 5). Conversely, 40 (13%) subjects were classified as deteriorated according to the NYHA class, 32 (11%) subjects deteriorated according to the global rating of change, and 83 (28%) subjects deteriorated according to the 6 MW test. Overall, 206 (69%) subjects were classified as not changed according to both the NYHA class (Table 3) and the global rating of change (Table 4), and 114 (38%) according to the 6 MW test (Table 5).

Table 3 Baseline, 6 week, and mean change scores according to change in health status according to the external criterion: New York Heart Association Classification
Table 4 Baseline, 6 week, and mean change scores according to change in health status according to the external criterion: Global Rating of Change
Table 5 Baseline, 6 week, and mean change scores according to change in health status according to the external criterion: Six-Minute Walk Test (meters)

HRQL responsiveness to change

Overall, the magnitude of change in the HRQL scores was larger for subjects who improved than for subjects who deteriorated over the 6-week period, with the exception of the 6 MW data, which showed the opposite trend (Table 5). Importantly, however, fewer subjects were classified as having deteriorated over the follow-up period, irrespective of the external criterion used (Tables 3, 4, 5). As expected, relatively small changes in HRQL scores occurred in subjects who were classified on the external clinical criteria as having not changed during the follow-up period.

Similar to the raw HRQL change scores, the magnitudes of the responsiveness indices were influenced by the direction of clinical change (Tables 6, 7, 8). Overall, the responsiveness indices were larger for subjects who improved during the follow-up period than for those who deteriorated, irrespective of the responsiveness index calculated. For example, the T-statistic for the EQ-5D–US scoring system for subjects who improved substantially (i.e., +5, +6, +7) on the global rating of change criterion was 2.13, compared with only 1.00 for subjects who substantially deteriorated (i.e., -5, -6, -7) (Table 7). Similar trends were observed for the other HRQL measures.

Table 6 Responsiveness statistics and relative ranking of selected HRQL measures according to change on the external criterion: New York Heart Association
Table 7 Responsiveness statistics and relative ranking of selected HRQL measures according to change on the external criterion: Global Rating of Change
Table 8 Responsiveness statistics and relative ranking of selected HRQL measures according to change on the external criterion: Six-Minute Walk Test

In general, the relative ranking of the HRQL measures was similar regardless of the responsiveness index used (Tables 6, 7, 8). Irrespective of the responsiveness index, the KCCQ Clinical Summary Score and Overall Summary Score were consistently ranked as the most responsive measures (Figures 1, 2, 3). Interestingly, the KCCQ Overall Summary Score was ranked as the most responsive measure according to the global rating of change (Figure 2) and the 6 MW test (Figure 3), but not with respect to the NYHA classification (Figure 1). This is not surprising, as the NYHA classification focuses mainly on the 'clinical' aspects of heart failure (symptoms and function). As a result, this criterion is more closely aligned with the KCCQ Clinical Summary Score than with the KCCQ Overall Summary Score, which also includes the domains of social function and quality of life.

Figure 1

Overall Relative Rank for Selected HRQL Measures According to External Clinical Criterion: New York Heart Association.

Figure 2

Overall Relative Rank for Selected HRQL Measures According to External Clinical Criterion: Global Rating of Change.

Figure 3

Overall Relative Rank for Selected HRQL Measures According to External Clinical Criterion: Six-Minute Walk Test.

Although small differences existed, the relative rankings of the generic HRQL measures across the responsiveness indices were similar and generally differed by only one rank position. For example, the EQ-5D–US scoring system for subjects who improved by two NYHA classes had a relative ranking of '3' using the T-statistic, ES, and GRS, and a rank of '4' using the SRM. Thus, the relative responsiveness of a disease-specific versus a generic HRQL measure was not substantively influenced by the choice of responsiveness index. Differences did exist, however, in the relative ranking of the generic HRQL measures according to the external clinical criterion of change used (Tables 6, 7, 8; Figures 1, 2, 3), although the observed differences were relatively small. In general, the EQ-5D scoring systems and the RAND12 PCS score were not as responsive as the RAND12 MCS score or the KCCQ scores.


In this study, we found that the disease-specific KCCQ was more responsive to underlying clinical change than either the generic EQ-5D or the RAND12. The greater responsiveness of the KCCQ was consistent across all four responsiveness indices and across the three external clinical criteria of change. Importantly, we also found that the methods and definitions used to define true underlying change can have a major influence on the perceived responsiveness of a HRQL measure [19, 20], particularly for the generic measures. Our results showed that a generic HRQL measure may be highly responsive when compared against one clinical anchor yet less responsive against another; these effects were less apparent with the KCCQ. Consequently, different methods of judging whether an important change has occurred in a patient can lead to different conclusions regarding the responsiveness of a HRQL measure. These are important results, given the broad range of approaches that may be applied in the assessment of responsiveness [9].

It is important to understand and interpret how changes in the HRQL scores over time reflect true underlying change in patients with heart failure. Clinicians are interested in determining if the difference in HRQL scores over time signifies a trivial, small but clinically important, moderate, or a large change in HRQL [5, 21]. This information is critical in guiding clinical decisions with respect to the patients' management. Furthermore, in the clinical trial setting, this information is equally important to determine if there is sufficient evidence to support a new treatment modality.

The high responsiveness of the KCCQ might have been expected, as disease-specific measures, by design, typically have very strong content validity for a particular disease or population. It is generally accepted that when true change occurs in the setting of a clinical trial, disease-specific measures are more responsive to this change than generic measures of HRQL [22]. They are often perceived as more clinically relevant and 'sensible' by both patients and clinicians [23]. Disease-specific measures also generally explore a single domain in greater depth than the corresponding domain in a generic measure [24].

The KCCQ, for example, specifically addresses the impact of dyspnea, a prominent complaint for people with heart failure. On the RAND12, dyspnea can be captured only in much broader terms under the domain of physical functioning. Thus, disease-specific measures may be more sensitive and responsive to within-patient change than generic measures [22]. Such change in HRQL is often easier to identify using disease-specific measures, since changes observed on the measure are more closely associated with changes in clinical measurements familiar to clinicians [25]. In addition, the lower responsiveness of the generic measures may be due, in part, to competing comorbidities, other than the severity of or changes in the patient's heart failure, that can influence generic scores.

While our data suggest that the use of a disease-specific measure, like the KCCQ, may provide the best opportunity to capture small but highly relevant clinical changes in heart failure patients, not all changes in HRQL may be captured with a disease-specific measure, as the overall effects of a new treatment are often not fully known. For example, we were intrigued by the relative performance of the physical and mental health summary scores of the RAND12. Although heart failure is often perceived largely as a physical disease, the RAND12 PCS score performed very poorly compared with the other HRQL measures. The reason for its lower responsiveness is not known but may be related to the underlying health status of the study population. RAND12 PCS scores were, on average, 1.5 standard deviations below the standardized US population mean at baseline, indicating that the subjects had significant physical deficits (Table 2). It is therefore possible that subjects were too physically limited to change significantly over the 6-week follow-up period. Alternatively, since patients were recruited from the outpatient setting, they would be expected to have relatively stable heart failure symptoms, and significant physical changes may not be expected over 6 weeks. Interestingly, however, the RAND12 MCS was relatively more responsive to changes in heart failure status. The impact of mental health and its monitoring in patients with heart failure warrants further investigation [26].

Also of note, the relative responsiveness of the HRQL measures depended on the direction of clinical change. Overall, the HRQL measures were more responsive to improved clinical status than to deteriorating clinical status. This may be related to the fact that the heart failure patients in this study were quite severely affected by their disease and had substantial deficits in their HRQL. As a result, HRQL measures may be susceptible to 'floor effects' in this population, which may limit their ability to capture deterioration in clinical status. The observation that no patient deteriorated by two NYHA classes further supports this hypothesis. Thus, not only can the responsiveness indices and external criterion standards influence instrument responsiveness; the population studied must also be considered.

Irrespective of the scope of the HRQL measure or the direction of clinical change, the responsiveness of a particular measure may also be influenced by the responsiveness index used. Although the responsiveness indices used in this study provided similar rankings, differences did exist among them where the generic HRQL measures were concerned. In this study, four commonly used responsiveness indices were employed [9, 27]. Terwee et al. have identified more than 30 different responsiveness calculations described in the literature for identifying change in a patient's HRQL [9].

Although most indices use the mean change in HRQL over time, there are significant differences in how the standard deviation, or variability of the data, enters the calculation. For example, the GRS is calculated using the standard deviation of the change scores among subjects who are clinically stable, whereas the SRM uses the standard deviation of the change scores for all subjects. Significant differences in variability could therefore exist across the selected subgroups, resulting in differences in the perceived responsiveness of a HRQL measure depending upon the responsiveness index chosen.

Our results and interpretation should be considered in light of several potential limitations. First, as discussed, the identification of patients who truly changed during the follow-up period was subjective, as no 'gold standard' exists for patients with heart failure. We did, however, apply the NYHA classification system, a physician global rating of change assessment, and the 6 MW test, which are well-validated and common methods for identifying change in clinical status in patients with heart failure [2, 17, 18]. Furthermore, the same cardiologist evaluated subjects at both baseline and the 6-week follow-up, improving the internal consistency of the change ratings; this may provide the most appropriate method for validating the ability of the HRQL measures to identify true clinical change in this population [2]. Second, the categorical cut points used for the 6 MW data to indicate the magnitude of clinical change may have affected the responsiveness results. We reanalyzed the data using different categorical cut points but obtained the same relative ranking of the HRQL measures; it is therefore unlikely that changes in the cut points for the 6 MW clinical change categories would significantly alter our results. Third, fifteen subjects were missing either baseline or 6-week EQ-VAS scores. It is possible that restricting the sample to subjects with complete responses on all HRQL measures could have changed the relative ranking of the HRQL measures. We believe this is unlikely, however, as the majority of subjects missing EQ-VAS scores were classified as not changing over the 6 weeks according to the external criteria; the effect of the missing data on the responsiveness rankings should therefore be minimal. Fourth, we chose to evaluate four of the more commonly used responsiveness indices in this study. Numerous other responsiveness indices have been reported in the literature.
While it is possible that different results would have been observed with other responsiveness indices, we think this is unlikely. Finally, the duration of follow-up was relatively short and may have affected the estimated responsiveness of the HRQL measures. Each HRQL measure uses a different recall period, ranging from 'today' (EQ-5D) to 'within the past 4 weeks' (RAND12), to assess patients' health status. The 6-week follow-up period was chosen to be short enough to preserve the recall accuracy of the cardiologist evaluating the patient, yet long enough to allow meaningful clinical change to occur [2].

With these limitations in mind, we believe this study highlights the importance of considering measurement properties and instrument content when selecting HRQL measures for clinical trials. It is important to have a measure capable of detecting small but highly relevant and important changes in HRQL, especially when the HRQL outcome of interest represents the primary outcome or main secondary outcome of the trial. Furthermore, in designing a clinical trial, researchers must be confident that the HRQL measure is responsive to the minimally important difference hypothesized during the study design; the required sample size and the power of the study are directly related to the minimally important difference the researchers wish to detect. However, while a disease-specific measure may be considered the first choice for a primary HRQL outcome, combining it with a generic HRQL measure may still be desirable to fully assess HRQL outcomes in clinical trial settings. Further, if the therapy under consideration will be evaluated under a cost-effectiveness framework, it would be desirable to include a measure compatible with that framework, such as a utility-based measure [28].


We found that the disease-specific measure, the KCCQ, was the most responsive HRQL measure for assessing change over a 6-week period, although generic measures provide information for which the KCCQ is not suited. We also noted that the responsiveness of generic HRQL measures may be affected by the responsiveness index used, as well as by the choice of external criterion used to identify patients who clinically changed or remained stable.