Quality of Life Research, Volume 22, Issue 3, pp 475–483

Methods for interpreting change over time in patient-reported outcome measures

  • K. W. Wyrwich
  • J. M. Norquist
  • W. R. Lenderking
  • S. Acaster
  • the Industry Advisory Committee of International Society for Quality of Life Research (ISOQOL)

Abstract

Purpose

Interpretation guidelines are needed for patient-reported outcome (PRO) measures’ change scores to evaluate efficacy of an intervention and to communicate PRO results to regulators, patients, physicians, and providers. The 2009 Food and Drug Administration (FDA) Guidance for Industry Patient-Reported Outcomes (PRO) Measures: Use in Medical Product Development to Support Labeling Claims (hereafter referred to as the final FDA PRO Guidance) provides some recommendations for the interpretation of change in PRO scores as evidence of treatment efficacy.

Methods

This article reviews the evolution of the methods and the terminology used to describe and aid in the communication of meaningful PRO change score thresholds.

Results

Anchor- and distribution-based methods have played important roles, and the FDA has recently stressed the importance of cross-sectional patient global assessments of concept as anchor-based methods for estimation of the responder definition, which describes an individual-level treatment benefit. The final FDA PRO Guidance proposes the cumulative distribution function (CDF) of responses as a useful method to depict the effect of treatments across the study population.

Conclusions

While CDFs serve an important role, they should not be a replacement for the careful investigation of a PRO’s relevant responder definition using anchor-based methods and providing stakeholders with a relevant threshold for the interpretation of change over time.

Keywords

Patient-reported outcome; Interpretation; Anchor-based; Distribution-based; Change over time; Quality of life; Cumulative distribution function; Minimal important difference; Responder definition

Abbreviations

AQLQ: Asthma Quality of Life Questionnaire
CDF: Cumulative distribution function
CHQ: Chronic Heart Failure Questionnaire
CRQ: Chronic Respiratory Questionnaire
ECOG: Eastern Cooperative Oncology Group
ES: Effect size
FDA: Food and Drug Administration
IAC: Industry Advisory Committee
ISOQOL: International Society for Quality of Life Research
MCID: Minimal clinically important difference
MID: Minimal important difference
PRO: Patient-reported outcome
QOL: Quality of life

Introduction

The use of patient-reported outcome (PRO) measures in research studies, clinical trials, and clinical practice has risen dramatically over the last 30 years and will continue to rise as health care assessments become increasingly patient centered [1]. The development or selection of the appropriate PRO instrument for measuring the most relevant and appropriate endpoints requires attention to the conceptualization of the measure, to the PRO’s content validity, and to its other measurement properties: reliability, other forms of validity, and ability to detect change. However, once PRO measures with established and acceptable measurement properties have demonstrated statistically significant changes, further research to establish benchmarks for interpretation of results is necessary. Interpretation guidelines are required for change scores to evaluate efficacy of an intervention and communicate PRO results to regulators, patients, physicians, and providers [2]. In 2009, the U.S. Food and Drug Administration (FDA) published the final Guidance for Industry Patient-Reported Outcomes (PRO) Measures: Use in Medical Product Development to Support Labeling Claims (hereafter referred to as the final FDA PRO Guidance) [3] that included specific recommendations for the interpretation of change in PRO scores as evidence of efficacy of treatments. Despite the usefulness of this document, the Agency’s recommendations continue to evolve over time [4].

This article provides a historical perspective and a focused review of key publications in the evolution of methods used for interpreting treatment effects from endpoints designed to provide evidence for FDA medical product labeling claims based on PRO measures. These interpretation methods for longitudinal clinical trial results have been developed and debated over several decades and include anchor-based and distribution-based methods for interpreting change over time and establishing interpretation guidelines, as well as the use of cumulative change distribution curves. Evolution in terminology for describing what is a meaningful improvement in a PRO endpoint is reviewed to provide a historical context for some of the many terms to describe an important threshold for changes in PRO endpoints. Finally, the current challenges and recommendations to improve the understanding of PRO trial results that balance the interpretation needs of many stakeholders in the medical product development process are discussed, while recognizing that two other important aspects of PRO interpretation—response shift and proxy respondents/measurements—are outside the scope of this article.

History

Historically, for many clinical measurements or examinations, extensive patient experience was usually a feasible and valid way for physicians to assess the significance of instrument score changes over time. However, since most PRO measures are predominantly used as research tools, not clinical practice instruments, there may be a lack of such experience to assess the meaningfulness of a change. In addition, changes in PRO scores are usually expressed as units on an abstract scale that need to be correlated with something more interpretable in order to acquire meaning. Moreover, because statistical significance does not guarantee that observed differences between treatments or within an individual over time are important or meaningful to patients, there is a need to provide a systematic approach to document what level of change in a PRO measure is important to patients.

To address these concerns, Jaeschke et al. [5] were the first to introduce the term minimal clinically important difference (MCID) for PRO instruments in 1989 to indicate the “smallest difference in score in the domain of interest which patients perceive as beneficial and which would mandate, in the absence of troublesome side effects and excessive cost, a change in the patient’s management.” To determine an MCID, the authors used data from three studies [6, 7, 8] that included either the Chronic Respiratory Questionnaire (CRQ) [9] or the Chronic Heart Failure Questionnaire (CHQ) [10]. These questionnaires differ on a single item, and both assess the domains of dyspnea, fatigue, and emotions using a 7-point Likert scale. In addition, in all three studies, at follow-up visits, patients were asked a “global rating of change” question (with possible ratings from −7, “a great deal worse,” to +7, “a great deal better”) for each of the three domains to assess whether they had experienced change since the start of their treatment. By using the CRQ and CHQ data and the global rating of change responses as the anchors to define small, medium, and large changes (see details on anchor-based methods in section “Review of methods” below), Jaeschke and colleagues concluded that an MCID for both instruments was approximately 0.5 per item, which was also consistent with consensus expert opinion at that time.

In the 1990s, several researchers applied this important change methodology suggested by Jaeschke and colleagues, including other McMaster University colleagues investigating these thresholds for the Asthma Quality of Life Questionnaire (AQLQ) [11]. This 1994 study improved on the 1989 study’s methodology by: (1) investigating whether the magnitude of important improvements is similar to that of important deteriorations and (2) recognizing that the process did not incorporate any clinical assessments or judgments and therefore represents the minimal important difference (MID), not the MCID. In their conclusion, Juniper and colleagues demonstrated that the 0.5 MID is important when the AQLQ is used to examine within-patient changes (both improvements and declines), but the same threshold does not necessarily apply when examining differences between patients, and presumably, between patient groups [11].

With the endorsement of the International Society for Quality of Life Research (ISOQOL), in 2000, a team of 30 researchers (the Clinical Significance Consensus Meeting Group) assembled in a consensus writing meeting held at the Mayo Clinic, Rochester, MN (USA), to address the clinical significance of quality of life (QOL) measures in oncology research and practice. In addition to an overview paper, this group produced six articles that delineated in simple language the degree of consensus that existed with regard to clinical significance, the areas of disagreement, and the areas where further discussion is needed [12, 13, 14, 15, 16, 17, 18]. The six papers summarized in Table 1 clearly indicate that: (1) there was more consensus than controversy regarding how clinical significance could be ascertained and conveyed; and (2) no single statistical rule or procedure could take the place of well-reasoned considerations of all aspects of the data. Overall, this consensus writing meeting and related publications were a step forward in providing an impetus for inclusion of PRO endpoints into clinical research and practice, but they did not resolve the issue of what is the best method for determining a meaningful difference in PRO change scores.

Table 1. Key meetings and communications related to the interpretation of change over time for PRO measures

Year: 2000
Meeting title and location: Symposium on the Clinical Significance of QoL Measures in Cancer Patients, Mayo Clinic, Rochester, MN (USA)

  • Paper topic and reference: 1. Methods to explain the clinical significance of health status measures [13]
    Main conclusions: Distribution-based estimates may not suffice on their own but are useful if consistent with anchor-based results. Anchor-based results generated from only one anchor may need to be supplemented with validation using alternative anchors. More work is needed on interpretation approaches if they are to be used by clinicians in their day-to-day practice.
  • Paper topic and reference: 2. Group versus individual clinical significance differences [14]
    Main conclusions: Group-level data may be used to guide decisions about individual changes but not without accounting for the presence of measurement error. A multi-anchor approach is suggested to establish individual clinical significance, in which patient self-report, individual preferences, clinical expectations, and empirical behaviors are included in the interpretation of change, giving the patient self-rating the highest weight.
  • Paper topic and reference: 3. Single items versus summated scores [15]
    Main conclusions: If a detailed description of the construct of interest is required, then multi-item indices may be required. On the other hand, if only a global impression of QoL is needed, then a single measure or item score may be sufficient. The clinical significance of QoL assessments does not depend on how the score is constructed, and the same methodologies and interpretation can be applied to both single-item global measures and multi-item indices.
  • Paper topic and reference: 4. Patients, clinicians and population perspectives on clinical significance of HRQL data [16]
    Main conclusions: More research is needed to create clear QoL interpretation guidelines for clinicians and health service providers. More patient input is required on what constitutes a clinically important difference. Clinicians can then use this as a guide in their clinical practice.
  • Paper topic and reference: 5. Assessing change over time [17]
    Main conclusion: A checklist was created as a guide for clinicians to critically assess and interpret longitudinal QoL data and use it in the treatment decision process.
  • Paper topic and reference: 6. Interpreting the clinical significance of HRQL results from 2 perspectives: clinical trials and clinical practice [18]
    Main conclusion: No universal approach can determine the clinical significance of HRQL data for both research and practice settings. A difference can be meaningful for one context, but not for the other.

Year: 2006
Meeting title and location: FDA Guidance on Patient-Reported Outcomes: Discussion, Dissemination, and Operationalization, Chantilly, VA (USA)

  • Paper topic and reference: FDA Perspective on PROs to support medical product labeling claims [22]
    Main conclusion: In the paper by Patrick et al., a conceptual distinction is made on interpretation of PRO data depending on how the patient response to treatment is being measured by the PRO: (a) comparison of the average change from baseline across all patients in treatment and control group according to the between-group criteria or MID; (b) comparison of the proportion of patients in each group who meet the prespecified criterion for response or “responder criteria”.
  • Paper topic and reference: Interpreting and Reporting Results Based on Patient-Reported Outcomes [21]
    Main conclusion: This paper focuses on issues associated with assessing clinical significance and common pitfalls to avoid in presenting results related to PROs. Specifically, the questions addressed by this manuscript involve: What are the best methods to assess clinical significance for PROs? How should investigators present PRO data most effectively in a Food and Drug Administration (FDA) application? In labeling or in a scientific publication?

The FDA draft PRO Guidance, issued in February 2006, provided some text on the complicated issue of methods for interpretation of results generated from PRO measures for medical product labeling claims [19]. In the draft PRO Guidance, the MID was presented as an approach to facilitate interpretation of clinical trial results of PRO endpoints. For widely used measures such as treadmill distance or the Hamilton Depression Rating Scale [20], it was suggested that the ability to show any difference was treated as evidence of a relevant treatment effect. However, the draft PRO Guidance suggested that PRO instruments may be more sensitive than past measures, and thus, an MID benchmark can serve as a guide for interpreting mean differences. The concept of “mean effect” (i.e., differences between group means or MID) was distinguished from a responder effect criterion, defined as “change in an individual that would be considered important” (line 542). The draft PRO Guidance reviewed methods to derive MIDs, including mapping changes in a PRO score to non-PRO measures or to other PRO scores such as global impression of change, distribution-based approaches (see details on distribution-based methods in section “Review of methods” below) and empirical rules (e.g., a fixed percentage of a theoretical range). At the same time, the draft PRO Guidance noted that distribution-based approaches and empirical rules were problematic because these approaches do not directly reflect patient preferences or assessments of meaningful change and do not address what magnitude of treatment differences may be clinically meaningful.

Finally, the draft PRO Guidance stated that “there may be situations where it is more reasonable to characterize the meaningfulness of an individual’s response to treatment than a group’s response” (lines 571–572). Therefore, this document suggested that it would be acceptable to categorize a patient as a responder based on prespecified criteria backed by empirically derived evidence of the responder definition. It is important to note that in the draft PRO Guidance, the FDA specifically asked for comments on the need and appropriate review standards for the MID and responder definitions applied to PRO instruments in the context of drug development.

In February 2006, a meeting titled “FDA Guidance on Patient-Reported Outcomes: Discussion, Dissemination, and Operationalization” was held in Chantilly, VA, USA. The intent of the meeting, organized jointly by the Mayo Clinic and the FDA, was to: (1) facilitate review and discussion of the FDA draft PRO Guidance among diverse stakeholders and (2) generate a supplement to the FDA draft PRO Guidance that would provide detail and exposition that was not possible to communicate in the relatively brief 36-page document. Based on discussions during this meeting, two key articles that addressed interpreting and reporting PRO data were published in Value in Health (Table 1) [21, 22]. The meeting and related publications provided an elaboration of the contents of the FDA draft PRO Guidance, including interpretation of PRO data, from the viewpoint of experts in academia, industry, clinical research, clinical practice, and FDA reviewers.

With the 2009 release of the final FDA PRO Guidance, FDA clarified concerns about interpretation of individual versus group PRO score change over time [3]. This most recent PRO Guidance focused on individual responses using an a priori responder definition representing “the individual patient PRO score change over a predetermined time period that should be interpreted as a treatment benefit” [3] (p. 24). Use of empirical evidence derived from anchor-based methods was proposed as the appropriate methodology to explore associations between the targeted concept of the PRO instrument and the concept measured by the anchors. Different types of anchors such as clinical measures and patient global ratings of change were suggested. The final FDA PRO Guidance indicated that distribution-based approaches should be considered supportive in determining clinical significance of particular score changes and “are not appropriate as the sole basis for determining a responder definition” [3] (p. 25).

The final FDA PRO Guidance also proposed the alternative of presenting the entire distribution of a clinical trial’s change scores for both treatment and control groups as a cumulative distribution graph (see details on cumulative distribution function of responses in section “Review of methods” below). The cumulative distribution function avoids the need to define a responder and allows for the evaluation of the entire distribution of change scores for both treatment and control groups.

Thus, the FDA’s draft and final PRO Guidance distinguished between individual change and group differences and emphasized the need for an empirically determined definition of a meaningful individual-level change threshold as the desired approach to interpreting change over time in PRO scores. Moreover, in the final PRO Guidance, any reference to the term MID as the interpretation of group change score differences was removed. The evolution of the specific methods that elucidate the best estimate for a responder definition within a specific clinical trial setting is described below.

Review of methods

Anchor-based approach

The categorization of anchor-based and distribution-based methods for interpreting PRO scores was a taxonomy first proposed in 1993 by Lydick and Epstein [23]. Anchor-based methods were considered those that explore the association between the targeted concept of a PRO instrument and the same or closely related concept measured by an independent anchor or anchors. Hence, changes seen in the PRO instrument are compared, or anchored, to changes on the anchoring item or measure. Potential anchors fall into three broad categories: patient ratings, clinician ratings, and direct clinical anchors. In all instances, it is imperative that any selected anchor should have intuitive meaning, be easier to interpret than the PRO instrument itself, and have an appreciable association with the PRO instrument. The minimum magnitude of this association between the PRO change scores and the anchor, however, is currently under debate, with recommended correlations of at least r = 0.3 [24] or r = 0.5 [25] in absolute value.
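
As an illustration of the association criterion described above, the following minimal sketch checks whether a candidate anchor correlates with PRO change scores at a chosen minimum magnitude. The function names, the data, and the use of the Pearson coefficient are assumptions made for this example; the text above does not specify which correlation coefficient should be used.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def anchor_is_adequate(pro_change, anchor, min_abs_r=0.3):
    """True if |r| between PRO change and the anchor reaches min_abs_r.

    min_abs_r = 0.3 or 0.5 mirrors the two thresholds cited in the text.
    """
    return abs(pearson_r(pro_change, anchor)) >= min_abs_r


# Hypothetical data: PRO change scores and a global rating coded -3..+3.
pro_change = [0.6, 0.4, 0.5, 1.2, 0.0, -0.1, 1.0, 0.5]
global_rating = [1, 1, 1, 3, 0, 0, 2, 1]

r = pearson_r(pro_change, global_rating)
print(f"r = {r:.2f}; adequate at 0.5: "
      f"{anchor_is_adequate(pro_change, global_rating, 0.5)}")
```

Because global ratings are ordinal, a rank-based coefficient may be preferred in practice; the sketch only shows where the threshold enters the decision.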

The most commonly reported anchor-based method is that first suggested by Jaeschke et al. [5], based on within-patient global ratings of change. This method involves asking patients to rate overall how much change they experienced on a PRO concept between two relevant time points (e.g., baseline and end of study) as “about the same” or on a gradient of “better” or “worse.” The gradients of global rating of change assessments typically range from a 15-point scale [5] to a 7-point scale [26], with greater preference for the latter due to a clearer distinction between response options. The PRO change scores of those patients choosing “minimal” or “small” change responses can then be averaged to calculate the MID [5, 11]. Likewise, the average change scores of those selecting the moderate and large change gradients can be used to derive important difference thresholds and to establish a responder definition [26] if responses greater than minimal or small changes are better descriptions of a treatment benefit. It is also important to note that for some health conditions with a known history of progressive deteriorations, no change over time may represent a treatment benefit.
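
The averaging step described above can be sketched as follows. This is only an illustration under stated assumptions: the data are invented, the global rating of change is assumed to be coded from −3 to +3 on a 7-point scale, and the choice of which rating categories count as "small" or "moderate to large" is hypothetical rather than taken from any cited study.

```python
from statistics import mean


def anchor_based_threshold(change_scores, anchor_ratings, anchor_levels):
    """Mean PRO change among patients whose global rating is in anchor_levels."""
    selected = [c for c, a in zip(change_scores, anchor_ratings)
                if a in anchor_levels]
    if not selected:
        raise ValueError("no patients fall in the requested anchor categories")
    return mean(selected)


# Hypothetical trial data: per-patient PRO change and global rating of change.
pro_change = [0.6, 0.4, 0.5, 1.2, 0.0, -0.1, 1.0, 0.5]
global_rating = [1, 1, 1, 3, 0, 0, 2, 1]  # assumed coding: -3 .. +3

# Assumed convention: +1 is a "small" improvement, +2 and +3 are larger.
mid_estimate = anchor_based_threshold(pro_change, global_rating, {1})
larger_change = anchor_based_threshold(pro_change, global_rating, {2, 3})

print(f"MID estimate (small improvement): {mid_estimate:.2f}")
print(f"Mean change (moderate/large improvement): {larger_change:.2f}")
```

Which of the two values serves as the responder definition depends on whether a change greater than minimal is judged the better description of treatment benefit, as noted above.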

These patient global ratings of change are easily interpreted; however, they have been criticized for relying on patients’ reconstructive memories, which can be poor and result in a systematic underestimation of the initial state and a recall bias for the present state [27]. This is apparent when the change response has a high positive correlation with the end of study measurement and a near-zero correlation with the baseline measurement [28, 29]. Furthermore, the FDA has recently suggested that patient ratings of change are inappropriate for certain diseases, such as irritable bowel syndrome, due to the high level of symptom variability across short periods of time, in addition to the error associated with retrospective assessments over long time periods. To address these issues, the use of a patient global rating of concept was suggested by the FDA in 2010 [4]. This method involves asking patients to rate their current state on the concept of interest at each key time point (e.g., “How would you rate your IBS symptoms overall over the past seven days?”). Changes in the global rating of concept across time points (e.g., from baseline to end of treatment) can then be calculated to create responder definitions in much the same way as global ratings of change. However, any investigation of the global rating of concept method should give careful consideration to whether the anchor item: (1) accurately measures the PRO’s concept; (2) includes meaningful and useful response options; and (3) can inform when an important change over time happens from the patients’ perspective.

To appreciate important changes requiring clinical judgment, clinician ratings and direct clinical anchors can be used. Clinicians’ global ratings of change employ the same methodology as patient global ratings of change: clinicians are asked to rate a patient’s magnitude of change or improvement over time, and these ratings have been used to identify a PRO responder definition [30]. Clinician ratings of meaningful differences have, however, also been criticized due to the incongruity between patients’ and clinicians’ perceptions of important change [31] and are best applied when patient judgment may be impaired (e.g., mental health conditions).

Direct clinical anchors are thus also commonly used to interpret change in a PRO. These anchors link change in PRO concepts or domains with change in an external criterion. For example, Kosinski et al. [32] assessed change among patients with rheumatoid arthritis on the SF-36 and Health Assessment Questionnaire based on minimal, moderate, and large categorical improvements in joint tenderness and swelling counts, as well as patient and physician global assessments, and a global pain assessment. Eton et al. [33] assessed change in four breast cancer endpoints based on analgesia use, change in Eastern Cooperative Oncology Group (ECOG) performance status, and response to treatment (complete response, partial response, stable disease, and progression).

As explained earlier, to be useful in understanding a PRO’s responder definition, clinical anchors need to be relevant to, and correlated with, the QOL concept under consideration. Thus, the joint tenderness or swelling count anchors used by Kosinski et al. [32] may have been appropriate to interpret change on the SF-36 bodily pain subscale, but perhaps less appropriate for other subscales. Furthermore, to assess change in a PRO score associated with different levels of important improvement (minimal, moderate, and large), clinical assessments that use cross-sectional categorical ratings (e.g., none, mild, moderate, and severe) may be problematic if the clinical category selection is considered subjective and/or inconsistently applied across different clinicians. Even when these rating scale categories are precisely defined, the wide range of patient status captured in each category can make movement from one category to another difficult, leaving the anchor unable to detect potentially important treatment benefits.

Distribution-based methods

Distribution-based methods are another set of approaches for estimating the magnitude of meaningful PRO change scores using statistical parameters from the clinical trial population. Although there are a variety of different methods for interpretation based on the statistical distribution, all express a change score difference relative to some measure of variability.

The effect size (ES) is often employed to compare two or more groups to benchmark the magnitude of the group difference. Distribution-based ESs have the advantage of placing mean differences into a unit-less metric, thus allowing comparisons between different phenomena, as well as comparisons between groups in a treatment study. There are well-accepted standards for judging an ES: 0.2 is considered small, 0.5 is medium, and 0.8 or greater is large. These conventions were introduced by Jacob Cohen based on his experiences with education and psychological tests [34], and some empirical evidence exists in the health sciences supporting Cohen’s effect size convention [35]. Nonetheless, Cohen warned that these effect size standards should be used only “in the belief that more is to be gained than lost by supplying a common conventional frame of reference which is recommended for use only when no better basis for estimating the ES index is available” [34] (p. 25).

There are several different methods for calculating group ES using the mean change score differences in the numerator, and a variety of calculations of the standard deviation in the denominator [36]. Most commonly, ES is the ratio of the group differences in mean change scores to the baseline standard deviation score [37]. Norman et al. [38] noted that the MID for PRO measures was frequently very close to a half standard deviation, or an ES of 0.5. This observation was based on a systematic examination of 38 PRO studies, using different instruments and across different diseases [38].
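
A minimal sketch of the most common ES calculation described above, the difference in mean change scores divided by the baseline standard deviation, is given below. The data are hypothetical, pooling baseline scores across groups for the denominator is an assumption of this example, and the cut-points used to attach Cohen’s labels to intermediate values are one possible reading of the 0.2, 0.5, and 0.8 conventions.

```python
from statistics import mean, stdev


def effect_size(change_treatment, change_control, baseline_scores):
    """Difference in mean change scores divided by the baseline SD."""
    return (mean(change_treatment) - mean(change_control)) / stdev(baseline_scores)


def cohen_label(es):
    """Map |ES| onto Cohen's conventional benchmarks (assumed cut-points)."""
    magnitude = abs(es)
    if magnitude >= 0.8:
        return "large"
    if magnitude >= 0.5:
        return "medium"
    if magnitude >= 0.2:
        return "small"
    return "below small"


# Hypothetical change scores per group and pooled baseline scores.
change_treatment = [1.0, 2.0, 3.0]
change_control = [0.0, 1.0, 2.0]
baseline_scores = [2.0, 4.0, 6.0]

es = effect_size(change_treatment, change_control, baseline_scores)
print(f"ES = {es:.2f} ({cohen_label(es)})")
```

Other denominators (for example, the standard deviation of change scores) yield different indices, so the choice should be stated whenever an ES is reported.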

The standard error of measurement (SEM), a statistical parameter with origins in classical test theory, has also been a useful distribution-based method for gauging an important PRO change for an individual [39]. The SEM is calculated by multiplying the standard deviation by the square root of 1 minus the PRO’s reliability coefficient ($\mathrm{SEM} = \mathrm{SD}\sqrt{1 - r_{xx}}$) [39]. One SEM has demonstrated a strong correspondence to anchor-based individual change thresholds for many PRO measures [40, 41, 42, 43, 44, 45].
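
The SEM calculation can be illustrated with a short sketch. The function name and the input values (a standard deviation of 10 points and a reliability coefficient of 0.91) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability), following classical test theory."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    if sd < 0:
        raise ValueError("standard deviation cannot be negative")
    return sd * sqrt(1.0 - reliability)


# Hypothetical instrument: SD of 10 points, reliability coefficient of 0.91.
sem = standard_error_of_measurement(sd=10.0, reliability=0.91)
print(f"SEM is about {sem:.1f} points")
```

With these inputs the SEM is about 3 points, so a one-SEM criterion would treat an individual change of roughly 3 points as a candidate threshold, to be compared against anchor-based estimates.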

These distribution-based methods for PRO interpretation provide an alternative to anchor-based methods when an appropriate anchor is not available. Their limitations are closely related to the source of their usefulness; that is, because they are not derived using a relevant external criterion, interpretation must be based on prior conventional benchmarks (e.g., small, medium, or large ES, half standard deviation, or 1 SEM). Moreover, the final PRO Guidance states FDA’s view that the interpretation of PRO change be based on anchor-based methods, with distribution-based approaches playing only a supportive role [3].

Cumulative distribution function of responses

To avoid the controversies from the selection of any single point estimate as the best threshold for judging all patients’ changes over time, a cumulative distribution function of responses has been proposed in the final PRO Guidance as a potentially preferable method to depict the effect of treatment across the entire study population by showing all magnitudes of change and the proportion of individuals within a trial achieving each level [3]. Figure 1 displays the cumulative distribution function of responses from one trial used in the original registration of Aricept® (donepezil hydrochloride) 5 mg and 10 mg tablets shown in the product’s label [46]. This 2-dimensional graph includes separate curves for each treatment and placebo group, with the x-axis coordinates conveying PRO change from baseline and y-axis coordinates indicating the cumulative proportion of patients achieving these levels of change. With this diagram, the percentage of patients in each treatment group achieving a spectrum of change thresholds can be easily compared, eliminating the need for one specific responder definition to interpret the PRO changes in each group [3]. Due to differing viewpoints within the geriatric community on the best point estimate for the responder definition on the Alzheimer’s Disease Assessment Scale-Cognitive subscale (ADAS-Cog, a PRO used in the depicted trial), the chart in Fig. 1 denotes three different relevant change thresholds of −7, −4, and 0 (no change) with vertical lines, as well as a table within the figure of the percentage of responders in each group at each of these thresholds. It is important to note that a reduction or negative change score on the ADAS-Cog scale represents an improvement in cognition over time.
As shown, 14 % (placebo), 21 % (5 mg/day dose), and 36 % (10 mg/day dose) of trial participants achieved at least a 7-point reduction in their ADAS-Cog score, while 72, 83, and 87 %, respectively, had no worsening in this PRO score over the 24 weeks of this trial. Because this display of responses can be more comprehensive than denoting the treatment effect and the percentage of patients achieving the single responder definition, the FDA encourages study sponsors to use cumulative distribution function of responses as part of the evidence supporting the claim for a treatment benefit based on a PRO [3].
Fig. 1

Example of a cumulative distribution function of responses for Aricept® 5 and 10 mg doses compared to placebo [46]. Important change thresholds considered for ADAS-Cog score decreases over 24 weeks are 7, 4, and 0 points
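
The quantities plotted in a cumulative distribution function of responses can be computed as in the following sketch. The change scores are invented for illustration and are not the Aricept® trial data; the function names are hypothetical. As on the ADAS-Cog, negative change is assumed to mean improvement, so a patient meets a threshold when the change score is at or below it.

```python
def cumulative_proportion(change_scores, threshold):
    """Proportion of patients whose change score is at or below threshold."""
    if not change_scores:
        raise ValueError("change_scores must not be empty")
    return sum(c <= threshold for c in change_scores) / len(change_scores)


def responder_table(groups, thresholds):
    """Proportion meeting each threshold, for each study group."""
    return {
        name: {t: cumulative_proportion(scores, t) for t in thresholds}
        for name, scores in groups.items()
    }


# Hypothetical change-from-baseline scores (negative = improvement).
groups = {
    "placebo": [-8, -3, 0, 2, 5, 1, -1, 4, 3, -5],
    "treated": [-9, -7, -4, -2, 0, 1, -6, -8, 3, -5],
}
thresholds = [-7, -4, 0]

table = responder_table(groups, thresholds)
for name, row in table.items():
    cells = ", ".join(f"<= {t}: {p:.0%}" for t, p in row.items())
    print(f"{name}: {cells}")
```

Evaluating the same function over every observed change score, rather than three selected thresholds, gives the full curve for each group.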

Discussion

The December 2009 final FDA PRO Guidance represents a useful step forward in guiding the interpretation of PRO change scores beyond achieving statistical significance to support medical product labeling claims. This document has removed reference to the concept of the minimal important difference (MID) as a primary aid to interpretation of trial results. However, given the role that the MID concept has played in the development and use of PROs over the past two decades, the use of this acronym will certainly live on in discussions and the scientific literature when the interpretation of change over time in PROs is addressed. The reason for the disappearance of the MID term in the Guidance is presumably the manner in which the MID change threshold was being applied. That is, in the 2006 draft PRO Guidance, the MID was viewed as the minimum difference in mean change from baseline between treatment groups that can be interpreted as an important difference [19]. This is also often referred to as the between-group difference [47]. The inconsistency of using a single change threshold derived from data designed to estimate important individual changes to inform the required magnitude of difference in group changes will hopefully end with the elimination of the MID terminology in the final PRO Guidance. The FDA has named the responder definition as the appropriate individual or within-subject change threshold, while at the same time recognizing that selecting a specific level for a meaningful response to treatment can be quite subjective. Therefore, the final FDA PRO Guidance also recommends displaying PRO change results using a cumulative distribution curve of responses, where the percentage of patients in each study group achieving a spectrum of change thresholds can be easily compared [3].
Although a thorough discussion is outside the scope of this article, response shift [48, 49, 50] and the use of proxy respondents or proxy measurements [51, 52, 53] also threaten the PRO interpretation process described here, and their omission is a limitation of this article.
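
As an illustration of the cumulative distribution of responses discussed above, the following minimal Python sketch computes, for each study group, the percentage of patients whose improvement meets or exceeds each change threshold. The function name `percent_achieving`, the group labels, and all scores are hypothetical and invented for illustration; they are not data from the Aricept label [46]. Improvement is coded so that larger values are better (for a scale on which a decrease is an improvement, this would be baseline minus follow-up).

```python
from typing import Dict, List, Sequence


def percent_achieving(improvements: Sequence[float],
                      thresholds: Sequence[float]) -> List[float]:
    """Return, for each threshold, the percentage of patients whose
    improvement score is at least as large as that threshold."""
    n = len(improvements)
    if n == 0:
        raise ValueError("at least one patient is required")
    return [100.0 * sum(1 for x in improvements if x >= t) / n
            for t in thresholds]


# Hypothetical improvement scores (NOT trial data), 10 patients per group.
groups: Dict[str, List[float]] = {
    "placebo": [-3, -1, 0, 0, 1, 2, 4, 5, -2, 7],
    "treated": [0, 1, 2, 4, 4, 5, 7, 8, -1, 3],
}

# A few candidate change thresholds; a full curve would use every
# observed change level rather than three points.
thresholds = [0, 4, 7]

curves = {name: percent_achieving(scores, thresholds)
          for name, scores in groups.items()}

for name, percents in curves.items():
    for t, p in zip(thresholds, percents):
        print(f"{name}: {p:.0f}% improved by at least {t} points")
```

Evaluating the function at every observed change level, rather than at three thresholds, would give the full curve for each group, from which any candidate responder definition can be read off.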

The authors of this article support the modification in interpretation mentioned above. We agree that anchor-based methods for finding the responder definition best describe the estimated change of an individual experiencing a treatment benefit, not a group’s change over time. We also support the use of the cumulative response distribution curves so that the percentage of clinical trial subjects achieving all possible change levels can be easily displayed in a single diagram. However, we do not see the cumulative distribution functions as a replacement for the careful investigation of a PRO’s relevant responder threshold using anchor-based methods supported by distribution-based methods. Identifying the best estimate for the level where individual patients demonstrate a meaningful treatment benefit provides important information to patients, physicians, payors, and policy makers. As more studies are accumulated, triangulation across results may be needed to obtain the best estimate of the responder definition and to build stakeholder confidence in this threshold. The FDA’s 2010 request for cross-sectional patient global assessments of concept [4], rather than patient global assessments of change, as patient-reported anchors has challenged the long-time approach exemplified by Jaeschke et al. [5].

We also recognize the usefulness of identifying the minimum difference in mean change from baseline between treatment groups that can be interpreted as an important difference (MID). Clinical researchers and others in the medical product development process continue to turn to PRO specialists for this estimate to plan clinical trials, where the MID influences sample size calculations. Although the MID point estimate is no longer required for the interpretation of PRO results and incorporation into the FDA PRO dossier, it remains an important threshold that also deserves careful consideration in planning all clinical trials that include PRO assessments.
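
To show how a between-group MID can enter trial planning, the sketch below uses the standard normal-approximation formula for comparing two means with equal group sizes, n = 2(z₁₋α/₂ + z₁₋β)²σ²/Δ², where Δ is the between-group difference to be detected. This formula is an assumption chosen for illustration, not a method prescribed by the FDA PRO Guidance or by this article, and the function name `n_per_group` and the input values are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(mid: float, sd: float,
                alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate patients needed per group to detect a between-group
    difference of `mid` in mean change, assuming a common standard
    deviation `sd`, a two-sided test, and equal allocation."""
    if mid <= 0 or sd <= 0:
        raise ValueError("mid and sd must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for target power
    n = 2 * ((z_alpha + z_beta) * sd / mid) ** 2
    return math.ceil(n)


# Hypothetical inputs: an MID of 5 points on a scale whose change
# scores have a standard deviation of 10 points.
print(n_per_group(mid=5, sd=10))  # 63 patients per group
```

Because the required sample size scales with the inverse square of the MID, halving the MID roughly quadruples the number of patients needed, which is why the estimate deserves careful attention at the planning stage.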

Notes

Acknowledgments

Members of the Industry Advisory Committee (IAC), the Board of Directors of the International Society for Quality of Life Research (ISOQOL), and two anonymous reviewers offered valuable suggestions that were incorporated into this paper.

References

  1. Patient-Centered Outcomes Research Institute (PCORI). Available at: http://www.pcori.org/home.html.
  2. King, M. T. (2011). A point of minimal important difference (MID): A critique of terminology and methods. Expert Review of Pharmacoeconomics & Outcomes Research, 11(2), 171–184.
  3. Food and Drug Administration. (2009). Guidance for industry on patient-reported outcome measures: Use in medical product development to support labeling claims. Federal Register, 74(235), 65132–65133.
  4. Burke, L. B., & Trenacosti, A. M. (2010). Interpretation of PRO trial results to support FDA labelling claims: The regulator perspective. International Society for Pharmacoeconomics and Outcomes Research 15th Annual International Meeting. Atlanta, GA.
  5. Jaeschke, R., Singer, J., & Guyatt, G. H. (1989). Measurement of health status. Ascertaining the minimal clinically important difference. Controlled Clinical Trials, 10(4), 407–415.
  6. Guyatt, G. H., Berman, L. B., & Townsend, M. (1987). Long-term outcome after respiratory rehabilitation. Canadian Medical Association Journal, 137(12), 1089–1095.
  7. Guyatt, G. H., Townsend, M., Nogradi, S., Pugsley, S. O., Keller, J. L., & Newhouse, M. T. (1988). Acute response to bronchodilator. An imperfect guide for bronchodilator therapy in chronic airflow limitation. Archives of Internal Medicine, 148(9), 1949–1952.
  8. Guyatt, G. H., Sullivan, M. J., Fallen, E. L., Tihal, H., Rideout, E., Halcrow, S., et al. (1988). A controlled trial of digoxin in congestive heart failure. American Journal of Cardiology, 61(4), 371–375.
  9. Guyatt, G. H., Berman, L. B., Townsend, M., Pugsley, S. O., & Chambers, L. W. (1987). A measure of quality of life for clinical trials in chronic lung disease. Thorax, 42(10), 773–778.
  10. Guyatt, G. H., Nogradi, S., Halcrow, S., Singer, J., Sullivan, M. J., & Fallen, E. L. (1989). Development and testing of a new measure of health status for clinical trials in heart failure. Journal of General Internal Medicine, 4(2), 101–107.
  11. Juniper, E. F., Guyatt, G. H., Willan, A., & Griffith, L. E. (1994). Determining a minimal important change in a disease-specific Quality of Life Questionnaire. Journal of Clinical Epidemiology, 47(1), 81–87.
  12. Sloan, J. A., Cella, D., Frost, M., Guyatt, G. H., Sprangers, M., & Symonds, T. (2002). Assessing clinical significance in measuring oncology patient quality of life: Introduction to the symposium, content overview, and definition of terms. Mayo Clinic Proceedings, 77(4), 367–370.
  13. Guyatt, G. H., Osoba, D., Wu, A. W., Wyrwich, K. W., & Norman, G. R. (2002). Methods to explain the clinical significance of health status measures. Mayo Clinic Proceedings, 77(4), 371–383.
  14. Cella, D., Bullinger, M., Scott, C., & Barofsky, I. (2002). Group vs individual approaches to understanding the clinical significance of differences or changes in quality of life. Mayo Clinic Proceedings, 77(4), 384–392.
  15. Sloan, J. A., Aaronson, N., Cappelleri, J. C., Fairclough, D. L., & Varricchio, C. (2002). Assessing the clinical significance of single items relative to summated scores. Mayo Clinic Proceedings, 77(5), 479–487.
  16. Frost, M. H., Bonomi, A. E., Ferrans, C. E., Wong, G. Y., & Hays, R. D. (2002). Patient, clinician, and population perspectives on determining the clinical significance of quality-of-life scores. Mayo Clinic Proceedings, 77(5), 488–494.
  17. Sprangers, M. A., Moinpour, C. M., Moynihan, T. J., Patrick, D. L., & Revicki, D. A. (2002). Assessing meaningful change in quality of life over time: A users’ guide for clinicians. Mayo Clinic Proceedings, 77(6), 561–571.
  18. Symonds, T., Berzon, R., Marquis, P., & Rummans, T. A. (2002). The clinical significance of quality-of-life results: Practical considerations for specific audiences. Mayo Clinic Proceedings, 77(6), 572–583.
  19. Food and Drug Administration. (2006). Draft guidance for industry on patient-reported outcome measures: Use in medical product development to support labeling claims. Federal Register, 71(23), 5862–5863.
  20. Hamilton, M. (1967). Development of a rating scale for primary depressive illness. The British Journal of Social and Clinical Psychology, 6(4), 278–296.
  21. Revicki, D. A., Erickson, P. A., Sloan, J. A., Dueck, A., Guess, H., & Santanello, N. C. (2007). Interpreting and reporting results based on patient-reported outcomes. Value Health, 10(Suppl 2), S116–S124.
  22. Patrick, D. L., Burke, L. B., Powers, J. H., Scott, J. A., Rock, E. P., Dawisha, S., et al. (2007). Patient-reported outcomes to support medical product labeling claims: FDA perspective. Value Health, 10(Suppl 2), S125–S137.
  23. Lydick, E., & Epstein, R. S. (1993). Interpretation of quality of life changes. Quality of Life Research, 2(3), 221–226.
  24. Revicki, D., Hays, R. D., Cella, D., & Sloan, J. (2008). Recommended methods for determining responsiveness and minimally important differences for patient-reported outcomes. Journal of Clinical Epidemiology, 61(2), 102–109.
  25. Sloan, J. A., Frost, M. H., Berzon, R., Dueck, A., Guyatt, G., Moinpour, C., et al. (2006). The clinical significance of quality of life assessments in oncology: A summary for clinicians. Supportive Care in Cancer, 14(10), 988–998.
  26. Farrar, J. T., Young, J. P., Jr, LaMoreaux, L., Werth, J. L., & Poole, R. M. (2001). Clinical importance of changes in chronic pain intensity measured on an 11-point numerical pain rating scale. Pain, 94(2), 149–158.
  27. Norman, G. R., Stratford, P., & Regehr, G. (1997). Methodological problems in the retrospective computation of responsiveness to change: The lesson of Cronbach. Journal of Clinical Epidemiology, 50(8), 869–879.
  28. Walters, S. J., & Brazier, J. E. (2005). Comparison of the minimally important difference for two health state utility measures: EQ-5D and SF-6D. Quality of Life Research, 14(6), 1523–1532.
  29. Metz, S. M., Wyrwich, K. W., Babu, A. N., Kroenke, K., Tierney, W. M., & Wolinsky, F. D. (2007). Validity of patient-reported health-related quality of life global ratings of change using structural equation modeling. Quality of Life Research, 16(7), 1193–1202.
  30. Wyrwich, K., Harnam, N., Revicki, D. A., Locklear, J. C., Svedsater, H., & Endicott, J. (2009). Assessing health-related quality of life in generalized anxiety disorder using the Quality Of Life Enjoyment and Satisfaction Questionnaire. International Clinical Psychopharmacology, 24(6), 289–295.
  31. Brozek, J. L., Guyatt, G. H., & Schunemann, H. J. (2006). How a well-grounded minimal important difference can enhance transparency of labelling claims and improve interpretation of a patient reported outcome measure. Health and Quality of Life Outcomes, 4, 69.
  32. Kosinski, M., Zhao, S. Z., Dedhiya, S., Osterhaus, J. T., & Ware, J. E., Jr. (2000). Determining minimally important changes in generic and disease-specific health-related quality of life questionnaires in clinical trials of rheumatoid arthritis. Arthritis and Rheumatism, 43(7), 1478–1487.
  33. Eton, D. T., Cella, D., Yost, K. J., Yount, S. E., Peterman, A. H., Neuberg, D. S., et al. (2004). A combination of distribution- and anchor-based approaches determined minimally important differences (MIDs) for four endpoints in a breast cancer scale. Journal of Clinical Epidemiology, 57(9), 898–910.
  34. Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.
  35. Kazis, L. E., Anderson, J. J., & Meenan, R. F. (1989). Effect sizes for interpreting changes in health status. Medical Care, 27(3 Suppl), S178–S189.
  36. Norman, G. R., Wyrwich, K. W., & Patrick, D. L. (2007). The mathematical relationship among different forms of responsiveness coefficients. Quality of Life Research, 16(5), 815–822.
  37. Liang, M. H. (1995). Evaluating measurement responsiveness. Journal of Rheumatology, 22(6), 1191–1192.
  38. Norman, G. R., Sloan, J. A., & Wyrwich, K. W. (2003). Interpretation of changes in health-related quality of life: The remarkable universality of half a standard deviation. Medical Care, 41(5), 582–592.
  39. Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric Theory. New York: McGraw-Hill.
  40. Wyrwich, K. W., Tierney, W. M., & Wolinsky, F. D. (1999). Further evidence supporting an SEM-based criterion for identifying meaningful intra-individual changes in health-related quality of life. Journal of Clinical Epidemiology, 52(9), 861–873.
  41. Wyrwich, K. W. (2004). Minimal important difference thresholds and the standard error of measurement: Is there a connection? Journal of Biopharmaceutical Statistics, 14(1), 97–110.
  42. Wyrwich, K. W., Tierney, W. M., & Wolinsky, F. D. (2002). Using the standard error of measurement to identify important changes on the Asthma Quality of Life Questionnaire. Quality of Life Research, 11(1), 1–7.
  43. Cella, D., Eton, D. T., Fairclough, D. L., Bonomi, P., Heyes, A. E., Silberman, C., et al. (2002). What is a clinically meaningful change on the Functional Assessment of Cancer Therapy-Lung (FACT-L) Questionnaire? Results from Eastern Cooperative Oncology Group (ECOG) Study 5592. Journal of Clinical Epidemiology, 55(3), 285–295.
  44. Crosby, R. D., Kolotkin, R. L., & Williams, G. R. (2004). An integrated method to determine meaningful changes in health-related quality of life. Journal of Clinical Epidemiology, 57(11), 1153–1160.
  45. Yost, K. J., Cella, D., Chawla, A., Holmgren, E., Eton, D. T., Ayanian, J. Z., et al. (2005). Minimally important differences were estimated for the Functional Assessment of Cancer Therapy-Colorectal (FACT-C) instrument using a combination of distribution- and anchor-based approaches. Journal of Clinical Epidemiology, 58(12), 1241–1251.
  46. ARICEPT Oral Solution (Donepezil Hydrochloride) [approval label]. Available at: http://www.accessdata.fda.gov/drugsatfda_docs/label/2004/21719lbl.pdf.
  47. Copay, A. G., Subach, B. R., Glassman, S. D., Polly, D. W., Jr, & Schuler, T. C. (2007). Understanding the minimum clinically important difference: A review of concepts and methods. Spine Journal, 7(5), 541–546.
  48. Sprangers, M. A., & Schwartz, C. E. (1999). Integrating response shift into health-related quality of life research: A theoretical model. Social Science and Medicine, 48(11), 1507–1515.
  49. Rapkin, B. D., & Schwartz, C. E. (2004). Toward a theoretical model of quality-of-life appraisal: Implications of findings from studies of response shift. Health and Quality of Life Outcomes, 2, 14.
  50. Barclay-Goddard, R., Epstein, J. D., & Mayo, N. E. (2009). Response shift: A brief overview and proposed research priorities. Quality of Life Research, 18(3), 335–346.
  51. Sprangers, M. A., & Aaronson, N. K. (1992). The role of health care providers and significant others in evaluating the quality of life of patients with chronic disease: A review. Journal of Clinical Epidemiology, 45(7), 743–760.
  52. von Essen, L. (2004). Proxy ratings of patient quality of life–factors related to patient-proxy agreement. Acta Oncologica, 43(3), 229–234.
  53. van der Linden, F. A., Kragt, J. J., van Bon, M., Klein, M., Thompson, A. J., van der Ploeg, H. M., et al. (2008). Longitudinal proxy measurements in multiple sclerosis: Patient-proxy agreement on the impact of MS on daily life over a period of two years. BMC Neurol, 8, 2.

Copyright information

© Springer Science+Business Media B.V. 2012

Authors and Affiliations

  • K. W. Wyrwich (1)
  • J. M. Norquist (2)
  • W. R. Lenderking (1)
  • S. Acaster (3)
  • the Industry Advisory Committee of International Society for Quality of Life Research (ISOQOL)
  1. United BioSource Corporation, Bethesda, USA
  2. Merck Sharp & Dohme, Inc., North Wales, USA
  3. Oxford Outcomes Ltd, Oxford, UK
