Plain English Summary

When testing patient-reported outcome (PRO) measures, it is important that the measure produces the same score over time in respondents whose health is stable. A good PRO measure should give the same score when someone feels the same and a different score when they feel different. This property is assessed with a statistical analysis called “test-retest reliability”: when the same test is given again to a group of people who feel the same as they did before, it should produce the same scores. Sometimes it can be difficult to identify which people feel the same, especially in diseases or conditions where symptoms can change a lot. The research team looked at two different ways to identify whether people’s health has changed: a patient global impression of severity (PGIS), where patients report how they feel at that time, and a patient global impression of change (PGIC), where patients report whether they feel different from an earlier time. To see which measure (PGIS or PGIC) works better, the researchers used data that had already been collected in three earlier studies. They found that the PGIS does a better job of finding patients who feel the same each time and recommend using it when checking whether a PRO measure can produce consistent results over time. Even though the PGIS worked better, the researchers still suggest also using the PGIC when possible.

Introduction

Critical Path Institute (C-Path) established the Patient-Reported Outcome (PRO) Consortium in 2008 [1]. As part of supporting the qualification of PRO measures and other clinical outcome assessments (COAs), one of its objectives is to advance the science underpinning the fit-for-purpose assessment of clinical outcomes in trials.

Hence, while the PRO Consortium’s working groups are generating the evidence required for the qualification of their respective COAs, additional steps are being taken to gather data that can inform measurement and/or methodological questions. One such question is whether there is an empirical basis for using either static measures of current state (e.g., patient global impression of severity [PGIS]) or retrospective measures of change (e.g., patient global impression of change [PGIC]) to identify stable participants for evaluating the reproducibility of PRO measures.

FDA’s guidance for PRO measures [2] emphasizes the importance of demonstrating reproducibility of scores as one of the measurement properties of a fit-for-purpose PRO measure, commonly assessed by test–retest reliability. In addition, the FDA COA Qualification Program requires documentation of test–retest reliability to support qualification submissions [3]. The purpose of test–retest reliability is to assess the stability of scores over time, i.e., reproducibility. Because scores are expected to change when the participant’s underlying condition changes with an effective intervention, test–retest reliability must be evaluated in a group of stable participants in whom no change in condition is observed or reported. One approach is to evaluate test–retest reliability between 2 very close timepoints; another is to use data from the placebo group of a trial or from a non-interventional study where change is unlikely to occur. However, each approach has drawbacks. Too short a time interval is subject to “memory effects,” where participants remember their responses at timepoint 1 and simply repeat them at timepoint 2. In the placebo-group or non-interventional approach, participants may still experience change in their condition even without active treatment, especially in conditions where symptoms are variable. Similar concerns were raised by Reeve and colleagues [4], specifically for multi-item measures. However, in addition to its being required for FDA’s COA qualification process, we believe that assessing test–retest reliability provides value when conducted under appropriate conditions.

To overcome these drawbacks, selection of the time interval between test and retest should take into consideration symptom variability such that it is long enough to reduce memory effects but short enough that there are a sufficient number of participants whose condition remains stable. Furthermore, to ensure that only stable participants are included in the evaluation of test–retest reliability, anchor measures should be used to identify participants whose condition has not changed during that interval. Anchor measures should assess concepts that are the same or closely related to the concept of interest of the PRO measure in question. Often, patient global assessments are used, especially if there are no other suitable anchor measures.

Two widely used patient global assessments are variations of the PGIC and the PGIS. A PGIC is administered at the end of the test–retest period and asks respondents to rate their symptom severity now/today compared to the beginning of the test–retest period [5]. PGIC response options typically use a bi-directional response scale, e.g., 5 to 7 options ranging from “Much better” to “Much worse” with “No change” at the midpoint. Respondents selecting “No change” form the stable/unchanged subgroup used for test–retest analysis. In contrast, a PGIS asks respondents to rate their symptom or disease severity at a given time, and it is administered at multiple timepoints, including the beginning and the end of the test–retest period. Response options, in the form of a unidirectional verbal rating scale (VRS), often range from “None” to “Very severe.” The stable subgroup is then identified as those respondents providing the same severity rating at the timepoints used in the test–retest analysis. However, there is neither consensus nor empirical evidence on which anchor measure performs better in identifying a stable subgroup of participants for the purpose of assessing test–retest reliability. While others have examined the suitability of the PGIS and PGIC in other contexts, the authors are not aware of prior studies that compare these anchors specifically within the context of test–retest analysis.

To provide empirical evidence comparing the PGIC and PGIS for identifying a stable subgroup for test–retest analyses, we evaluated data from 3 different working groups within the PRO Consortium, each of which collected data using both anchors. The Depression Working Group, Non-Small Cell Lung Cancer (NSCLC) Working Group, and Asthma Working Group each conducted independent quantitative pilot studies of developmental versions of their novel PRO measures. The objective of this manuscript is to examine how the PGIS and PGIC performed in these 3 studies when used to identify stable subgroups for evaluating test–retest reliability, and to provide recommendations on their use in examining the stability of a PRO measure’s scores during psychometric evaluation.

Methods

This section briefly describes methods used to compare the PGIS and PGIC as part of the quantitative pilot studies conducted by each working group during the development of their respective PRO measures.

Depression study

The Symptoms of Major Depressive Disorder Scale (SMDDS) is a 16-item measure developed by the Depression Working Group for assessing symptoms of major depressive disorder (MDD) in adults using a 7-day recall period. Each item has a 5-level VRS of either “Not at all/A little bit/Moderately/Quite a bit/Extremely” or “Never/Rarely/Sometimes/Often/Always” depending on whether the item assesses intensity or frequency. Evidence supporting its unidimensionality has been published [6].

The quantitative pilot study included a non-randomized purposive sampling target of 200 participants with clinician-diagnosed MDD recruited from 12 clinical sites within the United States (US) [6]. Participants provided consent via a website and completed a series of web-based measures using a personal computer outside the clinic setting at 2 different timepoints. On Day 1, participants completed the SMDDS, a single-item PGIS, and demographic information. Between Days 7 and 10 (hereafter denoted as Day 8), participants completed the same set of measures as Day 1, plus an additional single-item PGIC.

The PGIS asked the following: “How would you rate your depression at this time?” with response options of: “Not depressed,” “Mildly depressed,” “Moderately depressed,” “Very depressed,” and “Extremely depressed.” The PGIC asked participants “Compared to seven days ago, would you describe your depression as…” with the following 7 response options: “Much better,” “Better,” “A little better,” “No change,” “A little worse,” “Worse,” and “Much worse.” In all 3 studies, positive PGIS change scores (1, 2, 3) indicate improvement, while negative PGIS change scores (−1, −2, −3) indicate worsening.

NSCLC study

The Non-Small Cell Lung Cancer Symptom Assessment Questionnaire (NSCLC-SAQ) is a 7-item assessment of symptom severity in adults with NSCLC that covers 5 domains (i.e., cough, pain, dyspnea, fatigue, and appetite) and has a 7-day recall period. Each item has a 5-level VRS ranging from either “No <symptom> at all” to “Very severe <symptom>” or from “Never” to “Always,” depending on whether the item assesses intensity or frequency. Evidence supporting its unidimensionality has been published [7].

The quantitative pilot study included a non-randomized purposive sampling target of 150 participants with clinician-diagnosed NSCLC recruited from 14 US clinical sites [7]. Participants provided consent at the initial clinic visit and completed measures on a touchscreen-enabled tablet computer at their clinic sites at Day 1 and again 7 to 10 days later.

On Day 1, each participant completed the NSCLC-SAQ, a single-item PGIS, and demographic information. On Day 8 (with visits accepted within a Day 7–10 window), participants returned to the clinic to complete the same measures as Day 1, plus a single-item PGIC.

The PGIS asked the following: “How would you rate your symptoms of your lung cancer at this time?” with response options of: “Not severe,” “Mildly severe,” “Moderately severe,” “Very severe,” and “Extremely severe.” The PGIC asked the following: “Compared to your first study visit, would you describe the symptoms of your lung cancer today as: Much better, Better, A little better, No change, A little worse, Worse, Much worse.”

Asthma study

The Asthma Daytime Symptom Diary (ADSD) and the Asthma Nighttime Symptom Diary (ANSD) are daily assessments of asthma symptom severity appropriate for use in adults and adolescents. The 6-item ADSD asks participants to rate each asthma symptom “at its worst since you got up this morning.” The 6-item ANSD includes the same symptoms with the timeframe “since you went to bed last night.” Evidence supporting their unidimensionality has been published [8].

The quantitative pilot study was a multi-center, observational study in which PRO data were collected directly from a sampling target of 200 people with asthma recruited from 11 US sites [8]. Participants consented and completed the ADSD and ANSD daily over 10 days on a provisioned smartphone.

Participants completed the PGIS twice on each study day, once in the morning and once in the evening. The PGIS asked the following: “Overall, please rate your asthma symptoms since you went to bed last night” (ANSD) and “Overall please rate your asthma symptoms since you got up this morning” (ADSD). The response scale was a 0 to 10 numeric rating scale (NRS) of 0 = “no symptoms” to 10 = “symptoms as bad as you can imagine.”

Participants were asked to complete the PGIC only on Day 10 after they had completed all other study measures. The PGIC asked the following: “Compared to seven days ago, would you describe your asthma symptoms today as: Much better, Better, A little better, No change, A little worse, Worse, Much worse.”

Test–retest reliability samples

In each case, we provide results for the primary analyses of participants who were designated as stable based on no change in status according to either the PGIC or the PGIS, as well as exploratory analyses of the full sample of study participants and of the subgroup who had no change on both the PGIS and PGIC. These exploratory analyses provide context for comparing the performance of the subgroups identified with the PGIS or PGIC against the unrestricted full sample as well as the most conservative sample. We hypothesized that the PGIS would perform better than the PGIC in these 3 studies because of concerns that recall bias in the retrospective PGIC would weaken the correspondence between change reported on the anchor and actual change on the measure itself. Two criteria were used to judge which approach performed better: first, whether the approach yielded an ICC of 0.70 or greater; and second, where both approaches exceeded this threshold, which yielded the higher ICC. We also hypothesized that the most conservative sample would yield the highest test–retest reliability, as we have more confidence that this subgroup is stable because no change was reported on either anchor. Finally, we hypothesized that the full sample would have the lowest test–retest reliability, as it includes participants who changed between the two timepoints.

For the depression and NSCLC studies, test–retest reliability analyses were conducted using Day 1 and Day 8 data, restricted to the subgroup of participants whose depression or NSCLC symptoms remained stable during the study period, defined as either the same PGIS response at Day 1 and Day 8 or a “No change” response on the Day 8 PGIC.
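To make this subgroup definition concrete, the following is a minimal sketch, under assumed column names and illustrative data, of how the PGIS-stable, PGIC-stable, and most conservative subgroups could be flagged in an analysis dataset; it is not the code used in the original studies.

```python
import pandas as pd

# Hypothetical analysis dataset: one row per participant, with PGIS ratings at
# Day 1 and Day 8 and the Day 8 PGIC response. Column names and values are
# illustrative only.
df = pd.DataFrame({
    "subject_id": [1, 2, 3, 4, 5],
    "pgis_day1": ["Moderately depressed", "Mildly depressed", "Very depressed",
                  "Mildly depressed", "Moderately depressed"],
    "pgis_day8": ["Moderately depressed", "Moderately depressed", "Very depressed",
                  "Mildly depressed", "Mildly depressed"],
    "pgic_day8": ["No change", "A little worse", "No change",
                  "A little better", "A little better"],
})

# Stable by PGIS: identical severity rating at Day 1 and Day 8.
stable_pgis = df["pgis_day1"] == df["pgis_day8"]

# Stable by PGIC: "No change" reported on the Day 8 PGIC.
stable_pgic = df["pgic_day8"] == "No change"

# The subgroups examined in the primary and exploratory analyses, plus the
# unrestricted full sample.
subgroups = {
    "full_sample": df,
    "pgis_no_change": df[stable_pgis],
    "pgic_no_change": df[stable_pgic],
    "both_no_change": df[stable_pgis & stable_pgic],  # most conservative definition
}

for name, subgroup in subgroups.items():
    print(name, len(subgroup))
```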

For the asthma study, test–retest reliability was evaluated among participants whose experience of asthma symptoms was considered “stable” between Day 3 and Day 10, defined as either:

  a. Participants completing the ADSD and ANSD at Day 3 and Day 10 with the same daytime PGIS response and the same nighttime PGIS response at both timepoints, or

  b. Participants completing the ADSD and ANSD at Day 3 and Day 10 reporting “No change” on the PGIC at Day 10.

Analyses

To explore within-participant concordance for each study, a cross-tabulation was conducted between responses to the PGIC and level of change on the PGIS.
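As an illustration of this concordance analysis, the sketch below builds such a cross-tabulation with pandas; the numeric coding of the PGIS levels, the sign convention for change, and the data are assumptions for the example only.

```python
import pandas as pd

# Hypothetical per-participant data, with the PGIS coded numerically
# (0 = "Not depressed" ... 4 = "Extremely depressed").
df = pd.DataFrame({
    "pgis_day1": [2, 1, 3, 1, 2, 0],
    "pgis_day8": [2, 2, 3, 1, 1, 1],
    "pgic_day8": ["No change", "A little worse", "No change",
                  "No change", "A little better", "Worse"],
})

# Positive change indicates improvement (lower severity at Day 8 than Day 1),
# matching the sign convention described for the PGIS change scores.
df["pgis_change"] = df["pgis_day1"] - df["pgis_day8"]

# Rows: level of change on the PGIS; columns: PGIC responses.
print(pd.crosstab(df["pgis_change"], df["pgic_day8"], margins=True))
```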

Test–retest reliability was assessed using a two-way mixed-effect analysis of variance model with interaction for the absolute agreement between single scores, which is the ICC model recommended for test–retest analyses by Qin et al. [9] based on Shrout and Fleiss [10] and McGraw and Wong [11]. ICCs range from 0 to 1, with an ICC ≥ 0.7 indicating good test–retest reliability [12]. ICCs were computed using SPSS [13] in the depression and NSCLC studies and using SAS version 9.4 [14] in the asthma study.
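For readers who wish to reproduce this type of analysis outside SPSS or SAS, the following is a minimal sketch of the absolute-agreement, single-score ICC (ICC(A,1) in McGraw and Wong’s notation) computed directly from the two-way ANOVA mean squares; the data and variable names are illustrative, and this is not the code used in the original studies.

```python
import numpy as np

def icc_a1(scores: np.ndarray) -> float:
    """ICC for absolute agreement between single scores (McGraw & Wong ICC(A,1)).

    `scores` is an n-by-k array: n participants (rows) by k timepoints
    (columns); for test-retest, k = 2. Missing values are not handled here.
    """
    n, k = scores.shape
    grand_mean = scores.mean()

    ss_rows = k * np.sum((scores.mean(axis=1) - grand_mean) ** 2)   # between participants
    ss_cols = n * np.sum((scores.mean(axis=0) - grand_mean) ** 2)   # between timepoints
    ss_total = np.sum((scores - grand_mean) ** 2)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Illustrative test and retest total scores for 6 hypothetical participants.
scores = np.array([[10, 11], [14, 13], [20, 21], [8, 8], [16, 18], [12, 12]], dtype=float)
print(f"ICC(A,1) = {icc_a1(scores):.2f}")
```

Under McGraw and Wong’s formulation, the point estimate for absolute agreement between single scores is computed the same way whether the timepoint effect is treated as random or fixed (mixed model); the model choice affects interpretation rather than the arithmetic.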

Results

Depression study

Sample sociodemographic characteristics

A total of 207 participants were enrolled, with 147 participants completing the retest measures. Table 1 shows test and retest participants’ sociodemographic characteristics, with minor differences between these groups. Retest participants were 45 years old on average, 70.7% female, 81% White, and 25% Hispanic/Latino. Almost all (95%) had a minimum high school education, and 59% were employed.

Table 1 Participants’ sociodemographic characteristics

Table 2 shows the cross-tabulation of participants’ responses to the PGIC (Day 8) against the level of change in the PGIS from Day 1 to Day 8. As indicated by the shading, 48 participants (33%) had no change on both assessments, and another 53 participants (36%) had no change on one measure and a 1-category change on the other. More participants had no change on the PGIS (n = 93, 64%) than on the PGIC (n = 74, 51%).

Table 2 Cross-tabulation of PGIS versus PGIC using SMDDS sample

Test–retest reliability

An analysis comparing the PGIS and PGIC was performed by examining the ICC values based on the 3 stable subgroup definitions and the full sample. As shown in Table 3, the ICC of the SMDDS score is higher for the PGIS “no change” subgroup (0.84) than for the PGIC “no change” subgroup (0.74), although both exceed the recommended threshold of 0.70. Comparisons against the full sample and the most conservative subgroup showed that the ICC for the PGIS subgroup was almost the same as that for the most conservative subgroup, while the ICC for the PGIC subgroup was slightly below that for the full sample. In this case, the hypothesis that the PGIS subgroup would have a higher ICC than the PGIC subgroup was confirmed, but the full sample and the most conservative subgroup did not perform as hypothesized. The ICCs were similar between the PGIC subgroup and the full sample, and between the PGIS subgroup and the most conservative subgroup.

Table 3 Comparison of PGIS vs. PGIC for SMDDS

NSCLC study

Sample sociodemographic characteristics

A total of 152 participants from 14 US sites were enrolled, with 148 participants completing the retest measures. Retest participants were 64 years old on average, 56% female, 87% White, and 5% Hispanic/Latino (Table 1). The majority (84%) had a minimum high school education, 50% were retired, and 17% were unable to work. No differences were noted between test and retest participants.

Table 4 shows the cross-tabulation of NSCLC participants’ responses to the PGIC (Day 8) against the level of change in the PGIS from Day 1 to Day 8. As indicated by the shading, 56 participants (38%) had no change on both assessments, with another 42 participants (28%) deviating by 1 point on either assessment.

Table 4 Cross-tabulation of PGIS versus PGIC using NSCLC-SAQ sample

Test–retest reliability

An analysis comparing the PGIS and PGIC using the NSCLC-SAQ sample was performed by examining the ICC values based on the 3 stable subgroup definitions and the full sample. As shown in Table 5, the ICC of the NSCLC-SAQ score is slightly higher for the PGIS “no change” subgroup (0.87) than the PGIC “no change” subgroup (0.85), although both exceed the recommended threshold of 0.70. Comparisons against the full sample and the most conservative subgroup showed that the ICC values were as hypothesized. The most conservative sample had the highest ICC, followed by PGIS and PGIC subgroups; the full sample had the lowest ICC.

Table 5 Comparison of PGIS vs. PGIC for NSCLC-SAQ

Asthma study

Sample sociodemographic characteristics

A total of 219 participants were recruited from 13 US sites, with 170 completing the ADSD on both assessment days, 180 completing the ANSD on both assessment days, 168 completing the nighttime PGIS on both assessment days, 166 completing the daytime PGIS on both assessment days, and 180 completing the PGIC on Day 10. Test and retest participants’ sociodemographic characteristics are shown in Table 1, with only minor differences found. Participants were 25 years old on average, 54% female, 41% White, and 35% Hispanic/Latino. Adolescent participants were most commonly in 7th (22.9%) or 8th grade (23.8%). The majority of adults (89%) had a minimum high school education, and 67% were working full-time (Table 1).

Table 6 shows the cross-tabulation of participants’ responses to the PGIC (Day 10) against the level of change in the PGIS between Day 3 and Day 10 for the ANSD. As indicated by the shading, only 14 participants (9.5%) had no change on both assessments, with another 38 participants (25.7%) deviating by 1 point on either assessment. A similar analysis of the PGIS and PGIC for the ADSD indicated that 25 participants (15.1%) had no change on both assessments, with another 28 participants (17.0%) deviating by 1 point on either assessment (Table 7).

Table 6 Cross-tabulation of PGIS change between Day 3 and Day 10 versus PGIC ratings for the ANSD
Table 7 Cross-tabulation of PGIS change between Day 3 and Day 10 versus PGIC ratings for the ADSD

Test–retest reliability

An analysis comparing the PGIS and PGIC using the ANSD and ADSD samples was performed by examining the ICC values based on the 3 stable subgroup definitions and the full sample. Scores for both the ANSD (Table 8) and ADSD (Table 9) demonstrated “good” test–retest reliability (ICC = 0.95 and 0.86, respectively) when participants were defined as “stable” between Day 3 and Day 10 according to PGIS ratings. When participants were defined as “stable” according to PGIC ratings, lower ICCs indicative of “moderate” test–retest reliability (ICC = 0.53 and 0.68, respectively) were observed. For the ANSD, the full-sample ICC was higher than that of the PGIC subgroup, which was not as hypothesized, while for the ADSD, the PGIC subgroup had a higher ICC than the full sample (all under 0.70). The PGIS subgroup ICC was almost as high as that of the most conservative subgroup for the ANSD and slightly higher than that of the full sample for the ADSD (all greater than 0.8).

Table 8 Comparison of PGIS vs. PGIC for ANSD
Table 9 Comparison of PGIS vs. PGIC for ADSD

Discussion

Results from all 3 case examples indicate that acceptable ICCs were obtained using the PGIS to identify a stable subgroup for evaluating the test–retest reliability of a PRO measure. The PGIS performed better in the depression and asthma studies, while the PGIS and PGIC performed similarly in the NSCLC study. Only the PGIC in the asthma study resulted in an ICC below 0.70. These results suggest that when symptoms are generally stable across time, both PGIS- and PGIC-based subgroups may yield similar ICC values, whereas when symptoms fluctuate from day to day, the PGIS is more likely to identify stable participants and yield higher ICCs. This observation is also supported by the ICC results using the full sample and the most conservative subgroup. While the ICCs of the most conservative subgroups were always higher than those of the full samples, as hypothesized, the difference was greatest for the asthma study, where symptom severity was more likely to vary from day to day. Although the estimated test–retest reliability for the full sample in the asthma study appeared numerically higher than that for the PGIC subgroups, which was not as hypothesized, the important message is that neither was within the commonly accepted range, whereas the test–retest reliability based on the PGIS subgroups and the most conservative subgroups was.

Overall, these comparisons show that both the PGIS and PGIC were able to identify a stable subgroup for test–retest analyses in 2 of the 3 cases, and researchers may want to include both anchors to generate sufficient evidence to support the stability of the subgroup used for analysis. Researchers may even consider using the most conservative subgroup to calculate test–retest reliability (i.e., participants identified as stable based on both the PGIS and PGIC). However, such a stringent definition may result in a sample size too small for a robust test–retest analysis, as was the case for the ANSD and ADSD (n = 14 and 25, respectively). In the 3 studies examined in this paper, it appears that using the PGIS alone is sufficient to yield results similar to those of the most conservative sample.

The results observed in the 3 studies may also be affected by the type of global rating scales used. Each study used a similar bi-directional response scale for the PGIC, but the wording of the PGIS rating scale varied. Both the depression and NSCLC studies employed a 5-level VRS, whereas the asthma study employed a 0 to 10 NRS. It is possible that the differences in ICCs among the 3 studies are due to the interaction of the symptom characteristics and the types of rating scales used for the global anchors. The additional response options of the PGIS in the asthma study may have created more variability in responses, and subsequently more noise in the results, than the 5-level VRS. We recommend that researchers, when cognitively evaluating a newly developed or existing measure, also consider including the proposed anchor measures, if feasible, to ensure that they are optimally worded and meaningful to participants. At a minimum, we recommend that the anchor measures assess the same or sufficiently similar concepts and have the same recall period as the PRO measure in question.

In the studies reported above, an observational design was used. However, even in an observational design it may be challenging to identify stable participants, because medication use is not controlled. While more than half the participants in the depression and NSCLC studies were considered stable, only a third of the participants in the asthma study were considered stable. This may also depend on the variability of the symptoms included in the measure. Researchers should take this into account when considering the sample size for their studies.

There are a number of limitations associated with our study. This article presents results of secondary analyses of data initially collected for another purpose, and therefore the studies may be underpowered for the comparisons made here. Daily mean scores on the ADSD/ANSD were compared at the test and retest timepoints, rather than weekly mean scores for each measure. With the high symptom variability in asthma, where symptoms may fluctuate from day to day, even participants whose asthma is stable overall may have different symptom scores on days a week apart. The asthma study used a PGIC with a 7-day recall period, while the scores being compared were the Day 3 and Day 10 scores, which deviates from the recommendation to align the recall period of the anchor measure with that of the measure in question. The asthma study also included both adolescents and adults, who may approach global ratings differently, resulting in more noise in the data. In addition, because participants were recruited from clinical sites, they could not be readily replaced if they failed to complete their retest assessment, and we lack information on why they dropped out and whether dropout was random; therefore, we have no way to evaluate whether dropouts affected the ICC results. The PGIS and PGIC wording and scale formats were inconsistent across the 3 studies reported here, which limits our ability to compare their performance. Finally, this study did not assess test–retest reliability using timepoints more than 7 days apart. There are situations where test–retest analysis timepoints may be separated by fewer or more than 7 days; how these anchors perform over different time intervals needs to be studied further.

For researchers planning to assess test–retest reliability in observational studies, we offer the following recommendations:

  1. Include both the PGIS and PGIC as anchors. While the PGIS performed better in our studies overall, the PGIC can be used to identify a complementary stable subgroup for confirming test–retest reliability. In addition, researchers should also consider the use of other measures, including clinician assessments, as anchors. When multiple anchors are being used, researchers should pre-specify which measure will be used for the primary analysis of test–retest reliability, or how they plan to triangulate results from multiple measures.

  2. Consider the variability of the individual symptoms being assessed with the PRO measure. Some symptoms may be relatively stable from day to day, while others may vary a great deal. This may be an important consideration when deciding whether to assess health status on a single day or over several days.

We also offer the following recommendations, not derived directly from the results of this study, regarding the design and evaluation of anchor measures for use in future research.

  1. Researchers should be thoughtful in the design of their anchor measure to ensure that it assesses the same or a sufficiently similar concept as the PRO measure in question. Researchers should also consider the recall period being used (e.g., current state as compared to status on a specific day or over a period of time).

  2. Researchers should consider including evaluation of the anchor measures as part of the cognitive interview phase of new PRO measure development, or when conducting qualitative research with existing measures, if feasible. This will help to ensure the relevance and comprehension of the anchor measures.

Considerations regarding anchor measure selection and implementing test–retest analyses in interventional studies are provided in the online Supplementary Information.

Conclusions

Three PRO Consortium working groups employed both a retrospective assessment of change (the PGIC) and a “current state” assessment of disease or symptom severity (the PGIS) to identify a stable subgroup in which to assess the test–retest reliability of their PRO measures in development. The PGIS performed better than the PGIC in the depression and asthma studies, and both anchors performed similarly in the NSCLC study. These results provide empirical evidence about the use of current state and retrospective anchor measures within the context of assessing test–retest reliability. In addition, we have provided recommendations for consideration when including these and other anchor measures in the evaluation of test–retest reliability.