Identifying meaningful change on PROMIS short forms in cancer patients: a comparison of item response theory and classic test theory frameworks

Background: This study compares classical test theory (CTT) and item response theory (IRT) frameworks for determining reliable change. Establishing reliable change, followed by anchoring to a categorically distinct change in responses on a criterion measure, is a useful method for detecting meaningful change on a target measure.

Methods: Adult cancer patients were recruited from five cancer centers, and assessments were administered at baseline and at 6 weeks. We investigated short forms derived from PROMIS® item banks on anxiety, depression, fatigue, pain intensity, pain interference, and sleep disturbance. We detected reliable change using the reliable change index (RCI) and derived the T-scores corresponding to the RCI calculated under the IRT and CTT frameworks. For changes that were reliable, meaningful change was identified by a patient-reported change of at least one level on PRO-CTCAE. For both the CTT and IRT approaches, we applied one-sided tests to detect reliable improvement or worsening using the RCI. We compared the percentages of patients with reliable change and with reliable and meaningful change.

Results: The amount of change in T score corresponding to an RCI_CTT of 1.65 ranged from 5.1 to 9.2 depending on the domain. The amount of change corresponding to an RCI_IRT of 1.65 varied across the score range, and the minimum change ranged from 3.0 to 8.2 depending on the domain. Across domains, RCI_CTT and RCI_IRT classified 80% to 98% of patients consistently. When they disagreed, RCI_IRT tended to identify more patients as having reliably changed than RCI_CTT if scores at both timepoints were in the range of 43 to 78 in anxiety, 45 to 70 in depression, 38 to 80 in fatigue, 35 to 78 in sleep disturbance, and 48 to 74 in pain interference, owing to the smaller standard errors in these ranges under the IRT method. The CTT method found more changes than IRT for the pain intensity domain, which was shorter in length. Using RCI_CTT, 22% to 66% of patients had reliable change in either direction depending on the domain, and among these patients, 62% to 83% had meaningful change. Using RCI_IRT, 37% to 68% had reliable change in either direction, and among these patients, 62% to 81% had meaningful change.

Conclusion: Applying the two-step criteria demonstrated in this study, we determined how much change is needed to declare reliable change at different levels of baseline scores. We offer reference values for the percentage of patients who meaningfully change for investigators using the PROMIS instruments in oncology.

Supplementary Information: The online version contains supplementary material available at 10.1007/s11136-022-03255-3.

system enables routine collection and analysis of patient-reported outcomes (PROs) on a real-time basis. Identifying patients who had reliable changes in their scores can lead to more valid interpretation and communication of the change, and to opportunities for clinical action. Change can be studied at the population level or the individual level: population-level change focuses on population parameters (e.g., means) over time, while individual-level change focuses on change in an individual's score over time [1]. It has been suggested that 'responders' to treatment be identified based on the significance of individual change, using indices such as the reliable change index (RCI), rather than on group-level change [2].
Clinically meaningful change can be discussed only after we determine that an observed change is reliable [3]. It would not be logical to speak of whether a change is clinically meaningful unless we can be confident that change has occurred [4]. The RCI is a method for determining the statistical significance of an observed change in a single patient, expressed as a ratio of the individual's observed change to nuisance effects. The RCI indicates whether an individual's observed change "reflects more than the fluctuations of an imprecise measuring instrument" [5]. Of note, the RCI has traditionally indexed within-person change using between-person statistics, which has drawn some attention as to whether within-individual statistics should be used instead. Though the denominator of the RCI contains group-level statistics, the numerator contains only information about the individual. In this way, the RCI indexes the raw change in the individual against the variability of the group and the reliability of the measure, offering an advantage over a raw change score alone. In addition, the meaningfulness of clinical data such as a laboratory value is not established using within-individual statistics, because such standards would not be accurate at the individual level. Rather, establishing clinical thresholds often relies on distributional data, such as probabilities from grouped data or between-subject analyses with defined clinical groups. Similar group-level methods have been applied to PROs to set thresholds for interpreting individual-level PRO data [6], which can lead to increased use of PROs in clinical care.
Recently, there have been efforts to study the RCI using IRT statistics. These studies noted that IRT-based methods consider the difference in measurement precision across the scale, whereas the classical RCI uses a fixed standard error of measurement (SEM) [7; 8]. Mellenbergh [1] stated that the IRT methods may be more 'person-oriented' than the observed score method based on classical test theory (CTT). In parametric IRT, the concept of reliability is replaced by test information [9]. Items with larger discrimination parameters provide more information at the location indicated by the item location parameter. Test information is the sum of the item information, and the standard error of estimation in IRT is inversely related to the test information function, which means we may have different standard errors depending on how discriminating the set of items is across the range of the attribute being measured. Prior literature focused on describing the differences in identification rates between CTT-based and IRT-based RCI statistics. Brower et al. [7] and Jones et al. [10] showed that most people were classified consistently by the IRT-based RCI (RCI_IRT) and the CTT-based RCI (RCI_CTT). Jabrayilov et al. [11] reported that IRT detects more people as having changed compared to CTT, provided that tests contain many items (e.g., 20 items), whereas for shorter tests (e.g., 5 items), CTT detects change better than IRT. Using the Patient-Reported Outcomes Measurement Information System-29 (PROMIS-29) physical and emotional distress scales, Hays, Spritzer, and Reise [8] found that CTT identified more people as having changed compared to IRT.
The motivation for using IRT-based computer adaptive testing (CAT) is accurate estimation of true scores with items tailored to patients. The accuracy and efficiency result from using the items with maximum information where patients' true scores lie; in this way, patients can be administered a smaller number of items. Because IRT-based assessments rely on item/test information, it is natural to use the standard error of estimation from IRT to assess the reliability of change. The current study investigates the effect of IRT-based and CTT-based standard errors on the identification of reliable change. In addition, we identify patients who had meaningful change in several PROMIS scales using the categorical information provided by single-item measures that directly communicate patients' categorization of their own symptoms (e.g., PRO-CTCAE: the PRO version of the Common Terminology Criteria for Adverse Events).
Unlike CTT, IRT methods do not require pretest and posttest measurements to be based on the same items, as long as all items are calibrated on the same scale [11]. However, to compare CTT and IRT fairly, the same items from the PROMIS short forms were used at baseline and follow-up. One can fit IRT models to PRO measures that were not built with IRT, but this requires additional investigation of model assumptions and model fit. In this study, we used PROMIS because its robust IRT parameters, based on a large sample representative of the U.S. census, allow us to focus on the primary objective of the study: the comparison of classifications between CTT and IRT.
Our research questions are as follows: (1) What percentages of patients had reliable improvement and worsening based on CTT and IRT using PROMIS short forms? (2) What is the magnitude of the reliable change scores based on CTT and IRT methods on PROMIS short forms?

Measures
We investigated six version 1.0 short forms derived from PROMIS item banks: Anxiety 8a, Depression 8a, Fatigue 7a with two additional items from Fatigue 8a, Sleep Disturbance 8a, Pain Intensity 3a, and Pain Interference 8a. All scales had eight to nine items except pain intensity, which had three items; this let us examine the effect of test length on differences in classification between CTT and IRT. The PROMIS measures are scored on a T score metric in which 50 is the mean of a general US adult reference population and 10 is the standard deviation (SD) of that reference population. PRO-CTCAE items provide categorical information on patient symptom levels and are used in this study for evaluating meaningful change. PRO-CTCAE mirrors clinician adverse event reporting (CTCAE), was developed with patient and clinician input [12; 13], and has validity and reliability evidence in cancer samples [14; 15]. PRO-CTCAE has also been used for clinical decision support, in which the extreme response categories of "severe"/"very severe", "quite a bit"/"very much", or "frequently"/"almost always" trigger a nurse alert [16]. PRO-CTCAE items were available for all six domains, and each had five response options (e.g., none, mild, moderate, severe, and very severe). Each CTCAE grade can inform clinical actions (e.g., for dehydration [17], grade 1 = increased oral fluids indicated, grade 2 = IV fluids indicated, grade 3 = hospitalization indicated, grade 4 = urgent intervention indicated). Based on this, and on a study that found each ordinal response choice in PRO-CTCAE distinguished respondents with meaningfully different symptom experiences, any 1-level change in PRO-CTCAE was considered meaningful in this study.
Because the response options used in Anxiety 8a and Depression 8a were "never", "rarely", "sometimes", "often", and "always", we used the corresponding PRO-CTCAE frequency items ("In the past 7 days, how often did you feel anxiety?" and "In the past 7 days, how often did you have sad or unhappy feelings?"). Although Fatigue 7a + 2 had frequency-based response options, PRO-CTCAE does not have a frequency item for fatigue, so we used the PRO-CTCAE fatigue severity item instead ("In the past 7 days, what was the severity of your fatigue, tiredness, or lack of energy at its worst?"). PROMIS Sleep Disturbance 8a asked about sleep quality, with higher response options indicating greater disturbance, so we used the PRO-CTCAE severity item ("In the past 7 days, what was the severity of your insomnia including difficulty falling asleep, staying asleep, or waking up early at its worst?"). We used the PRO-CTCAE severity and interference items for the corresponding PROMIS Pain Intensity 3a and Pain Interference 8a scales.

Defining reliable change
In the literature on clinically important patient-level changes, the RCI has served as a cutoff for individual-level change, indicating whether the observed change was of sufficient magnitude to exceed the margin of measurement error. The RCI lets us test the null hypothesis that there was no change between measurements. We used a one-tailed test, in which an RCI value exceeding |1.65| indicates that reliable improvement or deterioration has occurred. For the CTT approach, the RCI_CTT is calculated as

RCI_CTT = (x_2 − x_1) / √(SEM_1² + SEM_2²)

where x_2 is the posttest score, x_1 is the pretest score, SEM_2 is the SEM of the posttest score, and SEM_1 is the SEM of the pretest score. For the pretest and posttest scores, we used the T scores rather than sum scores to reflect the actual reporting metric. The SEM is calculated as the SD of either the pretest or posttest score multiplied by the square root of one minus the reliability of the PROMIS short form. In alternative formulations, the denominator can also be computed from the SEM of the pretest score only (i.e., √(2 × SEM_1²)), but we used the standard errors at both baseline and follow-up, which does not assume equality of pre- and posttest variances [5; 18; 19], to allow a fair comparison with the RCI_IRT.
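As a concrete sketch, the CTT computation above can be expressed in a few lines of code. This is an illustration only: the function names are ours, and the reliability and SD values below are placeholders, not the study's estimates (the study used ω_h reliabilities with PROMIS T scores).

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

def rci_ctt(t1, t2, sem1, sem2):
    """CTT reliable change index: observed T-score change divided by
    the standard error of the difference score."""
    return (t2 - t1) / math.sqrt(sem1**2 + sem2**2)

# Placeholder values for illustration (not the study's estimates):
# reliability 0.90 and SD 10 give SEM = 10 * sqrt(0.1) ≈ 3.16.
sem1 = sem(10.0, 0.90)
sem2 = sem(10.0, 0.90)
rci = rci_ctt(t1=50.0, t2=58.0, sem1=sem1, sem2=sem2)
print(round(rci, 2))  # 1.79 -> exceeds 1.65, so reliable worsening
```

With equal SEMs at both timepoints, the denominator reduces to √2 × SEM, matching the alternative formulation noted above.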
For reliability, we used McDonald's hierarchical coefficient omega (ω_h) as an estimate of the general factor saturation of a test [20], using the R package 'psych' [21]. This conceptualization of reliability (i.e., the proportion of variance in the scale scores accounted for by a general factor) is consistent with the unidimensional IRT model that we used for computing RCI_IRT. Zinbarg and others [22] compared McDonald's ω_h to Cronbach's α and concluded that ω_h is a better estimate, because Cronbach's α reflects not only general factor saturation but also group factor saturation and even variability in factor loadings. Note that for a truly unidimensional test, Cronbach's α will be very close in value to ω_h. We extracted three group factors in addition to the general factor when estimating ω_h. For pain intensity, we extracted only a general factor, because there were only three items. Although we used ω_h to compute RCI_CTT, we also report Cronbach's α in the descriptive statistics for comparison.
IRT provides a statistic, the standard error of estimation or SE(θ̂), that varies conditionally on trait level and is inversely related to the amount of information provided by an instrument. The magnitude of the standard error depends on the person's location and on whether items are close to this location on the latent continuum. Thus, this standard error depends on the item parameters and the number of items the person has been administered. The RCI in the context of IRT [7; 11] can be defined as

RCI_IRT = (x_2 − x_1) / √(SE(x_1)² + SE(x_2)²)

where x_2 is the posttest score and x_1 is the pretest score on the T-score metric, and SE(x_1) and SE(x_2) are the standard errors of estimation at baseline and follow-up, respectively, multiplied by 10 to match the T-score metric. Standard errors were estimated using expected a posteriori (EAP) estimation. EAP scoring is employed to estimate scores for the vast majority of PROMIS measures because of its attractive properties in the context of computer adaptive testing, especially around test termination [23]. Because we used T scores that are based on IRT person estimates in both RCI_CTT and RCI_IRT, the only difference between the decisions comes down to the different methods of computing standard errors. Because we subtract the pretest score from the posttest score, and higher scores indicate more severe symptoms, a positive RCI value indicates change in the direction of getting worse and a negative RCI value indicates change in the direction of getting better.
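A minimal sketch of the IRT-based computation, with the one-sided decision rule at |RCI| > 1.65 applied to the result. The standard-error values are hypothetical person-level SEs already on the T-score metric (θ̂ SEs times 10), not values from the study's data.

```python
import math

def rci_irt(t1, t2, se1, se2):
    """IRT reliable change index: T-score change divided by the combined
    standard errors of estimation (theta SEs scaled by 10 to the
    T-score metric)."""
    return (t2 - t1) / math.sqrt(se1**2 + se2**2)

def classify(rci, critical=1.65):
    """One-sided classification: worsened if RCI > critical,
    improved if RCI < -critical, otherwise stable."""
    if rci > critical:
        return "worsened"
    if rci < -critical:
        return "improved"
    return "stable"

# Hypothetical patient: theta SEs of 0.25 and 0.30 -> 2.5 and 3.0 on T metric.
rci = rci_irt(t1=55.0, t2=47.0, se1=2.5, se2=3.0)
print(classify(rci))  # improved (rci ≈ -2.05)
```

Because the SEs enter per timepoint, the same 8-point drop could be classified as stable for a patient whose scores fall in a region of the scale with little test information.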
We used one-tailed tests of significance in the current study. We felt that worsening and improvement need to be examined separately, without assuming that the percentages identified as reliably changed are equal between improvement and worsening. For both RCI_CTT and RCI_IRT, an RCI value greater than |1.65| classifies a patient as getting either better or worse.

Identifying patients who had reliable change and meaningful change
The degree of agreement in classifying patients as having experienced change between CTT and IRT methods was expressed in terms of sample sizes and percentages. Among the patients who had reliable change, we further identified patients with meaningful change.
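The two-step rule (first reliable change by RCI, then meaningful change by a shift of at least one PRO-CTCAE level) can be sketched as follows. The function and its names are illustrative; requiring the PRO-CTCAE change to be in the same direction as the PROMIS change is our reading of the procedure, with higher values indicating worse symptoms on both measures.

```python
def two_step_change(rci, proctcae_change, critical=1.65):
    """Step 1: reliable change if |RCI| exceeds the critical value.
    Step 2: meaningful change if PRO-CTCAE moved >= 1 level in the
    same direction (positive = worse on both metrics here)."""
    if rci > critical:
        reliable = "worsened"
    elif rci < -critical:
        reliable = "improved"
    else:
        return "no reliable change"
    if reliable == "worsened":
        meaningful = proctcae_change >= 1
    else:
        meaningful = proctcae_change <= -1
    return f"reliable {reliable}" + (", meaningful" if meaningful else ", not meaningful")

print(two_step_change(2.1, 1))    # reliable worsened, meaningful
print(two_step_change(2.1, 0))    # reliable worsened, not meaningful
print(two_step_change(-0.8, -2))  # no reliable change
```

Note that meaningfulness is only evaluated among patients who cleared the reliability step, mirroring the ordering of the two criteria in the study.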

Sample
For each of the scales, we analyzed the scores of adult patients recruited from five cancer centers who had complete responses for both baseline and follow-up data. There were originally 1,859 patients, and after selecting patients with complete responses for both baseline and follow-up assessments, the sample sizes for the scales ranged from 1,089 to 1,162 (Table 1). The demographic information on the full sample at baseline (n = 1,859) has been previously described [24]. There were 1,253 patients who had PROMIS change scores available in any of the six scales.

For reliability estimates, ω_h and coefficient α were greater than 0.80 across scales. In addition, the ω_h and coefficient α values were similar, suggesting that the scales could be mostly explained by their respective general factors (Table 2).

Table 3 shows the numbers and percentages of patients classified as the same, worse, or better by the CTT and IRT approaches. Across the six domains, we found that the CTT and IRT approaches to estimating reliable change agreed on the classifications of change the majority of the time (80% to 98%). When there were disagreements, RCI_IRT tended to identify more patients as having changed in their symptoms while RCI_CTT suggested that the patients had no reliable change; sometimes, RCI_CTT detected changes that RCI_IRT categorized as stable. How can we explain these disagreements? Figure 1 shows how identification rates are related to the standard errors. Typically, for T scores greater than 45, the measurement error in the IRT approach is consistently lower than in the CTT approach. In most cases, if a person was categorized as worsened (or improved) by the CTT approach, they were necessarily classified as worsened (or improved) by the IRT approach. There were some instances in which a patient was classified as stable according to the IRT approach but as worsening/improving by the CTT approach. When this happened, scores at baseline or follow-up were in the range where 1.65 × SE(θ̂) exceeded 1.65 × SEM. For example, for depression, there were 86 individuals (red dots in Fig. 1) who were classified as stable by IRT but as changing by CTT. These patients had either their baseline or follow-up scores lower than 40, and their scores at both time points were lower than 50. Figure 2 shows the relationship between the scores and the denominator of RCI_IRT. Table 4 presents the amount of change in T scores corresponding to an RCI value of 1.65 using the two methods.

Magnitude of the reliable change scores based on CTT and IRT
The amount of change in T scores corresponding to an RCI_CTT of 1.65 was 7.03 in anxiety, 7.08 in depression, 7.95 in fatigue, 9.22 in sleep disturbance, 7.40 in pain intensity, and 5.14 in pain interference; these values are constant across the score range, by definition under CTT. The amount of change in T scores for an RCI_IRT of 1.65 varied across the score range, as expected under IRT, with many of the RCI_IRT thresholds falling below those for RCI_CTT, particularly around and above the center of the T score distribution. For pain intensity, on the other hand, the IRT standard errors tended to be greater than the SEM across the score range, so patients had to change by a larger amount to be classified as having changed according to RCI_IRT. This is shown by the red dots (reliable change identified by CTT only) scattered across the score range for pain intensity in Fig. 1. These threshold values are presented in Table 4.
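Because the critical value and the standard errors fully determine these thresholds, the minimum detectable T-score change is simply 1.65 × √(SE₁² + SE₂²). A small sketch, using placeholder standard errors rather than the study's domain-specific values:

```python
import math

def min_detectable_change(se1, se2, critical=1.65):
    """Smallest T-score change declared reliable at the given critical
    value: critical * sqrt(se1^2 + se2^2)."""
    return critical * math.sqrt(se1**2 + se2**2)

# Under CTT the SEM is constant, so the threshold is one number per scale.
print(round(min_detectable_change(3.08, 3.08), 2))  # 7.19
# Under IRT the SEs vary with trait level, so the threshold varies too;
# smaller SEs near the center of the scale yield a smaller threshold.
print(round(min_detectable_change(2.0, 2.4), 2))    # 5.15
```

This makes explicit why the IRT thresholds form a curve over the score range while the CTT thresholds are flat lines.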

Identifying patients who had reliable change and meaningful change based on change in PRO-CTCAE categorical responses
Anxiety: There were 502 patients who deteriorated in T scores (T score change > 0) […]

Pain Intensity: Among 433 patients who deteriorated in T scores, 252 (58%) had reliable worsening using the CTT method; among these 252, 198 (79%) had meaningful worsening. Using the IRT method, 235 (54%) had reliable worsening; among these 235, 187 (80%) had meaningful worsening.

Discussion
Prior studies focusing on classification rates between RCI_CTT and RCI_IRT showed consistent classification in most patients (e.g., about 78% to 92% in [7; 8; 10]). The current study showed that, in PROMIS short forms on core symptoms in cancer patients, RCI_CTT and RCI_IRT agreed on the classifications of change 83% to 98% of the time. We also demonstrated how differences in standard errors, in relation to the score distributions, result in differing classification decisions for an individual by IRT and CTT in PROMIS measures. Disagreements in which CTT could not detect changes that were detected by IRT occurred when measurement errors were overestimated by CTT, where scores at both timepoints were in the ranges in which the IRT standard errors were smaller.

Hays et al. [8] used the 4-item scales in PROMIS-29 and found that CTT classified 21% as changing in emotional distress while IRT indicated no change. One may wonder why a larger proportion of patients was classified as changing only by CTT in their study, compared to 3% to 7% in anxiety and depression in the current study. The four anxiety items used in the emotional distress scale in Hays et al. [8] were also in the Anxiety 8a used in the current study, and the other four items in that scale were also in the Depression 8a used here. Both studies used EAP for standard error estimation. Their reliability estimates were 0.86 and 0.90, close in value to the ω_h estimates in the current study. Given these similarities, the major difference from the current study may be attributed to (1) test length: the prior study used standard error estimates based on four items (the average of the standard errors for the 4-item depression and 4-item anxiety scales), which would raise the SE(θ̂) curves shown for Anxiety and Depression in Fig. 1; and (2) the sample score distributions: if there were many patients whose emotional distress level was on the lower side at both time points (e.g., below the population norm), where information is lower, then RCI_CTT may have been overly optimistic about detecting changes in these patients.
This study collected responses from a large and diverse sample of patients recruited from multiple cancer centers, with a variety of cancer types and stages, and investigated a variety of core symptom domains with PROMIS short forms. Of note, this study used patient perspectives to identify meaningful change in PROMIS short forms. Statistically reliable change alone may not communicate whether patients also find the change meaningful. On the other hand, using a criterion only for meaningful change but not for reliable change can result in a logical contradiction. For example, in the fatigue short form, patients whose PRO-CTCAE scores improved had PROMIS change scores ranging from -23 to 13: this range includes 0 (no change) through 13 (worsening).
A limitation of the current study is that we have not investigated reliable change in CAT. Although the thresholds for reliable change derived from this study should be largely generalizable to cancer populations using PROMIS in the respective domains, further research can be conducted using the RCI with SE(θ̂) for PROMIS administered with CAT, to see whether changes can be better identified at lower symptom levels. As electronic health records (EHRs) facilitate longitudinal collection of PRO data, a data field indicating whether RCI_IRT exceeds a critical value may provide useful information on reliable worsening or improvement in addition to the T scores themselves.
For questionnaires developed with CTT methods, the RCI_CTT can also be implemented in the EHR. We showed (Fig. 1) that baseline and follow-up SEMs were either extremely close or equal, which suggests that computing the RCI based on the SEM of the pretest only would not have biased the results. This has implications for large-scale questionnaires created with CTT methods and implemented in EHRs: because a separate follow-up SEM is not needed, a data field indicating whether RCI_CTT exceeds a critical value can be populated immediately after the patient completes a follow-up questionnaire. One limitation of RCI_CTT is that, unlike an IRT-based measure calibrated on the U.S. general population, the SEM is a more sample-dependent statistic. Just as accurate estimates of all item parameters are required for the IRT methods to detect changes, accurate estimates of the SEM are necessary to determine whether an observed change in a new patient is reliable.
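Under the equal-SEM simplification noted here, a follow-up score can be flagged the moment it arrives, with only the baseline SEM on hand. A minimal sketch with illustrative values (not from the study):

```python
import math

def rci_ctt_baseline_only(t1, t2, sem1):
    """RCI using the baseline SEM only: (t2 - t1) / (sqrt(2) * sem1).
    Appropriate when pre- and posttest variances can be assumed equal,
    as suggested by the near-identical SEMs observed at both timepoints."""
    return (t2 - t1) / (math.sqrt(2.0) * sem1)

# Hypothetical patient: baseline SEM 3.0, T score rising from 50 to 58.
print(round(rci_ctt_baseline_only(50.0, 58.0, 3.0), 2))  # 1.89
```

Since nothing beyond the stored baseline SEM and the two T scores is required, this form is well suited to populating an EHR flag at questionnaire completion.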
The current study used T-scores to compute change scores in both the CTT and IRT methods, because T-scores are preferred over raw summed scores for the PROMIS measures, and we used the metric that is common when reporting PROMIS scores. It should be noted that T-scores are IRT-based scores. Because prior studies [25–27] have reported reliable change in PROMIS using the RCI with CTT-based standard errors despite using IRT-based scores, we used the same approach when applying the CTT method.
We demonstrated how categorical evaluation of patients' self-reported adverse events can be used to detect meaningful change in cancer patients' symptoms measured with PROMIS. A limitation of the current approach is that one would need to administer an additional item asking for a categorical evaluation of symptoms at each time point to determine whether a reliable change was also meaningful. Furthermore, the assumption that a 1-level change in PRO-CTCAE is meaningful can be tested in a qualitative study. Cut-scores for the PROMIS item banks in anxiety, depression, fatigue, and pain interference have been developed from clinician judgment with the bookmarking method [28]. A future study can investigate the detection of reliable and meaningful change in relation to cut-scores, as well as interpret change in PRO scores in concert with other aspects of an individual's situation (e.g., trajectory of illness and treatment, personal and social life circumstances, or goals and values). Applying the two-step criteria demonstrated in this study allows determining which individual cases changed reliably, provides a straightforward evaluation of the meaningfulness of the change, and facilitates the interpretability and communication of PRO results.

Conclusions
The current study demonstrated how two approaches, CTT and IRT, for calculating RCI converge or diverge in assessing individual-level change in PROMIS short forms on core symptoms experienced by cancer patients. The interpretation of change scores should take into account the standard errors that differ across the range of the scores whenever possible. We derived the thresholds for reliable change at different levels of baseline scores for investigators using the PROMIS instruments in oncology. We derived percentages of patients who had reliable and meaningful change as reference values for designing clinical trials.
Funding This study was supported by National Cancer Institute grants R01CA154537 (Sloan) and P30CA015083 (Diasio). Dr. Lee received partial support by the Mayo Clinic Comprehensive Cancer Center Grant (grant number P30CA15083) and Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery at Mayo Clinic.
Data availability Data can be made available upon reasonable request to the senior author. All requests will be reviewed.

Conflict of interest
The authors declare that they have no competing interest to report.
Ethical approval and consent to participate The study was reviewed by the IRB of each of the participating sites, and all patients provided consent to enter the study.

Consent for publication Not applicable.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.