Background

There has been an important shift toward the development and use of high-quality patient-reported outcome (PRO) instruments that minimize respondent burden and exhibit sufficient reliability, validity, and clinical relevance. [1] These tools support the accurate measurement of clinical outcomes, which is fundamental both for rigorous clinical research and for improving the quality of care offered to patients. For PRO instruments to deliver these research and clinical benefits, validation studies are critical. In keeping with this emphasis, the Journal of Patient-Reported Outcomes includes rigorous studies on the development and evaluation of PROs in its aims and scope. [2] Determining whether a PRO instrument is responsive (i.e., able to detect a change in a patient's reported health status or function) is an important prerequisite for using such instruments to assess treatment effect.

The Patient-Reported Outcomes Measurement Information System (PROMIS) health measure improvement initiative was funded by the National Institutes of Health with the purpose of improving the quality of PROs. Development took a systematic approach, drawing from instruments already in use: existing items were categorized, reviewed, and revised, creating a large pool of items that were then calibrated with item response theory to allow for computer adaptive testing, making the PROMIS instruments an important contribution to clinical and research practice while minimizing respondent burden. [3,4,5] The PROMIS Physical Function Computer Adaptive Test (PF CAT) and PROMIS Upper Extremity (UE) CAT can be used to measure patients' self-reported upper extremity health status, and have several advantages over other metrics. [6] The PROMIS UE and PF CAT have both demonstrated favorable performance characteristics and correlate well with the shortened version of the Disabilities of the Arm, Shoulder, and Hand (qDASH) in an orthopaedic upper extremity patient population. [7, 8] The responsiveness of these PROMIS instruments, however, has not yet been evaluated in this same patient population.

Assessing responsiveness requires longitudinal data with repeated measures, where the same individual is assessed with the same instrument on at least two occasions. [9] Responsiveness can be assessed with either internal or external methods. Internal analysis of responsiveness evaluates the level of change based on the size of the differences between scores, and how much scores vary over time. [10] External responsiveness methods use an external anchor to relate the level of change to some other meaningful report of patient change, either a clinical gold-standard assessment or the patient’s own report of change. [11, 12] Detecting change is particularly important for PRO instruments if they are to be used to guide decisions in clinical practice.

The purpose of this study, therefore, is to evaluate the responsiveness of three PROMIS patient-reported outcome measures in patients with hand and upper extremity (non-shoulder) disorders and provide comparisons with the qDASH legacy instrument.

Methods

Patient sample

Institutional Review Board approval was obtained prior to the start of this study and informed consent was obtained from all participants as they sought medical care for orthopaedic conditions. The sample consisted of 255 new patients presenting to an academic upper-extremity (non-shoulder) clinic between the years of 2014 and 2016. All patients were 18 years or older and sought treatment for upper extremity musculoskeletal conditions. At the time of their clinic visits and prior to seeing a physician, patients were administered anchor questions and PROs electronically on a handheld tablet computer. Patients were recruited consecutively and PROs were administered as part of the standard clinic treatment protocol, with 1.5% of patients refusing to participate clinic-wide.

Patients were seen for a variety of upper extremity conditions, with treatments including wound and bone care, skin grafts, tendon/ligament repair, incisions, implants, bursas, reconstructions, fractures, transplants, decompression, arthroscopy, endoscopy, nerve blocks, and carpal tunnel surgery. Depending on individual patient circumstances and the timing of follow-up care, different patient samples could be included in the different follow-up periods (see Table 1). Also, depending on the diagnostic condition and treatment plan, patients differed in the amount or type of treatment received during the follow-up periods. This variation in treatment and follow-up timing is typical of a standard UE orthopaedic practice. Four patient follow-up periods were examined in this study: (1) 3-month follow-up (i.e., 80 to 100 days after initial assessment); (2) >3-month follow-up (i.e., 90 days or more after initial assessment); (3) 6-month follow-up (i.e., 170 to 190 days after initial assessment); and (4) >6-month follow-up (i.e., 180 days or more after initial assessment). Three and six months are common follow-up time-points in orthopaedic practice. [13,14,15,16,17,18,19,20] These time-points were included in this analysis to correspond with prior literature and clinical practice.

Table 1 Demographics of patients

Patient-reported outcome measures

Three PROMIS instruments were administered to the patients: the PROMIS UE CAT v1.2, the PROMIS PF CAT v1.2, and the PROMIS Pain Interference (PI) CAT v1.1. The PROMIS PF CAT v1.2 contains both upper extremity and lower extremity items and draws from a 121-item test bank. The PROMIS UE CAT v1.2 has a 16-item test bank, and the PROMIS PI CAT v1.1 has a 40-item test bank. The qDASH was also administered, which is an 11-item, validated, shortened version of the 30-item Disabilities of the Arm, Shoulder, and Hand (DASH) instrument. [21] The PROMIS instruments were made available through the Assessment Center, a secure web-based portal established by PROMIS developers. [22] Each of the four instruments was administered at baseline (i.e., either within seven days prior to the clinic visit for a new upper extremity condition or on the day of the first clinic visit) and at each follow-up visit patients attended.

All PROMIS instruments were calibrated in the general population with a mean of 50 and a standard deviation of 10 on the T-score scale, with patient scores primarily clustering between 20 and 80 points. [23] Higher PROMIS PF or UE scores indicate better function, whereas higher PROMIS PI scores indicate greater pain interference. The qDASH scores range from 0 to 100, with higher scores representing lower functioning levels.

Anchor questions

For physical function, patient responses were anchored by the question: ‘Compared to your FIRST EVALUATION at the University Orthopaedic Center: how would you describe your physical function now?’ (much worse, worse, slightly worse, no change, slightly improved, improved, much improved). Anchoring a change score to another measure of patient outcome provides a reference point. When that reference point comes from patient reports of noticeable improvement or decline, it may be considered a meaningful level of change. [24] Patients reporting meaningful change (much worse, worse, improved, much improved) were included in the responsiveness analysis to detect the ability of the PROs to measure meaningful levels of change. [25] When there is symmetry in the data, the improved and deteriorated change groups can be considered together, creating a distinction between those experiencing change and those with stable symptomology. [26]

For the PI, the anchor question queried pain (i.e., Compared to your FIRST EVALUATION at the University Orthopaedic Center: how would you describe your episodes of PAIN now?) rather than physical function, and patients reporting pain which was worse, much worse, improved, or much improved since their first clinic visit were included in the responsiveness analyses.

Statistical analysis

Patient demographics were examined and changes in their functional and pain outcomes were evaluated at four time points. Baseline scores were compared to the three-month follow-up scores (90 days plus or minus 10 days), six-month follow-up scores (180 days plus or minus 10 days), 90 days and beyond follow-up scores, and 180 days and beyond follow-up scores on all four patient-reported measures.

Change in the PRO metrics was calculated as the absolute value of the difference between the baseline score and the follow-up score for each patient. A two-sided paired-sample t-test was used to test the hypothesis that there was no difference in the PRO measures between time points at the individual patient level [10], with the significance level set at p = 0.05. ANOVA was run to test the hypothesis that patients did not differ across levels of change.

A standardized measure of effect size (ES) was calculated using Cohen’s d. Cohen’s d is computed as the mean difference between baseline and follow-up scores divided by the standard deviation of the baseline scores. This method takes the variability of scores into consideration, a step beyond the mean differences considered in the paired-sample t-test [10]. In interpreting Cohen’s d, small, medium, and large effect sizes can be considered as d = 0.20, 0.50, and 0.80, respectively.

The standardized response mean (SRM) is another important indicator of ES, similar to the paired t-test but with the dependence on sample size removed from the equation. [10] It is computed as the mean difference between baseline and follow-up PRO scores divided by the standard deviation of the difference scores, reflecting individual changes in scores. Although there is not perfect consensus, recommended guidelines for interpreting SRM values are similar to those for Cohen’s d. [10] All analyses were performed using either SPSS 23.0 (IBM SPSS Statistics for Windows, Armonk, NY: IBM Corp.) [27] or R 3.3.0 (R Development Core Team, Vienna, Austria: R Foundation for Statistical Computing) [28].
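The distinction between the two effect-size denominators can be shown in a short sketch (hypothetical scores, not study data): Cohen’s d divides the mean change by the baseline standard deviation, while the SRM divides it by the standard deviation of the change scores.

```python
import statistics

def cohens_d(baseline, follow_up):
    """Effect size: mean change divided by the SD of baseline scores."""
    diffs = [f - b for b, f in zip(baseline, follow_up)]
    return statistics.mean(diffs) / statistics.stdev(baseline)

def srm(baseline, follow_up):
    """Standardized response mean: mean change divided by the SD of change scores."""
    diffs = [f - b for b, f in zip(baseline, follow_up)]
    return statistics.mean(diffs) / statistics.stdev(diffs)

# Hypothetical baseline and follow-up T-scores for five patients
baseline = [40.0, 45.0, 50.0, 55.0, 60.0]
follow_up = [44.0, 57.0, 64.0, 61.0, 74.0]
print(round(cohens_d(baseline, follow_up), 2))  # → 1.26
print(round(srm(baseline, follow_up), 2))       # → 2.13
```

With these toy numbers the SRM exceeds Cohen’s d because the change scores vary less than the baseline scores; when baseline heterogeneity is large relative to the consistency of individual change, the two indices can diverge in exactly this way.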

Results

This study included a total of 131 females and 124 males with ages ranging from 18 years to 90 years (mean age = 50.75, SD = 15.84). For demographic information including gender, race, ethnicity, tobacco use, procedure and insurance type, see Table 1.

Mean, SD, range, and median scores, along with mean differences, for the PROMIS UE, PF, and PI and the qDASH are presented in Table 2. Mean change scores for the PROMIS PI ranged from 4.81 to 10.68, whereas mean no-change scores ranged from 4.32 to 6.05. The PROMIS PI at 3-month and >3-month follow-up and the qDASH at >3-month follow-up were the only measures and time-points with confidence intervals (CIs) showing a substantial difference between change groups (see Table 3). The PROMIS PF mean change scores ranged from 8.36 to 8.91, whereas mean no-change scores ranged from 5.92 to 9.00. The UE had mean change scores ranging from 7.57 to 9.51 and mean no-change scores ranging from 6.67 to 8.21. Lastly, the qDASH showed mean change scores between 18.18 and 24.22 and mean no-change scores between 17.21 and 24.40.

Table 2 Descriptive statistics of PROMIS instruments and qDASH of patients
Table 3 Mean Score Changes for PROMIS Instruments and qDASH

Only 20% of the patient sample had baseline PROMIS PF scores at or above the population mean T-score of 50, 5% had PROMIS UE scores over 50, and 5% had a PROMIS PI pain score of 50, indicating that this group had low levels of function and high levels of pain at baseline.

Paired t-test

At the 3-month, 6-month, and >3-month follow-ups, changes from baseline scores were significant for all instruments (p < 0.05). However, score changes for the >6-month time period varied in significance: the UE CAT was the only instrument that did not show a significant change in scores (p = 0.253), whereas the PF CAT, PI CAT, and qDASH showed significant changes (p < 0.05; see Table 4). For all instruments, baseline scores did not differ significantly between patients with missing and non-missing follow-up visit scores at any time point (p > 0.05) (results available upon request).

Table 4 Responsiveness of PROMIS instruments and qDASH of patients from baseline

Effect size

All four instruments showed a high degree of responsiveness across all four follow-up periods. For the 3-month follow-up group, effect sizes ranged from 0.84 to 1.48. The most responsive instrument at the 3-month follow-up was the PI CAT (1.48), whereas the PF CAT was the least responsive (0.84).

The 6-month follow-up also showed high responsiveness, ranging from 0.79 to 0.85. The PI CAT was the least responsive at the 6-month follow-up (0.79), whereas the UE CAT was the most responsive (0.85). For the >3-month follow-up period of 90 days or more, responsiveness remained high (0.92–0.99). The least responsive measure for this period was the UE CAT (0.92), while the PI CAT showed the highest responsiveness (0.99). For the >6-month period of 180 days or more, all instruments still showed high responsiveness, with the PI CAT the most responsive (0.97) and the UE CAT the least (0.85). Overall, the PI CAT was consistently the most responsive to change as measured by ES (see Table 4). The 95% CIs of the effect sizes demonstrate a meaningful difference in measure responsiveness at each follow-up time-point for each instrument, though the CI range for all measures dipped low enough to include the potential for a small effect in the 6-month follow-up period.

Standardized response mean

All instruments had high responsiveness as measured by the SRM (1.05–1.63). The 95% CIs around the SRMs were all medium to large, ranging from 0.51 to 2.18, and reflect the overall larger effect measured by the SRM compared with the ES for every measure at every time-point. In the 3-month follow-up group, the most responsive instrument was the PI CAT (1.63), while the PF CAT was the least responsive (1.05) among the four. At the 6-month follow-up, the PROMIS UE was the most responsive (1.42), whereas the PI CAT was the least (1.16). In the >3-month follow-up period of 90 days or more, the PI CAT was the least responsive instrument (1.09), whereas the qDASH was the most responsive (1.26). For the >6-month follow-up period of 180 days or more, the UE CAT had the highest SRM (1.43) while the PI CAT had the lowest (1.15). In general, the UE CAT was the most responsive to change when applying the SRM (see Table 4).

Discussion

The main finding of this study is that the PROMIS Upper Extremity CAT, Physical Function CAT, and Pain Interference CAT are responsive to patient-reported functional change in a hand and upper extremity (non-shoulder) orthopaedic population. In addition, the magnitude of the responsiveness of each instrument was large. The three statistical methods utilized (SRM, ES, and paired t-test) provided similar results in most instances. However, the external validity of assessing change was poor for the PROMIS PF and UE, as well as for some follow-up time-points of the PROMIS PI and qDASH, when mean scores were compared between the subsamples reporting no change versus meaningful change in condition.

We tested a traditional time-frame for three-month and six-month follow-up, capturing a window of 10 days on either side of the follow-up cut-off. Strict cut-off limits exclude scores from patients whose follow-up visits did not fall within the narrow time-frames. The relevance of the sampling cut-offs to interpretation can be seen in the small sample size (18–20 participants) of the 6-month follow-up group (170–190 days). This restricted sample was the only time-point at which the 95% CI around the effect size ranged low enough to include the potential for a small effect. In contrast, the larger sample sizes in the other follow-up periods yielded CIs spanning medium/large to large effects. We also tested 90 days and beyond and 180 days and beyond as alternative time-frames to assess the robustness of these cut-offs with respect to the measures' responsiveness. Our finding that comparable effect sizes could be seen across the differing follow-up cut-offs, with minimal exceptions, provides cross-validation for the commonly used three- and six-month follow-up cut-off points.

It is interesting to note that the time-period in which change scores were the greatest differed for different instruments. For the PROMIS PF, there was little difference between change scores at 3 and 6 month follow-up. For the PROMIS PI, pain interference change was greater at the earlier follow-up points. The PROMIS UE and qDASH similarly showed more change in function at earlier time points. These differences likely represent the greater heterogeneity in patient condition and treatment factors that occur by later measurement periods, but may also reflect the nature of improvement in upper extremity disorders. It may also reflect the low level of functioning and high levels of pain reported by this sample of upper extremity patients at baseline visits.

Prior work on the measurement characteristics of the PROMIS UE, PF, and PI CAT in a hand and upper extremity patient population has demonstrated the validity of these measures while minimizing respondent burden [8, 29,30,31,32]. Whether these PROMIS instruments are able to detect patient-reported change in health or function, however, has remained an important albeit open question. This study demonstrates the responsiveness of these three PROMIS instruments. Understanding responsiveness to change is essential in translational research to advance clinical trials and comparative effectiveness studies and, most importantly, to improve clinicians' ability to interpret outcome measures, enabling more meaningful interactions with patients.

Limitations

All patients visiting the hand and upper extremity orthopaedic clinic were included in the assessment of responsiveness, and we did not characterize our results based on individual diagnoses or treatments. Differing disease conditions and/or treatments may show different responsiveness indices, and therefore the findings of this study should be considered preliminary. Future work may include investigation of the responsiveness of the PROMIS instruments for individual conditions and treatments. The sample size for the 6-month follow-up was small, and results from this time-point may not be as reliable as those with larger samples. We are continuing to collect data from patients and will conduct further study with larger samples and different time frames as data become available. Future work should analyze upper extremity conditions at varying levels of function, not just change, to determine whether the instruments are as responsive for patients with high functioning as for those with lower levels of function. It would also be useful to consider differences by anchor score among those reporting varying levels of improvement. The PROMIS PF has been shown to have a ceiling effect, especially for items in the upper extremity areas of function. [29, 33] In this patient population, functioning levels were low, so the ceiling effect likely did not impact the results. Both the PROMIS PF and PROMIS UE would benefit from additional analysis of responsiveness at the upper levels of function in future research, potentially using Rasch modeling based on the distribution of scores rather than external anchoring.

Conclusions

The PROMIS UE CAT, PF CAT, PI CAT, and qDASH were able to effectively detect change in physical function and pain interference in an orthopaedic hand and upper extremity clinic. The responsiveness of the PROMIS instruments demonstrated by this study adds to the prior rigorous psychometric validation of instruments reported in the literature, and should assist clinicians and researchers to make informed decisions regarding instrument selection in assessing patient reported outcomes in the upper extremity [34].