Background

Function is a widely targeted area for osteoarthritis (OA) treatment, including pharmacological and non-pharmacological approaches. Patient-reported outcomes (PROs) that assess aspects of quality of life, such as decreased function, are increasingly recognized as an important component of OA intervention outcome assessment. Numerous PRO measures are used in OA research [1,2,3,4,5,6], but a review of the psychometric properties of 32 OA outcome measures concluded that these measures are uniformly rated low in terms of their ability to detect change [1]. The ability of a measure to detect change is of utmost importance when considering which measure to use for research trials and to monitor clinical progress.

Traditional fixed-form PRO measures, such as the 36-Item Short Form Health Survey (SF-36) and the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), consist of a fixed number of items that are administered to all participants. To reduce administration burden, these measures are constrained to a limited number of items and often lack coverage of the broad range of ability typically observed in patient populations [5,6,7]. This limitation of fixed-form PRO measures raises concerns about the loss of score precision and the reduced ability to measure clinically meaningful change [8, 9]. Furthermore, these measures require respondents to answer every question even though some may be redundant or inappropriate for an individual [10,11,12].

Item response theory (IRT) [13] and computerized adaptive testing (CAT) [14] are advances in measurement that have great potential to overcome limitations of fixed-form PRO measures and improve the ability to assess the wide range in functional ability seen in persons with OA. IRT-based measures use calibrated banks of items that are hierarchically organized from low to high ability for the outcome of interest. These banks of items can be administered as CATs, which employ computer-based algorithms to select items that match an individual’s ability based on his/her responses to previously administered items. Using this approach, a relatively small number of items (e.g., 4–10) can generate a precise estimate of the individual’s ability for a specific outcome domain. In this way, IRT/CAT measures can provide adequate measurement breadth, precision and sensitivity to change without being burdensome, characteristics that are important for measures used in research and clinical practice [15].
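As a rough illustration of the adaptive logic described above, the following Python sketch selects, at each step, the unused item with the greatest information at the current ability estimate, and stops at a maximum number of items or a target standard error. It is a simplification under stated assumptions: a dichotomous two-parameter logistic model, a grid-based expected a posteriori estimate, and a hypothetical item bank and respondent. It is not the algorithm used by the OA-CAT or PROMIS®, both of which rely on polytomous models.

```python
import math

def prob(theta, a, b):
    """2PL probability of endorsing an item (a: discrimination, b: difficulty)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = prob(theta, a, b)
    return a * a * p * (1.0 - p)

GRID = [g / 10.0 for g in range(-40, 41)]  # ability grid from -4.0 to 4.0

def eap(responses):
    """Posterior mean and SD of ability under a standard normal prior (assumed)."""
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)
        for (a, b), endorsed in responses:
            p = prob(theta, a, b)
            w *= p if endorsed else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def run_cat(bank, answer, max_items=10, se_target=0.4):
    """Administer items adaptively until max_items or se_target is reached."""
    responses, remaining = [], list(bank)
    theta, se = 0.0, 1.0  # start at the prior mean and SD
    while remaining and len(responses) < max_items and se > se_target:
        # Pick the unused item that is most informative at the current estimate.
        item = max(remaining, key=lambda it: information(theta, *it))
        remaining.remove(item)
        responses.append((item, answer(item)))
        theta, se = eap(responses)
    return theta, se, len(responses)

# Hypothetical bank: 13 items, equal discrimination, difficulties -3.0 to 3.0.
bank = [(1.5, b / 2.0) for b in range(-6, 7)]
# Hypothetical respondent who endorses every item easier than 1.0.
theta, se, n_items = run_cat(bank, answer=lambda item: item[1] < 1.0)
```

Because each item is chosen near the running ability estimate, items far too easy or too hard for the respondent are never administered, which is how a CAT keeps the test short.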

Two IRT/CAT measures are available for use in OA: one is disease-specific and one is generic. The Osteoarthritis Computerized Adaptive Test (OA-CAT) is an IRT/CAT measure developed specifically for use in OA [16, 17]. Disease-specific measures contain items, responses, or other aspects of the measurement that are developed specifically for the disease or condition being measured. In contrast, generic measures are designed to assess outcomes with the expectation that the measurement models can be applied across populations without regard to the patient’s specific condition [18]. The Patient Reported Outcomes Measurement Information System (PROMIS®), funded by the National Institutes of Health, was designed as a generic measure to promote comparability across studies and clinical diagnoses [19]. Several PROMIS® domains, including physical functioning [19], can be used to assess outcomes in persons with osteoarthritis.

Faced with measurement choices among emerging PROs for OA research, it is important to examine an instrument’s psychometric properties in the population of interest. To date, there are no sensitivity-to-change data among persons with OA for either the OA-CAT or the PROMIS® instruments. The intent of this study, therefore, was to examine the ability of two PRO CATs, the generic PROMIS® Physical Function (PF) CAT v1.0 and the osteoarthritis-specific OA-CAT Functional Difficulty (FD) scale, to detect change in physical function in a sample of adults with knee OA engaged in an exercise program designed to improve physical function.

Methods

Design and participants

This study, part of a randomized controlled trial to examine the effects of an intervention to improve long-term exercise adherence, reports on data collected at baseline and at the conclusion of a 6-week exercise program designed to improve function in persons with OA. The sample included 120 persons with knee osteoarthritis. The two CAT PRO measures were administered to all participants prior to engaging in a 6-week structured exercise program (baseline) and again after the exercise program was completed (post-exercise training). The exercise program was the same for all participants. OA-CAT FD and PROMIS® PF CAT data reported in this study were collected prior to randomization into intervention and control groups for the exercise adherence study.

Participants were recruited from the greater Boston area. Inclusion criteria were age > 50 years with self-reported, doctor-diagnosed knee OA and knee pain, which was determined by answering yes to both “Have you had pain on most days of the previous month?” and “Have you had pain for most months of the past year?” Exclusion criteria included a medical condition precluding exercise (stroke or myocardial infarction in the past 3 months, treatment for cancer, severe systemic disease), a medical condition that limits physical function more than the knee pain (including back or hip pain more severe than knee pain), inflammatory arthritis, current regular exerciser, regular resistance training (one or more times per week) in the last 6 months, plan for knee replacement during the trial, dementia or inability to follow exercise instructions and use the exercise adherence intervention.

Measures

Characteristics of study measures are summarized in Table 1. Study measures included two IRT-based PRO measures that use a CAT approach: the PROMIS® PF CAT v1.0 scale and the OA-CAT FD scale. Development of PROMIS® measures was initiated by the National Institutes of Health Roadmap for Medical Research to advance the science and application of PROs in chronic diseases [18, 20]. The PROMIS® PF CAT is linked to a U.S. general population and the scale has a mean of T = 50 with an SD of 10 [21]. The PROMIS® PF scale assesses one’s capability to perform a variety of physical activities, including mobility (lower extremity function), dexterity (upper extremity function), axial (neck and back function), and ability to carry out instrumental activities of daily living. The PROMIS® PF v1.0 CAT was administered using the default stopping rule (a minimum of 4 items, stopping at a maximum of 12 items or on reaching a standard error of 0.40).

The OA-CAT [16, 17] was developed to measure conceptually distinct dimensions of functional pain, functional difficulty, and disability that are relevant for OA clinical practice and research. The OA-CAT was developed based on an osteoarthritis sample, with a scale range from 10 (lowest score) to 90 (highest score). The OA-CAT stopping rule is a maximum number of items (10) or reaching a standard error < 0.25. Preliminary studies show that the OA-CAT has improved breadth, precision and reliability and reduced floor and ceiling effects compared with the WOMAC, a traditional fixed-form PRO instrument [16, 17]. In this study, the OA-CAT FD scale was compared to the PROMIS® PF CAT. The OA-CAT was developed and its psychometric properties were examined using an analytic approach similar to that of PROMIS®, with three differences. First, instead of the Graded Response Model (GRM), we calibrated the OA-CAT item bank using the Generalized Partial Credit Model (GPCM). Because the two models have exactly the same number of parameters, and data that fit one model usually also fit the other [22], the choice of model should have little impact on item bank calibration. Second, we did not remove locally dependent items as PROMIS® did; instead, the CAT program treats them as “enemy” items and allows only one item to be selected from each set of locally dependent items. Third, we estimated person scores using weighted maximum likelihood (WML) estimation rather than the expected a posteriori (EAP) method. Studies have shown that the WML estimator has less bias than the EAP estimator [23, 24].
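The “enemy” item rule can be illustrated with a short Python sketch: once any item from a set of locally dependent items has been administered, the rest of that set is removed from the pool of candidates. The function name, item identifiers, and enemy sets below are hypothetical and are not taken from the OA-CAT software.

```python
def eligible_items(bank, administered, enemy_sets):
    """Return items that may still be selected under the enemy-item rule."""
    blocked = set()
    for group in enemy_sets:
        # If any member of a locally dependent set was used, block the whole set.
        if group & administered:
            blocked |= group
    return [item for item in bank
            if item not in administered and item not in blocked]

# Hypothetical item identifiers; the two walking items are treated as enemies.
bank = ["walk_block", "walk_mile", "stairs", "chair"]
enemy_sets = [{"walk_block", "walk_mile"}]
candidates = eligible_items(bank, administered={"walk_block"}, enemy_sets=enemy_sets)
```

This approach keeps locally dependent items in the bank, so either one remains available for a given respondent, while preventing both from contributing to the same score.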

Table 1 Summary of Measures

Data analyses

Descriptive analysis: Descriptive analyses were performed on assessments completed at baseline and post-exercise training. Means and standard deviations (SD) were computed for each measure: PROMIS® PF CAT and OA-CAT FD.

OA-CAT FD and PROMIS® PF comparability: It is important to examine comparability when assessing the ability of the measures to detect change. Four analyses were conducted to compare the measures: (1) Pearson correlations, to examine the degree to which the OA-CAT FD and PROMIS® PF CAT assess a similar construct; (2) score reliability, calculated as 1 − (average squared standard error / variance of scores); (3) score distributions; and (4) the number of items administered and the time to complete each instrument.
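The score reliability calculation can be sketched in Python as follows. The use of the sample variance (n − 1 denominator) is an assumption, since the text does not specify the variance estimator, and the scores and standard errors are hypothetical values rather than study data.

```python
import statistics

def score_reliability(scores, standard_errors):
    """Reliability = 1 - (mean squared standard error / variance of scores)."""
    error_variance = statistics.mean([se ** 2 for se in standard_errors])
    score_variance = statistics.variance(scores)  # sample variance (assumed)
    return 1.0 - error_variance / score_variance

# Hypothetical CAT scores and their standard errors for five respondents.
scores = [40.0, 45.0, 50.0, 55.0, 60.0]
standard_errors = [2.5, 2.5, 2.5, 2.5, 2.5]
reliability = score_reliability(scores, standard_errors)
```

In this example the mean error variance is 6.25 and the score variance is 62.5, giving a reliability of 0.90.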

Ability to detect change: Change scores were calculated based on assessments completed at baseline and post-exercise training. The Cohen effect size (ES) was computed as a standardized indicator of the ability of each instrument to detect true change [25]. The ES expresses change scores in terms of the underlying sampling distribution using SD estimates [26], and was calculated as (M2 − M1)/Sb [27], where M2 is the mean post-intervention score, M1 is the mean pre-intervention score, and Sb is the baseline SD of each assessment. Larger ES values indicate greater ability to detect change. An ES of 0.2 reflects small change, an ES of 0.5 reflects moderate change, and an ES of 0.8 reflects large change [27].
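A minimal Python sketch of the ES calculation follows. The sample SD (n − 1 denominator) is assumed for Sb, and the scores are hypothetical values chosen for illustration, not study data.

```python
import statistics

def effect_size(baseline, follow_up):
    """Standardized effect size (M2 - M1) / Sb, with Sb the baseline SD."""
    m1 = statistics.mean(baseline)
    m2 = statistics.mean(follow_up)
    sb = statistics.stdev(baseline)  # sample SD (assumed estimator)
    return (m2 - m1) / sb

# Hypothetical pre- and post-intervention scores for five respondents.
baseline = [40.0, 45.0, 50.0, 55.0, 60.0]
follow_up = [44.0, 48.0, 55.0, 59.0, 64.0]
es = effect_size(baseline, follow_up)
```

Here the mean rises from 50 to 54 against a baseline SD of about 7.91, so the ES is about 0.51, a moderate change by the thresholds above.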

The bootstrap method was used to calculate the 95% confidence interval (CI) for the ES estimate of each measure and to test for a significant difference between the ES estimates of the two measures [27]. Five thousand bootstrap samples of the difference in ES estimates between the measures were generated, the estimated differences were rank-ordered, and the 95% CI for the difference was calculated from the ordered values.
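One way to carry out such a bootstrap comparison is sketched below in Python. It assumes that participants are resampled with replacement as whole units, so that each person's scores on both measures stay paired, and that the CI is read off the ordered differences by rank (a percentile interval). These details, the function names, and the simulated data are assumptions for illustration and may differ from the procedure actually used.

```python
import random
import statistics

def effect_size(pre, post):
    """Standardized effect size (M2 - M1) / Sb, with Sb the baseline sample SD."""
    return (statistics.mean(post) - statistics.mean(pre)) / statistics.stdev(pre)

def bootstrap_es_difference(pre_a, post_a, pre_b, post_b, n_boot=5000, seed=0):
    """Percentile 95% CI for ES(measure A) - ES(measure B)."""
    rng = random.Random(seed)
    n = len(pre_a)
    diffs = []
    for _ in range(n_boot):
        # Resample participants, keeping each person's four scores together.
        idx = [rng.randrange(n) for _ in range(n)]
        es_a = effect_size([pre_a[i] for i in idx], [post_a[i] for i in idx])
        es_b = effect_size([pre_b[i] for i in idx], [post_b[i] for i in idx])
        diffs.append(es_a - es_b)
    diffs.sort()
    # Approximate 2.5th and 97.5th percentiles of the ordered differences.
    lower = diffs[(n_boot * 25) // 1000]
    upper = diffs[(n_boot * 975) // 1000 - 1]
    return lower, upper

# Simulated scores for 30 hypothetical participants (not study data).
rng = random.Random(1)
pre_a = [rng.gauss(50, 10) for _ in range(30)]
post_a = [x + rng.gauss(5, 4) for x in pre_a]
pre_b = [rng.gauss(45, 8) for _ in range(30)]
post_b = [x + rng.gauss(3, 4) for x in pre_b]

# Fewer replicates than the 5000 used in the study, to keep the demo quick.
ci_lower, ci_upper = bootstrap_es_difference(pre_a, post_a, pre_b, post_b, n_boot=1000)
```

If the resulting interval excludes zero, the two ES estimates differ significantly at the 5% level; if it includes zero, no significant difference is detected.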

Results

Table 2 shows the baseline demographic characteristics of the 120 participants who consented to the study; of these, 104 completed the exercise program and provided follow-up data. Demographic characteristics did not differ significantly between those enrolled at baseline and those who completed the 6-week exercise program, except for the percentage of the sample identified as obese, which was significantly lower among exercise program completers.

Table 2 Sample Demographics

Instrument comparability

OA-CAT FD and PROMIS® PF CAT scores were strongly correlated, with similar correlation coefficients at baseline and post-exercise training (0.65 and 0.69, respectively). OA-CAT FD and PROMIS® PF CAT score reliability values were similar at baseline (0.90 and 0.85, respectively) and post-exercise training (0.89 and 0.85, respectively). OA-CAT FD and PROMIS® PF CAT score distributions at baseline and post-exercise training were also similar (see Fig. 1a and Fig. 1b).

Fig. 1 (a) OA-CAT Functional Difficulty Score Distribution: Baseline and Post-Exercise Training. (b) PROMIS® Physical Function CAT Score Distribution: Baseline and Post-Exercise Training

The average number of items administered by the OA-CAT FD at baseline was 10.03 (SD = 2.62) and the time to complete was 2.76 (SD = 1.48) minutes. For the PROMIS® PF, an average of 4.14 (SD = 0.54) items were administered and it took 1.8 (SD = 0.66) minutes to complete.

Sensitivity to change

The mean changes in OA-CAT FD and PROMIS® PF CAT scores from baseline to post-exercise training are presented in Table 3. Both the OA-CAT FD and the PROMIS® PF CAT showed significant ESs, with the OA-CAT FD having a somewhat greater ES [0.62 (CI = 0.43, 0.87)] than the PROMIS® PF CAT [0.42 (CI = 0.24, 0.63)]. A statistical comparison of the ES for the two measures is displayed in Fig. 2, demonstrating no significant difference in ES values between the OA-CAT FD and the PROMIS® PF CAT [95% CI = (−0.43, 0.004)].

Table 3 Descriptive statistics of PROMIS® PF CAT and OA-CAT FD

Fig. 2 OA-CAT Functional Difficulty and PROMIS® Physical Function Effect Size

Discussion

This study compared the relative ability of two PROs, the OA-CAT FD (an OA-specific measure) and the PROMIS® PF CAT (a generic measure), to detect change in a sample of adults with knee OA who completed an exercise program designed to improve physical function. Results demonstrated that both instruments are able to detect change in function after participation in an exercise program known to have a beneficial effect on functional outcomes. The strong correlation between the two instruments provides evidence that they measure a similar construct. Both scales were similar in terms of score reliability and score distribution. The OA-CAT administered more items and took longer to administer. This is likely because the OA-CAT includes the response option ‘Did not do the activity for reasons other than the arthritis in his/her legs.’ If this response is selected, the item is not included in the score determination. Although the OA-CAT administers more items and takes an additional minute to complete, its score may more accurately reflect the impact of arthritis on function. While no statistically significant difference in ES values was noted between the two instruments, trends suggest that the OA-CAT scale achieved a larger ES. These findings may indicate that the exercise intervention was more effective for improving function impacted by OA than it was for improving general physical function. Thus, the OA-CAT FD may be preferred for use in clinical trials, particularly when study power is somewhat compromised by sample size. Nonetheless, our study also supports use of the PROMIS® PF CAT to measure change in function among people with OA. The PROMIS® PF CAT mean score and range at baseline are similar to reports using the PROMIS® PF short forms to assess function in a sample with symptomatic knee OA [28]. This study used the Physical Function item bank v1.0; newer, refined PROMIS® item banks include a Mobility v1.2 item bank focused on a range of activities from getting out of a bed or chair to activities such as running. This item bank may function better in a sample of persons with osteoarthritis than the Physical Function v1.0 item bank, which includes upper extremity activities and instrumental activities of daily living [29].

Performance-based measures collected in this study included the five-times sit-to-stand test (FTSST), which has been shown to detect change in persons with OA completing a strength training program [30]. We examined FTSST baseline and post-exercise training scores to provide additional evidence that the exercise program resulted in change. The FTSST ES (0.5) is similar to the ES reported for the OA-CAT (0.62) and PROMIS® (0.54), indicating that the exercise program produced similar change in a measure of performance. However, changes in strength may not translate into a similar change in function. In fact, correlations between FTSST times and OA-CAT and PROMIS® scores were weak at baseline and post-exercise training (−0.21 to −0.34), indicating that the FTSST measures a different construct.

Both the OA-CAT FD and the PROMIS® PF CATs provide the many advantages of contemporary measurement approaches compared to traditional fixed form PROs [31]. The primary difference between the two instruments is that the PROMIS® Physical Function CAT is a generic PRO measure and the OA-CAT is an osteoarthritis-specific PRO. Debates over the pros and cons of generic and disease-specific functional measures are frequently found in the literature [32, 33]. The generic instruments are universally applicable across diseases by measuring an overall condition, while the disease-specific instruments can usually tap into more and deeper disease-specific effects. Our study findings indicated that both the generic PRO measure and the osteoarthritis-specific PRO measure showed significant sensitivity to change and that both have potential usefulness.

This study has several limitations. First, the sample size was relatively small and some participants were lost to follow-up; with a larger sample, a significant difference in ESs may have been detected. Second, we did not compare the sensitivity to change of these instruments to other standard patient-reported outcomes (e.g., the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC)). Third, the exercise program was relatively brief, and a longer duration could have resulted in greater ‘true’ change, enhancing the ability of these instruments to detect change. Fourth, we have only one follow-up data point and are unable to determine sensitivity to change over longer periods of time, which is often needed for chronic conditions such as OA.

Conclusions

This investigation of the sensitivity to change of the OA-CAT FD and the PROMIS® PF CAT suggests that both instruments have significant sensitivity to change in persons with OA. While no statistically significant difference in sensitivity to change was found between the two instruments, trends suggested that the OA-CAT achieved a larger ES. Both instruments have potential usefulness in arthritis research.