Main text

Introduction

Patient-reported outcome measures (PROMs) are validated and standardized questionnaires completed by patients that evaluate a patient’s insight into their quality of health, return to baseline function, and mental well-being [1]. Within the field of knee arthroplasty, commonly used PROMs include the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), the Knee Injury and Osteoarthritis Outcomes Score (KOOS), the Oxford Knee Score (OKS), and the Lower Extremity Activity Scale (LEAS). These validated PROMs are collectively referred to as legacy instruments. However, completion of PROMs can be a challenging technological and administrative enterprise [2,3,4]. Additionally, lack of standardization among PROMs leads to multiple PROM questionnaires being administered per patient; this can increase the potential for “survey question fatigue” and resultant data incompleteness [5,6,7]. Further, Sabah et al. demonstrated that though there are eight joint-specific PROMs available to evaluate knee replacement outcome scores, only three of these (KOOS, LEAS, and WOMAC) have sufficient evidence of validity [8]. Overall, financial barriers, administrative constraints, and lack of standardization have been cited as obstacles to effective PROM collection and utilization. [5,6,7]

The Patient-Reported Outcome Measurement Information System (PROMIS) is an envisioned gold-standard outcome measurement tool intended to provide psychometrically standardized and validated patient-reported outcomes [9]. PROMIS measures a patient’s physical and mental health across multiple domains such as pain intensity (PROMIS Pain Intensity), pain interference (PROMIS PI), and physical function (PROMIS PF) [10]. Pain interference assesses the effects of pain on the important aspects of one’s life, such as the extent to which pain prevents engagement with emotional, social, physical, and cognitive activities. Likewise, PROMIS global domains such as global mental health (GMH) and global physical health (GPH) assess the overall state of one’s mental and physical health, respectively. It is administered in two forms: a computer adaptive test (CAT) and short form. Based on the item response theory (IRT), the CAT test is a sequence of consecutive questions tailored for delivery based on real-time patient responses, thus allowing for streamlined questioning based on the subject’s previous responses. Studies have shown CAT to be more time-efficient than traditional legacy instruments, requiring fewer questions to reach the identical level of responsiveness [11, 12]. Other advantages of PROMIS include its validity, responsiveness, coverage, and decreased floor and ceiling effects compared with legacy instruments across various orthopedic patient populations. [13,14,15]

To support widespread implementation of PROMIS within the practice of specific orthopedic subspecialties and in the perioperative periods of specific procedures, it is important to evaluate the validity and current utilization of these instruments. The purpose of this study is to provide a systematic review of the literature pertaining to PROMIS validation and utilization as an outcome metric in total knee arthroplasty (TKA) patients.

Methods

This systematic review was performed according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [16]. The PROSPERO database was searched by two of the authors (NC, PG) for existing systematic reviews on the present topic. Pubmed/MEDLINE and Embase databases were searched using the search terms (“patient-reported outcomes measurement information system” OR “PROMIS”) AND (“knee”) AND (“arthroplasty”) to identify all relevant literature. Inclusion and exclusion criteria for the resultant abstracts were applied once duplicates were excluded. Two independent reviewers evaluated the identified abstracts; if no consensus was reached regarding the choice to include or exclude a particular study, the senior author was consulted, and a consensus was reached. For each included article, the references were manually scanned for additional papers not identified in the original database search. As this was a systematic review of published literature, institutional review board approval was not required.

Published articles reporting on PROMIS in TKA patients were identified (Fig. 1). The inclusion criteria were as follows: any study with level of evidence I–IV that reported the psychometric properties [Pearson and Spearman correlation with legacy instruments, floor and ceiling effects, responsiveness, and minimum clinically important difference (MCID)] and/or utilization of PROMIS in total knee arthroplasty. Study characteristics, patient demographics, psychometric properties, and PROMIS outcomes were recorded and analyzed. Exclusion criteria were as follows: level of evidence V, conference abstracts, reviews, articles written in non-English languages without translation available, and studies that did not distinguish between TKA or total hip arthroplasty (THA), revision TKA, unicompartmental knee arthroplasty, or general knee surgery. Studies were further excluded if PROMIS data were not able to be directly extracted.

Fig. 1
figure 1

Step-by-step process of study identification and inclusion

The following information was collected: level of evidence, study design, number of patients, mean age of patients, and gender distribution. Pearson’s r and Spearman’s rho were collected to assess the correlation between PROMIS and legacy measures, which included: KOOS Joint Replacement (KOOS-JR), KOOS Functions in Activities of Daily Living (KOOS-ADL), KOOS Sports, KOOS Physical Function Short Form (KOOS-PS), Brief Resilience Scale (BRS), OKS, and modified single assessment evaluation (M-SANE). In this systematic review, very weak correlation is defined as r = 0–0.19, weak as r = 0.20–0.39, moderate as r = 0.40–0.59, strong as r = 0.60–0.79, and very strong as r = 0.80–1.00. Floor and ceiling effects were recorded to assess the coverage (the ability of an instrument to detect the full range of scores of a given measure in a patient population). The generally accepted benchmark for which floor and ceiling effects are considered significant ranges from 5% to 15% of participants scoring the minimum or maximum scores [17, 18]. Responsiveness was evaluated through recorded preoperative and postoperative PROMIS and legacy PROM scores. The established MCID values per PROMIS domain and legacy instrument, as well as the percentage of patients who achieved MCID, was also collected. MCID is the smallest change in scores on a given instrument that represents a clinically important difference to the patient [19]. In general, a smaller MCID is desirable, as that indicates that the change in score is real and not attributed to measurement error.

Studies that did not include analysis of the psychometric properties of PROMIS but reported on PROMIS as an outcome and predictive metric in TKA were also included. Additional data such as patient population, intervention, follow-up time, primary assessment, and main PROMIS findings were collected from each article.

Methodological quality and risk of bias assessment

Two authors (NC, PG) applied the Methodological Index for Non-Randomized Studies (MINORS) criteria to each study included in the systematic review to assess the methodological quality [20]. Cohen’s kappa values were computed to analyze inter-reviewer reliability for each item of the MINORS criteria and assess the degree of concurrence between the two blinded reviewers. The highest attainable score is 24, which denotes that the study under scrutiny is of good methodological quality.

Multiple steps were taken to avoid publication bias, such as including studies that reported both positive and negative results, applying the MINORS criteria to each included study.

Statistical analysis

Weighted averages for the following metrics were calculated: the correlation of PROMIS with legacy instruments (via Pearson and Spearmen correlation coefficients, in which absolute values of coefficients were used); floor and ceiling effects; preoperative, < 3 months postoperative, and ≥ 6 months postoperative PROMIS scores (to assess responsiveness); and MCID scores. The study sample size was used to assign weights. One-way ANOVA and student t-tests were used to determine statistically significant differences. All statistical analyses were carried out in Microsoft Excel Version 16.72, with a cutoff for statistical significance at p < 0.05. Meta-analysis of correlations (https://data-play3.shinyapps.io/Meta_corr/) software was used for meta-analysis, with application of a fixed-effects model to determine whether a statistically significant difference existed between the correlations of PROMIS to legacy instruments between studies.

Results

The preliminary database search yielded 184 studies. Once duplicates were removed, the remaining 122 studies were screened and 15 studies investigating PROMIS in 11,140 patients were ultimately included in this systematic review (Fig. 1; Table 1) [21,22,23,24,25,26,27,28,29,30,31,32,33,34,35].

Table 1 Inclusion and exclusion criteria of total knee arthroplasty studies encompassed in this review

Validity

Seven studies (n = 7011 patients) reported on PROMIS validity in TKA (Table 2) [21,22,23, 25, 27, 31, 34]. The weighted-average Pearson correlation coefficient was 0.62 [standard error (SE) = 0.06] and the weighted-average Spearman correlation comparing PROMIS domains with legacy PROMs in TKA patients was 0.59 (SE = 0.06) (Fig. 2). The direction of the correlation cannot be commented on as these were absolute value calculations. When evaluating individual PROMIS domains, the weighted-average correlation coefficient for PF was r = 0.79, r = 0.64 for PI, r = 0.30 for GMH, and r = 0.5 for GPH.

Table 2 Pearson and Spearman correlations between PROMIS and knee arthroplasty legacy instruments
Fig. 2
figure 2

Weighted-average strength of the correlation between PROMIS and legacy for Pearson and Spearman correlation procedures. Data bars represent mean ± 1 standard error

Floor and ceiling effects

Three studies reported on the floor and ceiling effects of PROMIS domains and legacy instruments (Table 3) [21, 23, 33]. PF had 0% floor and 0.13% ceiling effects, while PI had 8.5% floor and 0% ceiling effects. GPH had 0.1% floor and 0.2% ceiling effects, and GMH had 0.007% floor and 6.3% ceiling effects. Among PROMIS, KOOS-JR, and M-SANE there were no differences in weighted average floor effects [PROMIS: 0.03% (SE = 3.1) versus KOOS-JR: 0% (SE = 0.1) versus M-SANE: 0.01% (SE = 1.1); p = 0.25]. There were no differences in weighted ceiling effects between PROMIS and legacy instruments [PROMIS: 0.01% (SE = 0.7) versus KOOS-JR: 0.02% (SE = 1.4) versus M-SANE: 0.04% (SE = 3.5); p = 0.36] (Fig. 3).

Table 3 Floor and ceiling effects in PROMIS and legacy instruments
Fig. 3
figure 3

Results of floor (left) and ceiling (right) effects between PROMIS, KOOS-JR, and M-SANE. Data bars represent ± 1 standard error

PROMIS responsiveness

Six studies on 3949 patients reported preoperative and postoperative PROMIS scores (Table 4) [21, 23, 28,29,30, 33]. For GPH, GMH, PF, and PI domains, weighted-average preoperative, < 3 months postoperative, and ≥ 6 months postoperative scores were calculated (Fig. 4). GPH, GMH, and PF all increased from baseline to < 3 months postoperative. (Table 4). These differences were statistically significant for GPH at < 3 months postoperative (baseline to < 3 months postoperative: 38.6 ± 5.3 to 42.4 ± 5.2, p = 0.002) and ≥ 6 months (baseline to ≥ 6 months postoperative: 38.6 ± 5.3 to 44.6 ± 5.7, p = 0.002). The change from baseline to ≥ 6 months postoperative was statistically significant for PF (baseline to ≥ 6 months postoperative: 36.7 ± 5.4 to 42.8 ± 7.4, p = 0.001). PI scores significantly decreased from baseline to postoperative (baseline to ≥ 6 months postoperative: 64.0 ± 6.0 to 53.6 ± 9.6, p = 0.001) Likewise, depression scores decreased from baseline to postoperative, but this was not found to be significant (baseline to ≥ 6 months postoperative: 47.9 ± 9 to 43.3 ± 8.8, p = 0.08). Changes from preoperative to postoperative GMH scores were not found to be statistically significant at < 3 months postoperative (baseline to < 3 months postoperative: 47.0 ± 6.6 to 47.3 ± 5.7, p = 0.2) and ≥ 6 months postoperative (baseline to: 47.0 ± 6.6 to 47.9 ± 5.5, p = 0.1).

Table 4 Preoperative and postoperative PROMIS scores in total knee arthroplasty studies
Fig. 4
figure 4

Weighted-average percentage of patients reaching MCID in PROMIS GPH, GMH, PF, PI, and depression scores. Data bars represent mean ± 1 standard error

MCID

Five studies reported on the MCID and the percentage of patients reaching MCID with PROMIS domains (Table 5) [21, 24, 26, 32, 33]. The weighted-average percentage of patients achieving MCID was 59.1% for GPH, 26.0% for GMH, 52.7% for PF, 67.2% for PI, and 37.2% for depression.

Table 5 PROMIS and legacy instrument MCID values and percentage of patient achievement

Five studies on 3455 patients utilized PROMIS as an outcome metric (Table 6) [26, 28, 30, 31, 33]. These studies assessed outcomes such as the effectiveness of a certain implant, the effect of preoperative mental health on postoperative outcomes, and the feasibility of PROMIS in bundle-care payment improvement patients.

Table 6 PROMIS utilization as an outcome metric across total knee arthroplasty studies encompassed in this review

Meta-analysis

A meta-analysis was performed on seven studies reporting correlations; statistical heterogeneity was found among studies (I2 = 98%, p < 0.0001). Overall, the meta-analysis showed significant difference between correlations (p < 0.0001).

MINORS criteria

The mean MINORS score for all included studies was 19.3 ± 1.3 (range: 17 to 22) points (Table 1). All items assessed using the MINORS criteria demonstrated excellent inter-reviewer reliability with k-coefficient ≥ 0.7 in all items (Appendix 2 Table 9).

Discussion

This is the first systematic review on the psychometric properties and use of PROMIS in assessing TKA outcomes. PROMIS has a moderate-to-strong correlation with the legacy instruments KOOS-JR, KOOS-ADL, KOOS-ADL Sports, OKS, M-SANE, and Brief Resilience Score (BRS), establishing strong criterion validity for PROMIS in TKA patients. PF consistently had strong to very-strong correlations across all compared legacy instruments. GMH was the most variable, with correlations ranging from very weak (r = 0.15) to strong (r = 0.70). This variability within the PROMIS umbrella indicates that certain PROMIS domains may not be as valid for the evaluation of TKA outcomes as others. Our results suggest that the global PROMIS domains (GMH more so that GPH) are less consistently correlated with legacy instruments, while the specific domains (PF and PI) are more consistently strongly correlated. This is likely due to the nonspecific nature of the global domains and their associated broad question banks.

Our study demonstrated 0% floor and ceiling effects for the majority of PROMIS domains (Table 3), showcasing the ability of PROMIS instruments to differentiate between patients both severely and minimally affected by knee arthritis and subsequent TKA. There were no differences found between PROMIS metrics and legacy instruments on pooled analysis of floor and ceiling effects, implying that PROMIS measures are equally effective as KOOS-JR and M-SANE at demonstrating measurement accuracy. PF and PI had insignificant floor and ceiling effects, falling below the 5–15% cutoff range. Similarly, both global PROMIS domains had low floor and ceiling effects. These findings are consistent with the general orthopedic literature: in a cross-sectional study of 94 patients with general knee pain, there were no floor or ceiling effects found with PF, and minimal floor and ceiling effects with KOOS-JR (3.4% and 1.1%, respectively) [36]. Conversely, in the only study that examined PROMIS depression, depression was found to have large floor effects (Lawrie: 21% preoperatively and 38% postoperatively) [33]. The elevated floor effects may indicate that depression may have limited responsiveness in patients who have a lower severity of depression and that its use is limited.

Responsiveness results were variable across different forms of PROMIS. Poor responsiveness was found in GMH and depression. Scores within these domains may be prone to fluctuation, as an individual’s baseline mental health score may be more likely to oscillate than one’s physical function due to mood changes unrelated to TKA. Regarding the global PROMIS domains, responsiveness was more robust at later follow-up (> 6 months postoperative). Responsiveness of an outcome measurement is crucial, as the ability of that tool to discern change through the course of treatment directly augments the ability of that tool to predict outcomes. The ability of PROMIS to capture the assessment of a TKA patient’s overall health while also accurately detecting changes in their knee function further supports its use in this population.

Notably, GPH, PF, and PI had high percentages of patients reaching MCID, with PI most effectively capturing clinical improvement, as evidenced by the achievement of MCID. GMH and depression were less able to showcase clinical improvement, with only 26% and 37.2% of patients achieving MCID, respectively. No statistical differences were found between PROMIS domain and the ability of the legacy instrument to capture clinical improvement. MCID values for PROMIS domains in TKA patients are listed in Table 7, and can be used by clinicians to predict and counsel a patient on how much they are expected to improve after TKA. MCIDs are valued in patient outcomes research as they serve as a standard for treatment success. Reporting MCIDs with PROMIS scores can lead to a more meaningful interpretation of outcome scores.

There are limitations to our study. First, there was some heterogeneity among the PROMIS domains used in this study. Of the seven studies that evaluated PF and PI, four used PF and PI CAT, one used the short form, and two were unspecified. Yet despite this mild heterogeneity, scores from the various forms were comparable, likely due to the common question bank from which they were generated. Another limitation relates to the small number of patients in the analyses of correlation coefficients for some of the PROMIS domains, which could affect the potential clinical significance of the correlation coefficients. The correlation coefficient for PI to KOOS-JR was found to be 0.64, based on a sample size of 124 patients. Despite the strong correlation identified in this study, the small sample size may limit overall generalizability. Additionally, two-thirds of our included 15 studies were retrospective in nature. None used any blinding methods, which may have resulted in selection bias.

Conclusions

Our findings indicate that PROMIS forms are valid and reliable for use by knee arthroplasty surgeons and researchers to evaluate the clinical outcomes of patients undergoing TKA. Given the strengths of validity, responsiveness, ability to detect clinical improvement, and low floor and ceiling effects of both PROMIS PF and PI, we recommend those PROMIS domains be further utilized and studied in the TKA patient population, as they were found to be the most effective in this study. As a standardized method of PROM assessment, PROMIS can help create valid, reliable, and consistent interpretation of the TKA patient population across the orthopedic literature.