A composite measure for patient-reported outcomes in orthopedic care: design principles and validity checks

Background: The complex, multidimensional nature of healthcare quality makes provider and treatment decisions based on quality difficult. Patient-reported outcome (PRO) measures can enhance patient centricity and involvement. The proliferation of PRO measures, however, requires a simplification to improve comprehensibility. Composite measures can simplify complex data without sacrificing the underlying information.

Objective and methods: We propose a five-step development approach to combine different PRO into one composite measure (PRO-CM): (i) theoretical framework and metric selection, (ii) initial data analysis, (iii) rescaling, (iv) weighting and aggregation, and (v) sensitivity and uncertainty analysis. We evaluate different rescaling, weighting, and aggregation methods by utilizing data of 3145 hip and 2605 knee replacement patients, to identify the most advantageous development approach for a PRO-CM that reflects quality variations from a patient perspective.

Results: The comparison of different methods within steps (iii) and (iv) reveals the following methods as most advantageous: (iii) rescaling via z-score standardization and (iv) applying differential weights and additive aggregation. The resulting PRO-CM is most sensitive to variations in physical health. Changing weighting schemes impacts the PRO-CM most directly, while it proves more robust towards different rescaling and aggregation approaches.

Conclusion: Combining multiple PRO provides a holistic picture of patients’ health improvement. The PRO-CM can enhance patient understanding and simplify reporting and monitoring of PRO. However, the development methodology of a PRO-CM needs to be justified and transparent to ensure that it is comprehensible and replicable. This is essential to address the well-known problems associated with composites, such as misinterpretation and lack of trust.

Supplementary Information: The online version contains supplementary material available at 10.1007/s11136-023-03395-0.


Introduction
The complex, multidimensional nature of healthcare quality makes quality measurement and transparency as well as provider and treatment decisions difficult for patients [1-5]. Patient participation in healthcare decision making presupposes that patients can understand quality information, which requires suitable quality measurement and reporting instruments [3,6-9]. Patient-reported outcome measures (PROMs) are promising instruments that, in contrast to clinical indicators, measure patients' own assessment of their current health status and enhance patient engagement [1,4,10-13]. PROMs are used to determine patient-reported outcomes (PRO), which are results of longitudinal comparison of individual PROM-scores, i.e., the change in individual PROM-scores attributable to a particular treatment. Despite their potential, the growing number of PROMs makes it difficult to easily and comprehensively evaluate outcome quality [2,12,14-16]. Composite measures (CMs) can simplify complex, multidimensional data without sacrificing the underlying power of information [17-19].
A CM is a combination of two or more individual measures into one index, which captures multidimensional aspects that cannot be reflected by any of the individual measures alone [18]. In healthcare, CMs provide a holistic picture of healthcare quality and can enhance ease of interpretation and comparability [20-22]. Besides benchmarking the performance of hospitals or countries' health systems, CMs can facilitate monitoring recovery paths and outcome quality as well as enhance public accountability and quality transparency [23-26]. CMs also play an important role in the emerging value-based healthcare (VBHC) movement and allow researchers to better evaluate the results of clinical studies with several different PRO by a single outcome measure [27,28]. Due to their advantages, healthcare CMs have already been widely applied in many different areas with different purposes [21,22,29-34]. However, there are also important downsides and challenges with CMs, which are controversially discussed in the literature [6,17,25]. Poorly constructed or opaque CMs can be particularly alarming as they have the potential to mask poor quality or deceive those who use them to make important policy and treatment decisions.
It is thus essential that the development methodology is clear and transparent to ensure that the CM is comprehensible and replicable. The chosen methodology should be well justified and plausible and should represent the relevant quality dimensions without losing or disguising important information [6,35-37]. The development of CMs, however, is often controversial, and there is no gold-standard approach. Some guidelines for CM development are provided, e.g., by the OECD [35,37] or, in a healthcare context, by Shwartz et al. [19]. However, so far CMs are mostly used to aggregate clinical outcomes. Furthermore, there is still a lack of studies that put these guidelines into practice.
In the present study, we combine the different considerations of OECD and Shwartz et al. to develop a patient-reported outcome CM (PRO-CM) applicable in routine orthopedic care and clinical studies. We propose a five-step development approach and highlight the need for transparency and justification of decisions in each step. We evaluate advantages and disadvantages of different rescaling, weighting, and aggregation methods by utilizing PRO-data of primary hip and knee arthroplasty (PHA and PKA) patients. Due to the increasing case volume of hip and knee arthroplasty worldwide [38,39] and since PROMs are already widely used in this field [40], the orthopedic setting provides a good example for illustrating the development and benefits of a PRO-CM. Finally, we identify the most advantageous development approach for a multidimensional orthopedic PRO-CM that is transparent and replicable, combines all relevant sub-dimensions of PHA and PKA, and captures the relative differences and quality variations among these sub-dimensions. It is more sensitive to variations in the sub-dimensions that are most relevant for patients and partly compensates for poorer outcomes in one dimension.

Data
We use data from the PROMoting Quality study [41], which provides PRO-data of 3,145 PHA and 2,605 PKA patients of nine participating German hospitals between 2019 and 2021. Participants were adults undergoing an elective and primary hip or knee arthroplasty with prespecified surgery codes (including total and partial arthroplasties) between 2019 and 2020. Exclusion criteria were emergency and life-threatening cases, ASA classification 4-6, and patients without direct or indirect access to an e-mail account or without a relative supporting the survey PROM response. The randomized-controlled trial was registered at the German Clinical Trials Register under trial number DRKS00019916 and examined the benefit of PROM-based patient follow-up based on the ICHOM standard set for Hip and Knee Osteoarthritis with minor modifications [11,42]: EQ-5D-5L captures Health-related Quality of Life (HRQoL) [43], the Hip or Knee Osteoarthritis Outcome Score Physical Function Shortform (HOOS-PS or KOOS-PS) captures joint-associated problems and functionality [44], and analogue pain scales assess pain in hip (left and right), knee (left and right), and lower back [42]. PROMIS Depression Shortform (PROMIS-D-SF) and Fatigue Shortform (PROMIS-F-SF) are included to capture mental health [45]. For a detailed description of the PROM, see Appendix I.

Stepwise method for developing a composite measure
The study was preceded by a literature review on CM in general and in the healthcare context. The development approaches presented here are mainly based on current standards as provided by the OECD [35] and Shwartz et al. [19]. While we consider the OECD guidelines as a general toolkit for relevant technical and methodological issues (e.g., rescaling, weighting, and aggregation methods), the framework of Shwartz et al. provides relevant considerations in a healthcare context for creating hospital-level composites aggregating clinical outcomes. For the PRO-CM, we merge these considerations, adjust them to fit a patient-level orthopedic purpose and propose five PRO-CM development steps: (i) theoretical framework and metric selection, (ii) initial data analysis, (iii) rescaling, (iv) weighting and aggregation, and (v) sensitivity and uncertainty analysis [18,19,35]. Assessing risks and benefits of the different options we consider in steps (iii) and (iv), we select a priori the most advantageous option with respect to the data structure and theoretical framework (i.e., "Model 1") and compare the results to the other options (Models 2-5).

Theoretical framework and metric selection
The theoretical framework lays the foundation for a CM. It defines the quality construct (i.e., the phenomenon to be measured) and identifies its sub-dimensions [18,35,37]. Relevant quality indicators are identified so as to conform to the quality construct [46]. We select validated and well-established generic and disease-specific PROMs that align to the sub-dimensions of the quality construct.

Initial data analysis
We examine the PRO individually to analyze the underlying data structure (e.g., outliers and scale), which guides subsequent rescaling and weighting decisions. We plot descriptive statistics and compute Spearman's rank correlations to check for collinearity [19,24,35]. Following similar studies [24,37,46], we consider merging indicators whose correlation exceeds r = 0.7 into one variable to avoid redundancy or preponderance of one particular dimension [18,36].
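The collinearity check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's analysis code: the function names (`average_ranks`, `spearman`, `redundant_pairs`) are hypothetical, the six patients' scores are invented, and comparing the absolute correlation against the threshold is our assumption.

```python
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def redundant_pairs(pro, threshold=0.7):
    """List indicator pairs whose absolute Spearman correlation exceeds the threshold."""
    flagged = []
    for a, b in combinations(pro, 2):
        rho = spearman(pro[a], pro[b])
        if abs(rho) > threshold:  # assumption: sign of the correlation is ignored
            flagged.append((a, b, round(rho, 2)))
    return flagged


# Invented improvement scores for six patients (illustration only, not study data).
toy_pro = {
    "EQ-5D-5L": [0.10, 0.35, 0.20, 0.50, 0.05, 0.40],
    "HOOS-PS": [12.0, 30.0, 8.0, 25.0, 20.0, 15.0],
    "Pain-OJ": [3.0, 5.0, 2.0, 6.0, 1.0, 4.0],
}
flagged = redundant_pairs(toy_pro)
print(flagged)
```

Pairs returned by `redundant_pairs` would be candidates for merging into one variable; an empty list means every indicator is kept separately.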

Rescaling
When indicators have different units of scale, rescaling on a common scale is required to allow comparison and aggregation. Different methods may produce different CM [19,35] and it is not clear which method is favorable. Following Shwartz et al., we compare the two most widely used approaches for healthcare CM, i.e., z-score standardization and min-max normalization [19]. A priori we use z-score standardization (Model 1), as it preserves the relative differences, and extreme values and outliers do not distort the mean but are recognized as exceptional performance. The z-score standardization transforms all individual measures onto a dimensionless scale with mean = 0 and standard deviation (SD) = 1. A z-score expresses how many SD an individual's outcome is above or below the average of the population and is calculated as:

$$z = \frac{x - \mu}{\sigma}$$

where $x$ is the observed PRO of an individual, $\mu$ is the PRO-mean, and $\sigma$ is the SD. See Appendix II for an exemplary rescaling calculation.
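As a minimal sketch of the z-score standardization described above: the helper name `z_scores` and the eight improvement values are invented for illustration, and using the population SD (rather than the sample SD) is an assumption, since the text does not specify which one is used.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Rescale values to mean 0 and SD 1 (population SD, an assumption)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("z-scores are undefined when all values are identical")
    return [(x - mu) / sigma for x in values]


# Invented PRO improvements for eight patients (not study data).
improvement = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = z_scores(improvement)
print(z)
```

Each resulting value states how many SD a patient's improvement lies above (positive) or below (negative) the cohort average.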

Weighting and aggregation
Weights determine the contribution of each PRO to the CM [19,35]. We consider three different weighting options: equal weighting (EW), differential weighting (DW), and factor analysis (FA). The literature suggests that EW should be applied unless there is strong justification for DW (e.g., not all sub-dimensions have the same importance in the quality construct) [19,47]. EW assigns the same weight to all PRO, yielding a CM to which all PROs contribute equally. However, since orthopedic care primarily addresses joint functionality and HRQoL [48], we select a priori DW for Model 1, where physical dimensions and HRQoL receive higher weights than mental dimensions. Ideally, DW perfectly reflects patient preferences, which could be determined in a patient survey [19]. Since this exceeds the scope of this study, we approximate importance by each PROM-score's improvement: the more a PROM-score has improved 12 months post-surgery, the higher its importance. The corresponding weights are determined by measuring the improvement of each sub-dimension in standard deviation units and calculating its proportion of the total sum of all improvements. Appendix III contains more detailed considerations of the different weighting methods. Aggregation combines the weighted individual PRO into the final PRO-CM. We consider a compensatory and a non-compensatory aggregation method. A priori we use additive aggregation (Model 1), a compensatory method where worse outcomes can be counterbalanced by better outcomes. Since both surgery and recovery process differ between PKA and PHA, two treatment-specific composites are generated. They are computed as:

$$CM_i = \sum_{j} w_j I_j$$

where $CM_i$ is the CM for treatment $i$ and $w_j$ is the weight of the $j$th rescaled PRO $I_j$.
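The proportional weighting and the additive aggregation described above can be sketched as follows. The function names are hypothetical, and the improvements in SD units and the patient's z-scores are invented values, not results from the study; how exactly the improvement is expressed in SD units is not specified here, so the sketch simply takes those values as given.

```python
def proportional_weights(improvement_sd):
    """Weight of each PRO = its improvement (in SD units) / sum of all improvements."""
    total = sum(improvement_sd.values())
    return {name: value / total for name, value in improvement_sd.items()}


def additive_composite(rescaled, weights):
    """Compensatory aggregation: weighted sum of the rescaled PRO of one patient."""
    return sum(weights[name] * rescaled[name] for name in weights)


# Invented mean improvements in SD units for a hip cohort (illustration only).
improvement_sd = {
    "HOOS-PS": 1.5,
    "Pain-OJ": 1.5,
    "EQ-5D-5L": 1.0,
    "PROMIS-D-SF": 0.5,
    "PROMIS-F-SF": 0.5,
}
weights = proportional_weights(improvement_sd)

# Invented z-scores of a single patient.
patient = {
    "HOOS-PS": 1.0,
    "Pain-OJ": 0.5,
    "EQ-5D-5L": 0.0,
    "PROMIS-D-SF": -1.0,
    "PROMIS-F-SF": -0.5,
}
cm = additive_composite(patient, weights)
print(weights, cm)
```

In this toy case the below-average mental health scores are outweighed by the above-average physical scores, which is the compensatory behavior of additive aggregation.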

Sensitivity and uncertainty analysis
In the sensitivity analysis, we calculate Pearson's correlations between the resulting CM and the individual PRO to determine the PRO-CM's sensitivity to quality variations among the sub-dimensions, i.e., the responsiveness of the PRO-CM to changes in its sub-components. In the uncertainty analysis, we compare the results of Models 1-5 to examine the impact of decisions in the chosen development approach and to analyze the associated uncertainties. For this, we convert the results of each model, in each of which we alter one decision, into patient rankings to illustrate the impact of altering a decision in the development process on the final result of a patient. The patient with the highest CM value is assigned rank 1, the second highest rank 2, and so on. Patient rankings of our selected approach (Model 1) are compared to four alternative models (see Table 1). The greater the scatter between two compared models, i.e., the more the rankings of patients change depending on the model, the greater the impact of the corresponding changed development method. Models 2-5 are constructed as follows:

Model 2 Rescaling PRO with the min-max normalization method. Min-max normalization transforms the data's original range to a common range from 0 to 1. It is calculated as:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

where $x$ is a PRO of an individual, $\min(x)$ is the minimum PRO, and $\max(x)$ is the maximum PRO. Min-max normalization is more sensitive to outliers and can distort relative differences and mean values. However, due to a clearly defined boundary range, it has an intuitive appeal and strong interpretative power [19]. Also, when PROs are within a small interval, the range can be expanded to increase the effect on the CM [35].
Model 3 Applying EW where all PROs contribute to the CM with the same importance. It is considered as the easiest strategy to implement, and it is not subject to any special interests and easily replicable by others [36,49].
Model 4 Using FA to derive weights statistically. The weight of each PRO is relative to the amount of variance it has in common with the other PRO. This approach is resistant to potentially intentional manipulation and is often applied when a large number of indicators exist [50-52].
Model 5 Using geometric aggregation, a non-compensatory multiplicative approach that prevents poor outcomes from being compensated by good outcomes. It is computed as:

$$CM_i = \prod_{j} I_j^{w_j}$$

where $CM_i$ is the CM for treatment $i$ and $w_j$ is the weight of the $j$th rescaled PRO $I_j$ [35,49,53].
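The difference between compensatory and non-compensatory aggregation, and the conversion of CM values into patient rankings, can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the three patients and their min-max-scaled scores are invented, the weights are the Model 1 weights reported in the results, and ties in the ranking are broken by input order for simplicity.

```python
from math import prod

# Differential weights as reported for Model 1 (hip variant with HOOS-PS).
WEIGHTS = {
    "HOOS-PS": 0.3,
    "Pain-OJ": 0.3,
    "EQ-5D-5L": 0.2,
    "PROMIS-D-SF": 0.1,
    "PROMIS-F-SF": 0.1,
}


def additive(scores, weights):
    """Compensatory aggregation: weighted sum."""
    return sum(weights[k] * scores[k] for k in weights)


def geometric(scores, weights):
    """Non-compensatory aggregation: weighted product (needs non-negative scores)."""
    return prod(scores[k] ** weights[k] for k in weights)


def rank_patients(cm_by_patient):
    """Rank 1 goes to the highest CM value; ties keep input order (a simplification)."""
    ordered = sorted(cm_by_patient, key=cm_by_patient.get, reverse=True)
    return {patient: rank for rank, patient in enumerate(ordered, start=1)}


# Invented min-max-scaled PRO (0 to 1) for three patients, not study data.
patients = {
    "A": {"HOOS-PS": 0.9, "Pain-OJ": 0.9, "EQ-5D-5L": 0.8,
          "PROMIS-D-SF": 0.0, "PROMIS-F-SF": 0.6},
    "B": {"HOOS-PS": 0.6, "Pain-OJ": 0.6, "EQ-5D-5L": 0.6,
          "PROMIS-D-SF": 0.6, "PROMIS-F-SF": 0.6},
    "C": {"HOOS-PS": 0.4, "Pain-OJ": 0.5, "EQ-5D-5L": 0.5,
          "PROMIS-D-SF": 0.7, "PROMIS-F-SF": 0.7},
}

additive_ranks = rank_patients({p: additive(s, WEIGHTS) for p, s in patients.items()})
geometric_ranks = rank_patients({p: geometric(s, WEIGHTS) for p, s in patients.items()})
print(additive_ranks)
print(geometric_ranks)
```

Patient A ranks first under additive aggregation but last under geometric aggregation, because a single rescaled value of 0 forces the product to 0 regardless of the other dimensions.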

Theoretical framework and metric selection
The PRO-CM is specific to PHA and PKA. It aims to reflect a multi-faceted picture of post-arthroplasty improvement in health as reported by patients and hence does not include clinical outcomes. Improvement in health (i.e., the PRO) is defined as the PROM-score difference between hospital admission (HA) and the 12-month follow-up (12FU). To capture all patient-relevant aspects of post-arthroplasty improvement, we outline three main sub-dimensions of the PRO-CM: general HRQoL (EQ-5D-5L) [43,54], physical health (HOOS-PS, KOOS-PS, pain scales) [42,44,54], and mental health (PROMIS-D-SF, PROMIS-F-SF). Practical experience of healthcare experts and the literature suggest that, although arthroplasty primarily addresses physical health, mental health also has a significant influence on patient recovery and is not sufficiently covered by EQ-5D-5L [41,45,48,55,56]. See Table 1 in Appendix I (Electronic Supplementary Material) for the PRO-CM dimensions and their sub-components.

Initial data analysis
Table 2 shows summary statistics for hip and knee PROM-scores at HA and 12FU. EQ-5D-5L has a mean of 0.62 [...]. While Pain-OJ shows relatively high improvement for PKA (PHA) from 6.8 (6.5) at HA to 1.9 (1.1) at 12FU with a possible range from 0 to 10, Pain-Other is at a comparatively low level at HA and barely changes during recovery. Since neither PKA nor PHA appears to influence Pain-Other, this score is excluded. Compared to PKA patients, PHA patients improve more during recovery in every dimension, as they report worse PROM-scores at HA and better PROM-scores at 12FU. This is most evident in physical health, but also visible in HRQoL and mental health. Outliers exist for all PROM, with the most extreme values for KOOS-PS (HOOS-PS). We found correlations between the PROM, albeit weak ones. EQ-5D-5L, which comprises mental health and pain sub-dimensions, is only weakly correlated (r ≤ 0.5) with mental health and pain. Since none of the correlations is > 0.7, each PROM has sufficient independent explanatory power for the purposes of this study.

Rescaling
As a third step, we rescale via z-score standardization and compare it to min-max normalization (see Table 3). After z-score standardization, each PRO has mean = 0 and SD = 1. For equal directionality and an intuitive interpretation, each PRO is rescaled so that a higher value indicates more improvement. Values above 0 indicate more improvement than average in units of SD and vice versa. Upper and lower bounds can (theoretically) take infinite values, with values beyond ±3 usually considered to be outliers. Min-max normalization transforms all PRO onto the same scale from 0 to 1 (Model 2). Since negative outliers in particular are present, most normalized PROs have mean values greater than 0.5, which illustrates how min-max normalization is affected by outliers. Caution must be exercised in interpretation, as the worst PRO defines the lower boundary and a normalized value of 0 can indicate PRO-deterioration.
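The outlier effect of min-max normalization noted above can be reproduced with a small sketch. The helper name `min_max` and the six improvement values are invented for illustration; one patient is given a strong deterioration to act as a negative outlier.

```python
from statistics import mean


def min_max(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max normalization is undefined for a constant indicator")
    return [(x - lo) / (hi - lo) for x in values]


# Invented improvements (positive = better); the first patient deteriorated strongly.
improvement = [-8.0, 1.0, 2.0, 2.0, 3.0, 4.0]
scaled = min_max(improvement)

print(scaled)
print(mean(scaled))  # pulled above 0.5 by the single negative outlier
```

The normalized value of 0 belongs to the patient whose score worsened, so a 0 on this scale marks the worst observed change, not "no improvement".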

Weighting and aggregation
The initial data analysis shows that physical health dimensions improve the most, followed by HRQoL and mental health dimensions. Consequently, for Model 1, estimated weights are 0.3 for each physical health sub-dimension, 0.2 for HRQoL, and 0.1 for each mental health sub-dimension [for a more detailed description, see Table 1 in Appendix III (Electronic Supplementary Material)]. This is in line with our assumption that physical health should be assigned more importance than mental health. In contrast, EW assigns the same weight to each PRO, i.e., 0.2 (Model 3), while FA (Model 4) derives the weights statistically and assigns more weight to mental health. Figure 1 shows the boxplots of the five resulting PRO-CM models after aggregation of the weighted indicators.
The PRO-CM in Model 1 has a mean of 0 and SD of 0.73 for both PHA and PKA patients. Like the z-scores, it can theoretically take infinite values. Patients take values between ±2, while PHA patients show more negative outliers with values below −3. Model 2 yields a CM with a mean of 0.57 (0.60), SD of 0.09 (0.1), and a range from 0.25 to 0.89 (0.1 to 0.95) for PKA (PHA) patients. Model 3 shows a similar mean and SD as Model 1 but slightly contracts the range for PKA patients while expanding the range for PHA patients. Model 4 in general yields a higher SD, a larger range, and more extreme outliers for PKA and PHA patients, with both having a mean of 0. Lastly, Model 5 has a mean of 0.56 (0.59) and SD of 0.09 (0.1) for PKA (PHA) patients, with minimum values of 0 where at least one PRO was equal to 0.

Sensitivity and uncertainty analysis
Figure 1 shows the PRO-CM (Model 1) and the four alternative development models: Model 1: z-score standardization, differential weights, additive aggregation; Model 2: min-max normalization, differential weights, additive aggregation; Model 3: z-score standardization, equal weights, additive aggregation; Model 4: z-score standardization, factor analysis, additive aggregation; Model 5: min-max normalization, differential weights, geometric aggregation. The left box of each panel shows results for PHA, the right side for PKA patients. Z-score standardized CM have a scale from −5 to 4, while min-max normalized CM have a scale from 0 to 1.

The sensitivity analysis shows that, although in Model 1 the weights for pain-OJ and KOOS-PS (HOOS-PS) are equal, there are minimal differences in the sensitivity of the PRO-CM to variation in these PRO. Correlations (see Table 4) [...]. This is similar in Model 2. However, the min-max normalization leads to pain-OJ becoming the largest contributor to changes in the PRO-CM, whereas it becomes somewhat less sensitive to KOOS-PS (but remains stable for HOOS-PS). Yet, this CM remains most sensitive to changes in physical health dimensions, followed by changes in HRQoL and finally in mental health dimensions. The correlations are more balanced in Model 3, with slightly higher sensitivity to changes in KOOS-PS (HOOS-PS) and HRQoL than to changes in pain-OJ and mental health dimensions. Model 4 results in a CM that is most sensitive to changes in HRQoL and KOOS-PS (HOOS-PS). Mental health dimensions gain importance, while pain-OJ has the weakest correlation.
Lastly, Model 5 shows very similar results to the additive approach in Model 1.
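How the sensitivity analysis relates weights to correlations can be illustrated with a deliberately small example. Everything here is an assumption made for illustration: the two indicator names, the four invented patients (chosen so that the indicators are uncorrelated with equal variance), and the weights of 0.75 and 0.25; only the use of Pearson's correlation between the CM and each component follows the method described in the text.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Invented z-scores of four patients on two hypothetical indicators.
indicators = {
    "physical": [1.0, -1.0, 1.0, -1.0],
    "mental": [1.0, 1.0, -1.0, -1.0],
}
weights = {"physical": 0.75, "mental": 0.25}  # hypothetical weights

# Additive composite per patient.
n_patients = len(indicators["physical"])
composite = [
    sum(weights[name] * indicators[name][p] for name in weights)
    for p in range(n_patients)
]

# Sensitivity: correlation of the composite with each of its components.
sensitivity = {name: pearson(composite, values) for name, values in indicators.items()}
print(sensitivity)
```

With uncorrelated, equally scaled indicators the composite tracks the more heavily weighted component more closely. In real data the components are correlated with each other, which is one reason why equal weights need not produce equal sensitivities.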
Results of the uncertainty analysis are illustrated in Fig. 2, which shows the relation between the PRO-CM in Model 1 and the four alternative models. The y-axis represents Model 1 patient rankings and the x-axis the patient rankings of the respective alternative model.

Figure 2 shows the relation between the PRO-CM (Model 1) and the four alternative development models: Model 1: z-score standardization, differential weights, additive aggregation; Model 2: min-max normalization, differential weights, additive aggregation; Model 3: z-score standardization, equal weights, additive aggregation; Model 4: z-score standardization, factor analysis, additive aggregation; Model 5: min-max normalization, differential weights, geometric aggregation.

Discussion
In this study, we have proposed a development approach for a patient-centered PRO-CM for PKA and PHA patients and compared it to four alternative models. The PRO-CM is robust towards different aggregation and rescaling methods, while applying different weighting schemes can have a greater impact on the final result. We consider the approach with z-scores, DW, and additive aggregation as most advantageous with respect to the data properties and the theoretical framework (Model 1). Z-scores preserve the relative differences without distorting the mean, and extreme values are acknowledged as exceptional performance, while min-max normalization (Model 2) is heavily affected by outliers [35]. DW assigns more importance to physical health dimensions, which play an important role in PKA and PHA recovery [48]. EW (Model 3) should be applied when there is no strong justification to apply DW, while FA (Model 4) is more suitable when a large number of different indicators are combined into one score [50,52]. Additive aggregation allows, to some extent, poor outcomes to be compensated by good outcomes. In some cases, depressive symptoms were already at a low level and thus an improvement of 0 took place. With non-compensatory aggregation (Model 5), this would lead to a final CM value of 0 despite a very large improvement in physical dimensions. As shown in the sensitivity analysis, the PRO-CM is capable of measuring relevant quality variations among sub-dimensions. The information from the individual PRO is still contained, but for outcome comparisons, only one metric must be considered instead of many different metrics. The PRO-CM can therefore empower patients, as it simplifies the monitoring of their recovery and enables them to make meaningful provider and treatment choices through enhanced comprehensibility [21,23]. Physicians can track their patients' recovery and quickly respond to health deteriorations with treatment adjustments [25,26].
The PRO-CM is also suitable for public reporting, since it facilitates assessing and ranking provider performance [2,3]. Reducing the outcome side of any cost-benefit consideration to one multidimensional metric might also aid health policy decisions, whether to calculate and present the cost-effectiveness of new forms of treatment or to determine patient value in the emerging VBHC considerations [27,28,58].
As with any CM, there are some specific and some more general limitations [6]. First, since z-scores have no clear boundaries, interpretation of z-score-based CM is difficult and not intuitive. Interpretability and comprehensibility can be enhanced by transforming the PRO-CM, e.g., to a scale from 0 to 100 (T-score transformation). Other possible approaches, such as ranking or 5-star classification, have been excluded in advance, as these methods entail a loss of information [19,35]. However, intuitive visualization formats are highly relevant for the presentation of health data, such as the PRO-CM, and need to be discussed in a separate study [57]. Next, ideally DW perfectly reflects the preferences of patients [19]. Approximating preferences from PRO is a strong assumption and is certainly not the same for all patients. However, without knowing the true preferences, it is difficult to evaluate otherwise. Further, in this study, a complete dataset without missing values from a clinical study was used. However, in most datasets, missing values are present for which appropriate imputation methods must be applied to avoid selection bias [35]. Lastly, we illustrated the benefits of a PRO-CM with available data from the PROMoting Quality study. For broad application and realizing full potential, cross-clinic PRO-data must be available nationwide. This underlines the urgency of advancing broader PRO-measurement and usage along the patient pathway, which, at least in Germany, is still in its infancy [5]. As is, the PRO-CM developed here will primarily be applied in the evaluation of clinical trials.
Generally, opaque construction methods or individual components of poor quality can cause misinterpretation and, hence, mislead patients or trigger overly simplistic treatment, management, or policy decisions [19,24]. When the construction methodology and its robustness are not transparently displayed, CMs can easily and intentionally be skewed [6]. They can be misused for individual goals and purposes if intentionally formed for specific desired policies. A CM can disguise very poor performance in one dimension by better performance in another and, hence, complicate the task of making targeted interventions to improve individual dimensions [6,17]. Since a specific weighting of the underlying indicators is applied, conflicts might arise with the differing preferences of patients and admitting physicians [3,59]. Although the threats and problems are widely known, CMs are often presented without going into more detail about the development process [6]. In this study, we addressed these problems and enabled replicability by justifying each step in the development.

Conclusion
We provide a transparent, stepwise development approach for a multidimensional PRO-CM that can effectively capture quality variations in orthopedic surgery. Combining multiple PRO provides a simplified but holistic picture of patients' health status, while a single PRO only provides information about a specific dimension. By reducing information overload, using a PRO-CM can enhance the benefits of quality transparency. However, to avoid misleading policy, treatment, or provider decisions, the development methodology of a PRO-CM, as presented here, needs to be justified and transparent to ensure that the composite is comprehensible and replicable. Only in this way can the known problems of CM be counteracted and their full potential unfolded, which should serve one thing above all else: the promotion of quality in healthcare.