Introduction

Patients presenting with ankle fractures are a common sight in the emergency department. A study demonstrated that approximately one in ten fractures sustained by patients older than 11 years is an ankle fracture [1]. An epidemiological study on ankle fractures covering the entire population of the United States estimated 673,214 cases over a period of five years, giving an incidence rate of 4.22/10,000 person-years [2]. Ankle fractures occur at all ages and in both genders, but with a bimodal distribution: a first peak in young men and a second peak in older women [1]. The link between an increased risk of ankle fractures in the elderly population and reduced bone mineral density has been established [3], suggesting that ankle fractures in older women may serve as a marker of fragility. With increases in life expectancy, the frequency of fragility ankle fractures is likely to rise in the future [4]. Presumably, this will have implications for the management of ankle fractures, considering the challenging nature of fragility fractures and the increasing complexity of patients’ clinical status as they age [5, 6]. With such a heterogeneous patient population and an enhanced focus on patient-specific treatment, treatment approaches also differ widely. The estimated cost of surgically treated ankle fractures was $8688–20,414 per patient (2016 USD), with a mean duration of unemployment of 53–90 days [7]. Alongside these developments in orthopedic surgery, there is a need for more accurate outcome measures, reflected in the increased use of patient-reported outcome measures (PROMs) in clinical and research settings over the last decade [8, 9].

A patient-reported outcome (PRO) is defined as “any report of the status of a patient’s health condition that comes directly from the patient without interpretation of the patient’s response by a clinician or anyone else”, and PROMs are the instruments used to measure PROs [10]. The measurement properties of an instrument provide information on its validity, reliability and responsiveness in the context of use, and content validity is considered the most important aspect [11]. A recent review identified the Olerud-Molander Ankle Score (OMAS) as the most commonly used primary outcome in clinical trials of patients with ankle fractures [12]. The American Orthopedic Foot and Ankle Score (AOFAS), which is considered a partially patient-reported outcome measure, was the fourth most commonly used outcome score for ankle fracture patients. Other reviews found that the AOFAS was the most commonly used instrument in foot and ankle disorders [13, 14], despite repeated concerns about its measurement properties [15,16,17]. However, the quality of an instrument relies on its measurement properties, which should be the main consideration when choosing an outcome measure for research and clinical use [9]. A recommendation on which PROM to use in patients with ankle fractures, based on current evidence on measurement properties, is warranted.

This systematic review assesses the evidence for the measurement properties and interpretability of PROMs used in the evaluation of adult patients with ankle fractures and adheres to the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) guidelines [11, 18, 19]. It addresses the limitations of previously published systematic reviews [20, 21] by including validation studies of all PROMs and studies in populations composed mainly of ankle fracture patients. This ensures an adequate representation of the target population and provides a more complete overview of the PROMs validated for use in this context.

Methods

Protocol and registration

The reporting of this review followed the checklist provided in the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) statement [22, 23]. The protocol was registered in the International Prospective Register of Systematic Reviews (PROSPERO) (registration number: CRD42019122800).

Eligibility criteria

Studies assessing the measurement properties of PROMs in adult patients with ankle fractures, classified as Arbeitsgemeinschaft für Osteosynthesefragen/Orthopaedic Trauma Association (AO/OTA) type 44 [24] and including medial malleolar fractures, were selected for this systematic review. Studies were eligible if at least 50% of the study population consisted of patients with ankle fractures.

The exclusion criteria were as follows: (1) articles in languages other than English or a Scandinavian language; (2) validation of a PROM against a non-PROM instrument, as these studies provide only indirect information on the measurement properties; and (3) proxy-reported PROMs, as these were considered observer-reported outcomes [10].

Data sources and search strategy

A literature search was performed in Medline, EMBASE and CINAHL from the inception of the databases to the 6th of July 2021. Three filters were applied: (1) a PROM-inclusion filter developed by the University of Oxford [25], (2) a validated sensitive search filter for measurement properties by Terwee et al. [26] and (3) an age filter to exclude results indexed with child and adolescent age groups only. A separate search in Google Scholar was performed with the following search phrase: “ankle fracture” validation “patient reported outcomes” “measurement properties”. The search strategy was devised in collaboration with expert research librarians, and details are presented in Online Resource 1.

Selection process

The review team consisted of four reviewers. The results of the search were uploaded to Covidence [27]. All titles and abstracts were screened for potential eligibility by two reviewers independently. Disagreements were discussed between the two reviewers, and if doubt remained, the full text was retrieved. The full text of all potentially eligible abstracts was retrieved and again independently screened by two reviewers. Disagreements were discussed, and if consensus was not achieved, a third reviewer was consulted.

The initial screening also covered PROMs used in a broader fracture population. Two reviewers then independently performed a final screening of the included articles to retain only the studies meeting the eligibility criteria for the ankle fracture review.

The first author screened the reference lists of the included articles for potentially eligible studies.

Data extraction

The extracted outcome variables were (1) content validity, including PROM development; (2) structural validity; (3) internal consistency; (4) cross-cultural validity/measurement invariance; (5) reliability; (6) measurement error; (7) criterion validity; (8) hypothesis testing for construct validity; (9) responsiveness; and (10) interpretability.

Assessing the methodological quality of the studies

The COSMIN Risk of Bias checklist [19] was applied to assess the methodological quality of the studies. The checklist contains questions for each measurement property, and each question was given a rating of very good, adequate, doubtful or inadequate. The overall rating for each measurement property per study followed the “worst score counts” principle.
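To make the principle concrete, the following minimal sketch (our illustration, not part of the COSMIN toolset) encodes the four-point rating scale and returns the lowest rating as the overall score; the example ratings are hypothetical.

```python
# "Worst score counts": the overall methodological rating of a study on
# one measurement property is the lowest rating among the checklist items.
RATING_ORDER = ["inadequate", "doubtful", "adequate", "very good"]

def overall_rating(item_ratings: list[str]) -> str:
    """Return the worst (lowest-ranked) rating among the checklist items."""
    return min(item_ratings, key=RATING_ORDER.index)

# One doubtful item pulls the whole property down to "doubtful".
print(overall_rating(["very good", "adequate", "doubtful"]))  # doubtful
```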

Ratings of PROM development and content validity

PROM development is not considered a measurement property but was taken into account in the assessment of content validity. It consists of (1) PROM design, which covers concept elicitation and item generation, and (2) testing of the new PROM, which refers to a cognitive interview or pilot study. A prerequisite for rating the results of PROM development against the criteria for good measurement properties was that the methodological quality was not rated inadequate.

The evaluation of content validity covered three aspects: (1) relevance, (2) comprehensibility and (3) comprehensiveness. For translations, only comprehensibility was assessed. Each aspect was rated as sufficient, insufficient or indeterminate. PROMs whose development phase included the target population of the current review were also given a content validity rating by the reviewers.

The results from the development study, content validity studies and reviewers’ ratings were summarized, and an overall rating of sufficient, insufficient or inconsistent was obtained based on the criteria for good content validity [11].

Rating of the remaining measurement properties

The remaining measurement properties were assessed according to the COSMIN criteria for good measurement properties [18], resulting in a rating of sufficient, insufficient or indeterminate per study. Subsequently, the results from all studies on each measurement property were summarized and again rated against the COSMIN criteria to yield an overall rating of sufficient, insufficient, inconsistent or indeterminate. For the assessment of methodological quality, twenty percent of the included articles were randomly selected for independent assessment by two reviewers. Any disagreements or difficulties in rating were discussed to achieve consensus; if consensus was not reached, a third reviewer was consulted.

The review team agreed that there is no gold standard for the evaluation of construct validity, except when comparing a shortened version of an instrument against its original version [28]. Rather, hypotheses were formulated for the assessment of construct validity. As there was no limit on which PROMs could be included in this review, it was not feasible to define hypotheses for every possible scenario a priori. Instead, threshold categories for correlations and a basic set of hypotheses were constructed (Online Resource 2) [29]: instruments measuring (1) the same construct were expected to show at least moderate to high correlation (r > 0.6), (2) related constructs to show moderate correlation (0.3 < r < 0.7), and (3) weakly related constructs to show weak to moderate correlation (0.2 < r < 0.4). More specific hypotheses were formulated throughout the review, with the expected direction and magnitude of the correlation depending on the construct of each instrument (Online Resource 3).
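The sketch below illustrates, purely schematically, how such threshold categories can be checked against an observed correlation; the relation labels mirror the three categories above, while the function name and the example value are our own, and the review itself applied its hypotheses qualitatively.

```python
# A priori threshold categories for construct-validity hypotheses
# (Online Resource 2). A hypothesis counts as confirmed when the
# observed correlation falls within the expected range; the direction
# of the correlation is handled by the specific hypotheses themselves.
EXPECTED_RANGES = {
    "same construct":           (0.6, 1.0),  # at least moderate to high
    "related construct":        (0.3, 0.7),  # moderate
    "weakly related construct": (0.2, 0.4),  # weak to moderate
}

def hypothesis_confirmed(relation: str, observed_r: float) -> bool:
    lo, hi = EXPECTED_RANGES[relation]
    return lo < abs(observed_r) <= hi

# Hypothetical example: two ankle-function PROMs, observed r = 0.72.
print(hypothesis_confirmed("same construct", 0.72))  # True
```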

A similar approach was used in the assessment of responsiveness, but hypotheses were formed based on the expected correlation between the change scores of the instruments. The threshold categories for correlation were set lower for change scores than for scores at a single time point [30]. When the comparator instrument measures the same construct as the instrument under study, the correlation was expected to be high (r ≥ 0.5). If the comparator instrument measures a related construct, the correlation was expected to be moderate (0.3 < r < 0.5). For external measures with a dichotomous variable, an area under the curve (AUC) of at least 0.7 indicated a sufficient ability of the instrument to discriminate between patients who did and did not improve according to the external measure of change.
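For the dichotomous case, the AUC equals the probability that a randomly chosen improved patient has a larger change score than a randomly chosen non-improved patient (the Mann-Whitney interpretation). A self-contained sketch of the 0.7 criterion, with hypothetical change scores:

```python
# AUC of change scores against a dichotomous external anchor,
# computed as the Mann-Whitney win proportion (ties count as 0.5).
def auc_change_scores(improved: list[float], not_improved: list[float]) -> float:
    pairs = [(i, n) for i in improved for n in not_improved]
    wins = sum(1.0 if i > n else 0.5 if i == n else 0.0 for i, n in pairs)
    return wins / len(pairs)

# Hypothetical change scores on a 0-100 PROM:
auc = auc_change_scores([25, 30, 18, 40, 22], [5, 12, 8, 20])
print(f"AUC = {auc:.2f}, sufficient: {auc >= 0.7}")  # AUC = 0.95, sufficient: True
```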

Interpretability

Interpretability is not considered a measurement property but refers to “the degree to which one can assign qualitative meaning” to a PROM score or change in PROM score [31]; it serves as additional information when choosing an instrument. Data on the distribution of scores, rates of missing items, floor/ceiling effects and the minimal important change (MIC) were extracted.

Quality of evidence

The modified Grading of Recommendations, Assessment, Development and Evaluation (GRADE) approach [18, 32] was applied to the summarized results to grade the quality of evidence, which expresses the level of certainty in the summarized results. Each measurement property received a grading of high, moderate, low or very low depending on four factors: (1) risk of bias; (2) inconsistency in the results across studies; (3) imprecision, referring to the total sample size; and (4) indirectness, i.e., evidence derived from a different population or context of use.

Recommendations

PROMs in category A are recommended for use in the evaluation of patients with ankle fractures. These PROMs show evidence for sufficient content validity and at least low quality evidence for sufficient internal consistency. If there is high quality evidence for an insufficient measurement property, the PROM is not recommended for use and is placed in category C. The remaining PROMs are placed in category B; these may become recommended once further validation provides evidence of sufficient measurement properties [18].
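A schematic rendering of this decision rule is shown below; the encoding of the evidence gradings is our own simplification, as the review applies the COSMIN categories qualitatively.

```python
# COSMIN recommendation categories, simplified:
# A: sufficient content validity and at least low quality evidence
#    for sufficient internal consistency.
# C: high quality evidence for an insufficient measurement property.
# B: everything else (further validation needed).
def recommend(content_validity_sufficient: bool,
              internal_consistency_evidence: str,
              high_quality_insufficient_property: bool) -> str:
    if content_validity_sufficient and \
       internal_consistency_evidence in ("high", "moderate", "low"):
        return "A"
    if high_quality_insufficient_property:
        return "C"
    return "B"

# Example mirroring this review: no PROM had sufficient content validity,
# and none had high quality evidence for an insufficient property.
print(recommend(False, "moderate", False))  # B
```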

Results

Study selection

Of the 8339 potential articles for this review, 3107 duplicates were identified and removed before screening commenced. The titles and abstracts of the remaining 5232 articles were screened for eligibility, and 4531 articles were excluded. In the next step, 696 full-text articles were screened against the inclusion/exclusion criteria, and 680 articles were excluded. Five articles were included from the screening of references in the included articles [33,34,35,36,37], one article was included based on a Google Scholar search [38], and one article [39] was included based on a systematic review [21]. In total, 23 articles were included in the review (Fig. 1).

Fig. 1 PRISMA flow diagram for the search strategy and selection of records. Template from: Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 2021;372:n71. https://doi.org/10.1136/bmj.n71

Study characteristics

Thirteen PROMs were identified (Table 1), and the characteristics of the PROMs under study are reported in Table 2. Comparator instruments identified in the studies are described in Online Resource 4. The 23 included articles described 28 studies, as some articles reported multiple studies assessing different measurement properties. Eleven studies included only surgically treated ankle fractures. Patient ages ranged from 16 to 94 years, with mean ages of 41–58 years. Follow-up times ranged from one month to five years (Table 3).

Table 1 Included PROMs
Table 2 Characteristics of included PROMs
Table 3 Characteristics of the included studies

Measurement properties

One article assessed the measurement properties of several PROMs [40]. Most of the studies assessed multiple measurement properties. No studies assessed cross-cultural validity/measurement invariance or criterion validity. The measurement properties of the OMAS were the most frequently assessed. Table 4 presents the results for the Ankle Fracture Outcome of Rehabilitation Measure (A-FORM) and the three most extensively validated instruments. A summary of findings table for all PROMs is presented in Online Resource 5.

Table 4 Summary of findings tables for the A-FORM, LEFS, OMAS and SEFAS

PROM development and content validity

Only the A-FORM [41] had a methodologically adequate PROM design, but the lack of cognitive interviews or pilot studies yielded an inadequate rating for the methodological quality of the total PROM development. The Trauma Expectation Factor Trauma Outcome Measure (TEFTOM) [42] was rated as having inadequate methodology in both the PROM design and pilot testing. Owing to the inadequate ratings for total PROM development, the content validity of both instruments was based on the reviewers’ ratings only and achieved the lowest level of evidence.

Three studies included translations [35, 36, 43] and were assessed for comprehensibility as part of the content validity assessment, but they were not given a total content validity rating or quality of evidence grading due to the lack of assessment of relevance and comprehensiveness.

Structural validity

One study performed confirmatory factor analysis (CFA) on the OMAS, the Self-reported Foot and Ankle Score (SEFAS) and the Lower Extremity Functional Scale (LEFS) [40], and each met the criteria for a sufficient rating of structural validity (Table 4). However, two other studies demonstrated a lack of unidimensionality for the LEFS using Rasch analysis [37, 44]. In addition, exploratory factor analysis (EFA) was performed to explore the dimensionality of the OMAS [45] and the LEFS [34], and two subscales were identified in both instruments. As the COSMIN guidelines do not define criteria for EFA, the results of these studies did not receive a rating. The summarized results for the LEFS were thus conflicting, and the level of evidence was not graded.

Internal consistency

Summarized results from several studies of very good methodological quality yielded Cronbach’s alpha values of 0.76–0.84 for the OMAS and 0.93 for the SEFAS, indicating sufficient internal consistency. Internal consistency parameters were reported for the LEFS and the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), but these were not rated due to a lack of evidence for sufficient structural validity.
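For reference, Cronbach’s alpha relates the sum of the item variances to the variance of the total score, α = k/(k−1)·(1 − Σσᵢ²/σ²ₜₒₜ). A minimal sketch with hypothetical item-level data:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Alpha for a matrix with one row per patient, one column per item."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)      # per-item sample variance
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of the sum score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical responses of five patients to a four-item PROM:
x = np.array([[3, 4, 3, 4],
              [1, 2, 1, 2],
              [4, 4, 3, 4],
              [2, 2, 2, 3],
              [3, 3, 4, 4]])
print(f"alpha = {cronbach_alpha(x):.2f}")  # alpha = 0.95
```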

Reliability

The LEFS achieved high quality evidence for sufficient reliability, supported by two studies of adequate methodological quality reporting intraclass correlation coefficients (ICCs) of 0.91–0.93 (Table 4). The OMAS, SEFAS and Visual Analogue Scale Foot and Ankle (VAS-FA) had moderate quality evidence for sufficient reliability, while the TEFTOM and the Munich Ankle Questionnaire (MAQ) had low and very low quality evidence, respectively.

Measurement error

The summarized smallest detectable change (SDC) for the OMAS was 9.1–19.0. One study of inadequate methodological quality reported the value of 9.1 and carried less weight in the overall rating. The remaining studies reported values of 12.0 and 19.0, which were higher than the MIC of 9.7 points reported by McKeown et al. [45]. This indicates that the instrument cannot separate an important change (from the patients’ perspective) from measurement error for scores between these values, resulting in an insufficient rating. The quality of evidence was downgraded to very low for three reasons: (1) only one study assessed the MIC, (2) only one study had adequate methodological quality, and (3) indirectness due to considerable differences in follow-up times (16 weeks and 4.3 years) (Table 4).
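The arithmetic behind this comparison can be sketched with the conventional formulas SEM = SD·√(1 − ICC) and SDC = 1.96·√2·SEM (the primary studies may have derived the SEM differently). The SD and ICC below are hypothetical; the MIC of 9.7 is the value reported for the OMAS.

```python
import math

def smallest_detectable_change(sd: float, icc: float) -> float:
    sem = sd * math.sqrt(1 - icc)     # standard error of measurement
    return 1.96 * math.sqrt(2) * sem  # 95% limit for a true change

sdc = smallest_detectable_change(sd=15.0, icc=0.90)  # hypothetical inputs
mic = 9.7                                            # OMAS MIC [45]
print(f"SDC = {sdc:.1f}; MIC detectable beyond error: {sdc <= mic}")
# SDC = 13.1; MIC detectable beyond error: False
```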

Two studies of the SEFAS reported SDCs between 6.6 and 6.8 [40, 43], one of which had inadequate methodological quality due to a lack of stability between measurement points. Both SDCs were higher than the MIC of five points reported by Erichsen et al. [43], but the calculation of this MIC carries a considerable risk of bias due to the small sample size and inconsistency in the change scores across subgroups.

The LEFS lacked MIC reporting and could not be rated according to the criteria for good measurement properties.

Hypothesis testing for construct validity

The OMAS, WOMAC, LEFS, SEFAS and MAQ had 75% or more of their hypotheses confirmed. In the validation of the Finnish version of the VAS-FA, the construct was not clearly defined, and the study was rated as having inadequate methodological quality. The TEFTOM and the PROMIS-PF CAT were the only instruments with an insufficient overall rating; however, the quality of evidence was low.

Responsiveness

The LEFS achieved a sufficient rating with two confirmed hypotheses (Table 4). The authors used an external measure but did not specify the anchor question, which resulted in a downgrading of the level of evidence. For the MAQ, three hypotheses based on the construct approach were confirmed, correlating the three domains with the same global rating scale (GRS) and yielding a sufficient rating with moderate quality of evidence.

Interpretability

The MICs of the OMAS and SEFAS were 9.7 and five points, respectively. The latter was based on a small sample of 39 patients, and the data did not show a gradual increase in change scores among patients who improved, which introduces a risk of bias in the determination of this value.

A floor effect of 22.4% was reported for the SEFAS at the six-week follow-up. A ceiling effect was reported for the OMAS (17%) and the LEFS (27–29%); both studies had follow-up times of more than four years (Online Resource 6).
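Floor and ceiling effects are simply the proportions of patients scoring at the minimum or maximum of the scale; a threshold of 15% is a common convention (an assumption here, as the review reports raw percentages only). A minimal sketch with hypothetical scores:

```python
def floor_ceiling(scores: list[float], min_score: float, max_score: float):
    """Return the proportions of scores at the scale minimum and maximum."""
    n = len(scores)
    floor = sum(s == min_score for s in scores) / n
    ceiling = sum(s == max_score for s in scores) / n
    return floor, ceiling

# Hypothetical 0-100 PROM scores at long-term follow-up:
floor, ceiling = floor_ceiling([100, 100, 95, 100, 85, 70, 100, 90, 100, 60], 0, 100)
print(f"floor = {floor:.0%}, ceiling = {ceiling:.0%}")  # floor = 0%, ceiling = 50%
```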

Discussion

Summary of evidence

A recent review of PROMs used as primary outcomes in interventional trials for patients with ankle fractures [12] identified the OMAS as the most commonly used multi-item PROM. In a systematic review assessing the measurement properties of PROMs used in foot and ankle disorders, the Manchester-Oxford Foot Questionnaire (MOXFQ) was reported to have the best overall psychometric properties [46]. However, the current review shows that the MOXFQ is entirely absent from validation studies in the ankle fracture population. Collectively, there is still a lack of studies covering all measurement properties of PROMs for patients with ankle fractures. Among the PROMs used in the evaluation of the ankle fracture population, the measurement properties and interpretability of the OMAS, LEFS and SEFAS were the most studied. However, there is a consistent lack of validation of the most important measurement property, content validity, reflecting uncertainty about whether all aspects of a given construct are covered. Thus, none of the PROMs could be placed in category A.

Validity and reliability

The OMAS was the most frequently assessed PROM in this population but lacked a content validity study of good methodological quality. Despite inadequate methodology in PROM development, subsequent content validity studies could still provide evidence for sufficient content validity. Of the instruments included in this review, only the A-FORM had an adequate PROM design [41]. Its design and concept elicitation were based on a qualitative study on the life impact of ankle fractures [47], but it lacked cognitive interviews or pilot tests. The developers of this instrument complied with many of the crucial steps in the development phase of a PROM, providing a sound foundation for subsequent validation studies. The TEFTOM [42], on the other hand, had severe flaws in the development phase, where the study population was limited to fractures of the ankle and distal tibia. Such a limited population cannot adequately represent the instrument’s intended population of general trauma patients.

With regard to structural validity, CFA is preferred over EFA for testing existing factor structures [48]. The OMAS and SEFAS appeared to be unidimensional when assessed with CFA [40]. However, the OMAS was also assessed with EFA [45], and two subscales were found, namely, (1) ankle function and (2) ankle symptoms, which may indicate a two-factor structure in this instrument.

The LEFS was also assessed with CFA and achieved a sufficient rating for structural validity, although data from the same study showed a better fit for a two-factor structure [40]. Another study, validating the Finnish version of the instrument, also found a two-factor structure [34]. Lin et al. [44] performed a Rasch analysis of the LEFS at three different time points. Most of the items were within the acceptable range for goodness-of-fit, but one item (sitting for 1 h) had unacceptable outfit statistics at all time points. The article did not provide enough information for a rating against the criteria for good measurement properties [49], but the Rasch analysis showed a lack of items for patients with greater abilities, suggesting cautious use of the instrument in patients with high demands or in the long-term follow-up of ankle fractures. Another Rasch analysis of the LEFS demonstrated disordered item thresholds for the response categories [37]. These studies were rated as having at least adequate methodological quality, but their results were conflicting, and no obvious subgrouping of the studies could explain the discrepancies. If this instrument is to be used in an ankle fracture population, one should be wary of the possible lack of unidimensionality.

Reliability and measurement error are usually assessed in a test–retest study. Often, the measurement error of an instrument is neglected, and reliability is reported only as an ICC. However, assessing measurement error together with MIC values adds another dimension to the interpretation of the statistical and clinical meaning of the scores. In the current review, the OMAS, SEFAS and LEFS displayed good evidence of sufficient reliability. Measurement error parameters for these instruments were reported, but the lack of MIC values in the ankle fracture population left the interpretation incomplete. For example, only one study reported the MIC for the OMAS [45]. Evaluated together with two other studies that reported SDCs larger than this MIC [40, 50], this implies that the OMAS cannot distinguish a clinically important change from measurement error when scores fall between these two values. The quality of this evidence was rated very low due to considerable risk of bias, but it still underscores the importance of reporting both the measurement error and the MIC.

In the assessment of subjective outcome measures such as PROMs, one can hardly declare an instrument to have near-perfect validity and reliability, hence the reluctance to use the term “gold standard”. When PROMs are compared to each other, hypotheses are formed based on the assumed construct of each instrument while acknowledging the current evidence on the comparator instruments’ measurement properties. Hypothesis testing perhaps provides the least information on the validity of an instrument in a given application, since the method depends on the measurement properties of the comparator instruments and on the hypotheses postulated by the reviewers. However, acquiring evidence on this measurement property is a continuous process, and with growing empirical evidence, construct validity can be demonstrated through the ongoing probing of hypotheses. In the current review, the OMAS was subject to the most hypothesis testing, with nine articles of varying methodological quality assessing construct validity, resulting in 75% confirmed hypotheses. The LEFS also had multiple studies of at least adequate methodological quality assessing construct validity, resulting in 87% confirmed hypotheses.

Limitations

Where methodologically adequate studies were missing in the assessment of content validity, the reviewers’ rating remained the only rating. Depending on the reviewers’ level of knowledge and experience, this may introduce bias into the assessment. Likewise, the categorization of expected correlations for the hypothesis testing of construct validity was discussed and agreed upon within the review team, but other reviewers might have defined it differently.

Seven articles were not retrieved by the main search. Four did not include the word “fracture” in their title, abstract or keywords [34, 35, 37, 39] and were not indexed with subject headings for ankle fractures. The remaining three articles were missed because they lacked the terms or phrases in the PROM-inclusion filter developed by the University of Oxford [33, 36, 38].

Conclusion

None of the PROMs included in this review received a category A recommendation, due to a lack of evidence for sufficient content validity and internal consistency. In addition, none of the PROMs had high quality evidence for an insufficient measurement property, leaving category C empty. Therefore, all PROMs included in this review were assigned to category B. In the absence of PROMs in category A, the OMAS, SEFAS and A-FORM received a temporary recommendation for use for evaluative purposes in the ankle fracture population, pending additional evidence. Further research should focus on conducting high quality content validity studies of the PROMs used in this context. There is also a clear need for more empirical evidence on the remaining measurement properties of the A-FORM.