Introduction

Assessing the outcome following orthopedic surgery remains a topic of debate. In general, the outcome is assessed by imaging, range of motion (ROM), or patient-reported outcome measures (PROMs). In recent years, there has been an increasing demand for PROMs, both from scientific committees and governments. This holds especially true for elective foot and ankle surgery, as it has to show its efficacy to both the patient and the insurance provider.

In a previously published living systematic review, the authors assessed the outcome following hallux valgus surgery [1]. Despite a considerable number of eligible studies, all of which had a level of evidence of I or II, a meta-analysis could only be conducted for the hallux valgus angle (HVA), the intermetatarsal angle (IMA), and the American Orthopaedic Foot and Ankle Society (AOFAS) score. This highlights the lack of standardization of study protocols in foot and ankle surgery. In the initial study, the authors did not conduct a formal analysis of all the PROMs assessed in the different studies.

Therefore, the aim of the current study was to reanalyze the studies included in the living systematic review with respect to the patient-rated outcome scores they used. The results were discussed to identify a possible standard set of outcome measures for hallux valgus outcome research.

Materials and methods

Study selection and data extraction

The study was based on a previously published living systematic review and is part of the current revision process of the German guidelines for hallux valgus surgery (033-018). The review was registered a priori (Prospero #CRD42021261490) and conducted per the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA-P) guidelines [2] and the PICOS criteria [3]. Four common databases (Medline (PubMed), Scopus, Central and EMBASE) and the grey literature were searched from 01/01/2012 to 01/31/2023. Prospective studies comparing either two surgical procedures or one surgical procedure for different stages of hallux valgus deformity were included. The entire study selection, data extraction, and assessment process was conducted by two independent reviewers (SE, SFB). Disagreement at any stage was resolved by discussion with a third reviewer (HP).

Data assessment

The level of evidence was rated per the recommendations of Wright et al. [4], and risk of bias was assessed with the Risk of Bias 2 (RoB 2) tool [5] or the Newcastle–Ottawa scale [6], where appropriate. The primary outcome parameter was the set of PROMs used in each individual study. PROMs included the visual analog scale for pain (VAS), clinician-based outcome scores, and any quality-of-life (QOL) score. These were analyzed descriptively by their frequency, the level of evidence [4], the type of osteotomy performed, and the quality of the journal (i.e. impact factor) in which the study was published.
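A descriptive tabulation of this kind can be sketched in a few lines of Python. The snippet below is only an illustration under stated assumptions: the records, field names, and helper functions are hypothetical and are not the extraction sheet or the data of this review, and the share of studies reporting an instrument is just one possible denominator for the frequencies.

```python
from collections import Counter, defaultdict

# Hypothetical extraction records for illustration only (NOT data from the review).
studies = [
    {"id": "A", "level": "I", "osteotomy": "chevron", "proms": ["AOFAS", "VAS"]},
    {"id": "B", "level": "I", "osteotomy": "scarf", "proms": ["AOFAS"]},
    {"id": "C", "level": "II", "osteotomy": "chevron", "proms": ["MOXFQ", "VAS"]},
    {"id": "D", "level": "II", "osteotomy": "scarf", "proms": ["AOFAS"]},
]


def prom_frequencies(records):
    """Return {instrument: share of studies that report it}."""
    # set() guards against an instrument being listed twice for one study.
    counts = Counter(p for r in records for p in set(r["proms"]))
    n = len(records)
    return {prom: count / n for prom, count in counts.most_common()}


def prom_counts_by(records, key):
    """Return {stratum: {instrument: number of studies}} for a grouping field."""
    grouped = defaultdict(Counter)
    for r in records:
        grouped[r[key]].update(set(r["proms"]))
    return {stratum: dict(counter) for stratum, counter in grouped.items()}


if __name__ == "__main__":
    print(prom_frequencies(studies))
    print(prom_counts_by(studies, "level"))
    print(prom_counts_by(studies, "osteotomy"))
```

The same grouping helper could be applied to a journal or impact-factor field, provided such a field were added to each record.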

Results

Study selection

Figure 1 depicts the study selection process. A total of 3022 studies were screened by title and abstract and 378 by full text. Finally, 46 primary studies [7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52] were enrolled for qualitative analysis, consisting of 40 studies comparing different surgical procedures [7,9,10,11,12,13,14,15,16,17,18,19,20,22,23,24,25,26,28,30,31,32,33,35,36,37,38,39,40,41,42,44,45,46,47,48,49,50,51,52] and six studies comparing the same surgical procedure for different severities of HV deformity [8, 21, 27, 29, 34, 43]. Thirty studies were RCTs (RoB 2: 2 × high risk, 28 × moderate risk) and 16 were non-randomized comparative studies (Newcastle–Ottawa scale: 6 ± 1 points ≙ moderate risk).

Fig. 1 PRISMA flow chart. n: number of studies

Overall, only eight different outcome measures were used in the studies. The individual PROMs per the studies’ level of evidence, in descending order, and per the surgical technique are outlined in Table 1. The two most frequently used clinical outcome measures were the AOFAS (55%) and the VAS (30%). The remaining six PROMs were used in less than 5% of the studies each. No correlation could be found between the frequency of the individual scores and the level of evidence or the type of osteotomy. Likewise, the use of PROMs did not differ between journals or with the impact factor of the publishing journal (Table 2).

Table 1 Outline of the included PROMs per level of evidence and the type of osteotomy
Table 2 Outline of the included PROMs per the different journals in descending order per the impact factors

Discussion

The AOFAS and VAS scores were the most frequently applied assessments to rate the outcome of hallux valgus surgery over the past eleven years. This result was independent of the studies’ level of evidence, the type of surgery performed, and the impact factor of the journal in which the study was published.

The current study is part of the revision process of the German Hallux Valgus guideline. The authors’ intention was to define a standard set of PROMs to evaluate the patient-rated outcome of hallux valgus surgery. This set of scores should be valid and allow a comparison to current literature. Our approach was to re-analyze the studies identified in a previously published living systematic review, which included prospective studies published after 2012 comparing either two surgical procedures or one surgical procedure for different stages of hallux valgus deformity. Consequently, the analyzed data set could have a selection bias. Still, only higher-quality studies were included, and one could assume that these studies devoted the most time to properly designing their methodology. Although eight different scores were used, by far the most frequently used were the AOFAS and VAS. The VAS was scored on a ten-item Likert scale in all cases.

Throughout the foot and ankle literature, the AOFAS Clinical Rating Systems [53] are the most commonly used outcome scores. This held true for the current analysis of studies on hallux valgus surgery. Because this study is a component of the revision of the German Hallux Valgus guidelines, one limitation is that only studies published after 2012 were included. The large number of studies included, however, allows further interpretation of the state of the art of PROMs used in hallux valgus surgery outcome studies.

Although the AOFAS scales remain the most widely used instrument, they have been criticized for several reasons. First, the AOFAS scales are not PROMs, as they combine a patient-rated and a clinician-rated section. They are clinician-based outcome measures, which evaluate patients’ pain, function, and alignment based upon clinicians’ observations. They therefore do not eliminate a possible observer bias [54]. Subsequent studies demonstrated their limitations, and the American Orthopaedic Foot and Ankle Society does not endorse the scales due to insufficient reliability and validity [55]. Guyton et al. [56] used a Monte Carlo computer modelling technique to assess limitations of the AOFAS scoring system. They simulated, for each item, the responses of different, idealized patient populations. The two major points of concern regarding the reliability of the AOFAS score were: the scoring items are used as absolute descriptors (e.g. “no limitation” or “no pain”) and are therefore susceptible to an interpretation bias by both patients and clinicians; and the limited number of response intervals leads to a pronounced floor and/or ceiling effect [56, 57]. Furthermore, the AOFAS overemphasizes pain, which accounts for a maximum of 40 points, resulting in inferior outcome measures concerning other symptoms like stiffness or deformity [58]. Finally, the minimal clinically important difference (MCID) is lower for older patients than for younger patients, and patients with mid-range disability generally have lower MCID values than those with minimal or severe disability [59]. Use of the AOFAS Clinical Rating Systems as the sole instrument is therefore discouraged [55].

Due to these limitations, the foot and ankle community should strive to establish a new standard to assess patient-rated outcomes, not only in hallux valgus surgery. There are more than 89 assessment tools available which measure overall foot and ankle function, overall health, or are designed for specific diagnoses and procedures.

General quality-of-life outcome scores or pain scores alone are not sufficient to evaluate a hallux valgus population. Other measures have been developed and tested for a wide variety of pathologies in foot and ankle surgery (FFI, FAAM, AAOS, FHSQ and others). Furthermore, a disease-specific outcome measure is necessary to assess outcomes. The MOXFQ, SEFAS and FAOS have been evaluated for hallux valgus surgery.

The FAOS [60], consisting of 42 items in five subscales, was derived from the Knee injury and Osteoarthritis Outcome Score (KOOS) [61]. It was validated first on patients with general foot and ankle disorders, then on patients with hallux valgus deformity [54]. It showed acceptable validity, reliability, responsiveness, and comparability to the SF-36 in four out of five subscales [54]. The sports and recreation subscale showed little responsiveness to hallux valgus surgery, and ceiling effects were present for the activities of daily living and sports subscales. The symptoms subscale showed a low correlation to the SF-36 due to the foot-specific items assessed in the FAOS [54].

The SEFAS, consisting of 12 items in 3 subscales, was developed for assessing ankle replacement surgery but has been tested on a hallux valgus population with good psychometric properties [62]. It showed good validity, reliability, and responsiveness, although MCID data are lacking.

The MOXFQ, consisting of 16 items in 3 domains, has been validated for foot and ankle disorders in general and specifically for a hallux valgus population. It has been extensively tested and was more sensitive than general health measures for quantifying the outcome of hallux valgus surgery [63, 64]. It has been compared to other outcome measures with good results. A comparison of the MOXFQ and the SEFAS demonstrated good psychometric properties with excellent test–retest reliability and internal consistency for both scores, with superior responsiveness for the MOXFQ [65]. The MOXFQ showed higher responsiveness in detecting changes over time or after surgery and has been translated and evaluated in more languages than the SEFAS [65].

Recently, the EFAS score has been validated in a population of hallux valgus patients with a short follow-up time of 6 months [66]. It showed fair construct validity and reliability. However, responsiveness has not been evaluated at all. Further validation and comparative studies are necessary to rate the EFAS score in comparison to the above-mentioned PROMs.

Based on these considerations, it is all the more surprising that the vast majority of authors still rely on the AOFAS as their primary outcome score. Only three studies assessed the MOXFQ, and none assessed the SEFAS or EFAS score. The expert panel revising the German guidelines for hallux valgus surgery therefore recommends the use of the MOXFQ as the primary outcome score, due to its higher responsiveness and availability in more languages. There is a strong recommendation to also assess the VAS (10-item Likert scale). The use of the EFAS will be reconsidered in the next guideline revision. The AOFAS should only be assessed as a secondary outcome parameter to allow a higher comparability between the different studies.

Conclusion

Based on a systematic literature review, the AOFAS and VAS are the most frequently used outcome tools in studies assessing the outcome following hallux valgus surgery. Based on the literature available, the MOXFQ is a more valid alternative.