Introduction

In recent years, therapeutic options have increasingly been selected on the basis of evidence-based medicine (EBM). As a result, the importance of a standard rating system for evaluating such evidence is becoming apparent. Such a rating system demands reliability in rating as well as appropriate coverage of the diseases concerned and the methods for their therapy. In this context, several standard rating systems in orthopedic surgery have been examined repeatedly for reliability.1–8 Unfortunately, however, in the field of the foot and ankle, the validity and reliability of the Japanese Orthopaedic Association (JOA) scale have not been verified.9,10 Moreover, although the American Orthopaedic Foot and Ankle Society (AOFAS) clinical rating system11 could now be called a global standard, its validity and reliability have not been verified.

The JOA attempted to provide an internationally accepted standard rating system that incorporated not only objective evaluation by orthopedists but also subjective evaluation by patients. The JOA thus delegated tasks to each member association to adjust and modify standard rating systems and verify their validity and reliability. In response to this request, the Japanese Society of Surgery of the Foot (JSSF) organized the Committee on Rating Standards for Foot Disease in June 2000. After many discussions the committee created the JSSF standard rating system, composed of five new scales. Four of these were set up for four respective sites by modifying the AOFAS clinical rating systems11; the remaining scale, for the rheumatoid arthritis (RA) foot and ankle joint, was created by modifying the conventional JOA scale9,10 (part I of this study, which appears in this issue). Moreover, each scale included an explanation as well as rating scores for each item so the individual items to be evaluated could be understood (part I of this study). Our current four site-specific scales are a novel and original Japanese version and are far from a duplicate of the AOFAS clinical rating system, as we modified the expressions and content to suit Japanese people. We also added interpretation criteria for each item and rating criteria, such as a pain scale, which were lacking in the AOFAS scale. This is why the Committee on Rating Standards for Foot Disease of the JSSF grouped together the five scales, comprising four site-specific scales and the RA foot and ankle scale, and termed them the JSSF standard rating system. From 2001 on, actual patients were evaluated at multiple institutions to collect data using the JSSF standard rating system.

In part II (described herein) we report the results of studies performed on a multiple-institution scale on the validity and inter- and intraclinician reliability of the evaluation items with regard to the JSSF standard rating system composed of these five scales as well as the conventional JOA scale.

Materials and methods

Selection of clinicians as evaluators

The subjects were orthopedists at nine institutions to which the authors belonged. Because it was thought that clinical experience would influence the reliability of the evaluation, the clinicians were selected according to the following three levels of experience: (1) much experience (specialist with at least 2 years’ experience in foot surgery); (2) moderate experience (generalist with approximately 6–7 years’ experience in an orthopedics department); and (3) little experience (resident who had graduated from a medical university within the previous 1–2 years). In most cases two orthopedists representing each level of experience were selected from each institution.

Selection of patients for evaluation

Patients with diseases of the foot and ankle who met the following criteria were included: (1) symptomatically stable for at least 1 month prior to the study; (2) symptomatically stable for at least 1 month after the first evaluation; (3) consented to participate in the study; and (4) had no underlying diseases or complications that might interfere with the results of the evaluation.

Study design

A clinician from the same institution independently evaluated all the patients selected from that institution (first evaluation). Attempts were made to complete the evaluation within 1 day, but when this was not possible it was extended into a second day. No other evaluating clinicians were present during this first evaluation. The evaluating clinician explained to the patients that simple answers to the questions were expected. When possible, the same evaluating clinician performed both the first and second evaluations. The second evaluation was conducted within 1–4 weeks of the first evaluation. As with the first evaluation, the second was completed within 1 day when possible. The results were recorded immediately after the evaluation, and subsequent corrections were prohibited. The results of the first evaluation were concealed at the time of the second evaluation. Patients were evaluated according to the order of the items on the instrument being evaluated. The evaluations of the items in the JSSF standard rating system and the JOA scale were conducted on the same day as far as possible. The results were sent to the server at each institution using the Web system established for data collection in the present study and stored until tabulation.

Statistical methods

  1. To determine interclinician agreement in terms of the total scores (validity), the intraclass correlation coefficient (ICC) was calculated from the evaluation data collected from at least two patients who underwent the same evaluation by at least two clinicians from the same institution, provided that all relevant data from that institution were available (Analyzed Subject Data A). To establish the multiinstitutional overall scale for interclinician reliability, the ICC was calculated by the random effect model using data obtained for patients with diseases of the ankle-hindfoot. Sufficient data for the other sites were not available from all of the institutions, but sufficient data for this site were available from five institutions.

  2. To determine intraclinician agreement (validity), the total scores from the first and second evaluations were compared using the distribution of differences between the two evaluations for each institution that provided sufficient data (Analyzed Subject Data B). Each item was evaluated by determining Cohen’s coefficient of agreement (κ) and the rate of complete agreement (RC) between the first and second evaluations.

  3. To determine the relation between the scores in each scale and patient satisfaction, the relation between patient satisfaction and outcome (total score) was investigated using the evaluations of only those patients who had undergone surgery (Analyzed Subject Data C). The degree of satisfaction was evaluated as “very satisfactory,” “satisfactory,” “noncomputable,” “slightly unsatisfactory,” and “very unsatisfactory.” The total scores were grouped into five bands (0–50, 60–69, 70–79, 80–89, and 90–100 points), which were ranked 0, 1, 2, 3, and 4, respectively. Spearman’s rank correlation coefficient (ρ) was then obtained.
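The ranking and correlation step in item 3 can be illustrated with a minimal Python sketch. It is not the authors' analysis code. Two points are assumptions made only for illustration: the five satisfaction grades are coded 4 (very satisfactory) down to 0 (very unsatisfactory), and total scores of 51–59, which fall in none of the bands as printed, are placed in rank 0. The function names are hypothetical.

```python
from math import sqrt


def score_rank(total_score):
    """Map a 0-100 total score to the outcome rank 0-4.

    Bands follow the text; scores of 51-59 are not covered there and are
    assigned rank 0 here (an assumption of this sketch).
    """
    if total_score >= 90:
        return 4
    if total_score >= 80:
        return 3
    if total_score >= 70:
        return 2
    if total_score >= 60:
        return 1
    return 0


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on tie-averaged ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(vx * vy)


# Hypothetical data: satisfaction codes (4 = very satisfactory) and total scores.
satisfaction = [4, 3, 3, 1, 0, 2]
total_scores = [95, 84, 72, 65, 40, 78]
rho = spearman_rho(satisfaction, [score_rank(s) for s in total_scores])
```

Because both variables have only five levels, ties are frequent, which is why the sketch averages tied ranks instead of using the shortcut formula that assumes distinct ranks.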

Results

Evaluating clinicians and patients

A total of 65 clinicians evaluated the patients. The distribution of clinicians according to experience level was 21.5% specialists, 30.8% generalists, and 47.7% residents. There were 610 patients, representing 313 diseases of the ankle-hindfoot, 47 diseases of the midfoot, 153 diseases of the hallux, 50 diseases of the lesser toe, and 47 with RA. Evaluation by the JOA scale was conducted simultaneously with that by JSSF scales in 501 of the 610 patients.

Results of statistical analysis

  1. For Data A, the number of patients and the number of evaluating clinicians varied among the institutions. The lower limit of the 95% confidence interval (CI) of the ICC was used as the indicator of interclinician agreement, with 0.41 as the threshold. A lower limit of >0.41 was observed for the ankle-hindfoot and hallux by the JSSF scales and for the ankle-hindfoot, midfoot, and lesser toe by the JOA scale (P < 0.05; ICC > 0.4 in testing) (Table 1). As for patients with diseases of the ankle-hindfoot, the overall ICC calculated from the data for the five institutions was 0.93 for the JSSF scale compared with 0.91 for the JOA scale.

    Table 1 Intraclass correlation coefficient at each institution
  2. For Data B, the percentages of values for each site evaluated by the JSSF scales relative to that evaluated by the JOA scale were as follows: 83 to 83 for the ankle-hindfoot, 10 to 4 for the midfoot, 45 to 56 for the hallux, 6 to 4 for the lesser toe, and 21 to 21 for RA.

    a. Distribution of differences in total scores.

      1) Regardless of the experience level, the difference in total scores between the first and second evaluation was within the range of ±1 in 43.4% and 42.3% of the data evaluated by the JSSF and JOA scales, respectively, for the ankle-hindfoot, indicating almost no difference between the two. These frequencies were higher than those for other sites, and the difference was within ±5 in approximately 70% of data evaluated by the two scales for the ankle-hindfoot. The difference was within a range of ±1 in 31.1% and 37.5% of the data evaluated by the JSSF scales and the JOA scale, respectively, for the hallux. The corresponding frequencies in RA patients were 19.5% and 19.0% of data evaluated by the JSSF and JOA scales, respectively; differences within the range of ±5 were observed in approximately 60% of the data evaluated by the two scales. It was difficult to evaluate the midfoot and lesser toe because of the small number of patients with diseases at these sites (Table 2).

        Table 2 Distribution of difference in data between first and second evaluations (regardless of experience level)
      2) An influence of experience level was observed when the difference in the total scores between the first and second evaluations was within a range of ±1: experience level tended to affect the data evaluated by the JSSF scale for the ankle-hindfoot and the data evaluated by both scales for the hallux and RA. When the difference was within the range of ±5, however, there was almost no difference in the results depending on the experience level. It was difficult to evaluate the midfoot and lesser toe because of the small number of patients with diseases at these sites (Table 3).

        Table 3 Distribution of difference in data between first and second evaluations (with regard to experience level)
    b. Evaluation of each item.

      1) For the first and second evaluations, Cohen’s coefficient of agreement (κ) was high for all items for the ankle-hindfoot evaluated by the JSSF scale and low for sagittal motion, muscle strength, and sensory disturbance (paresthesia) of the hindfoot evaluated by the JOA scale (Table 4). The coefficient (κ) was low for all items other than sagittal motion of the metatarsophalangeal (MTP) joint of the hallux evaluated by the JSSF scale, and high for most of the items evaluated by the JOA scale. It was difficult to evaluate data for the midfoot, lesser toe, and RA because of the small number of patients in the respective categories.

        Table 4 Rate of complete agreement and Cohen’s coefficient of agreement
      2) The mean RCs for each item evaluated by the JSSF and JOA scales were 81.2% and 84.3%, respectively, for the ankle-hindfoot; 70% and 57.1%, respectively, for the midfoot; 75.6% and 78.5%, respectively, for the hallux; 83.3% and 82.1%, respectively, for the lesser toe; and 76.2% and 77.5%, respectively, for RA. According to the items, the intraclinician RC was high for all items of the ankle-hindfoot by the JSSF scale, whereas the rate was low for instability of the ankle-hindfoot by the JOA scale. The rate was low for alignment of the hallux by the JSSF scale and for pain, deformed forefoot, hindfoot sagittal motion, and walking on tiptoe by the JOA scale. The rate was low for a deformed lesser toe of the forefoot, deformed hindfoot, and ability to walk when evaluated by the JSSF scale in RA patients and for pain, deformed forefoot, hindfoot sagittal motion, and ability to walk when evaluated by the JOA scale.

  3. For Data C, the ratios of the total score for each site as evaluated by the JSSF scales to those as evaluated by the JOA scale were as follows: 169 : 161 for the ankle-hindfoot, 14 : 14 for the midfoot, 99 : 105 for the hallux, 34 : 33 for the lesser toe, and 24 : 24 for RA.

    a. There was a significant correlation between patient satisfaction and the total score (outcome) for the hindfoot and hallux by the JSSF standard rating system and for the ankle-hindfoot, hallux, and lesser toe by the JOA scale (Table 5).

        Table 5 Relation between patient satisfaction and total score (outcome)

Discussion

With the practice of EBM gaining ground worldwide, many epidemiological surveys and clinical studies are being performed for the purpose of obtaining evidence. An assessment of the results is essential for surveys and studies, and the relative superiority of the efficacy of one treatment or therapeutic effect over another should be evaluated based on the results of such determinations. For objective assessment of the results, a standard rating scale for evaluation should therefore be established. Important requirements for a rating scale are a high degree of validity and reliability. To our knowledge, the intraclinician and interclinician validity and reliability of standard rating systems for evaluating diseases of the foot and ankle, including the AOFAS clinical rating systems, have never been examined by multiinstitutional studies.

As for the interclinician agreement in terms of the total scores, the ICC was calculated from data obtained from evaluation of at least two of the same patients by multiple clinicians at the same institution. Only institutions from which there were sufficient data for analysis were included. At each institution, the ICC was high for the ankle-hindfoot and hallux by the JSSF scales and high for the ankle-hindfoot, midfoot, and lesser toe by the JOA scale. These results indicate that reliability was high at each institution, although overall multiinstitutional interclinician reliability could not be evaluated. A previous report evaluated reliability over all participating institutions using the ICC by the random effect model.7 With that method, however, a correct evaluation may not be obtained when the experience or knowledge of the examiners or the severity of the disease in patients differs among institutions, or when the amount of data is small. Therefore, in principle we calculated a separate ICC for each institution. To verify our findings, we also calculated the ICC from data for the ankle-hindfoot for all five institutions following a similar random effect model7 and found that the ICC was 0.9 or higher by both the JSSF scale and the JOA scale. Even when the same patient was examined at many institutions, the reliability of the standard rating scale for evaluation of diseases of the ankle-hindfoot was estimated to be high.
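For readers unfamiliar with the ICC, the following Python sketch shows one common form, the one-way random-effects ICC for single ratings, computed from a balanced table of total scores. The text does not state which ICC form or software was used, so the choice of this form, the function name `icc_oneway`, and the example data are assumptions made for illustration only.

```python
def icc_oneway(scores):
    """One-way random-effects ICC for single ratings.

    `scores` holds one row per patient; each row holds the total scores
    given to that patient by k clinicians (balanced design assumed).
    """
    n = len(scores)        # number of patients
    k = len(scores[0])     # number of clinicians per patient
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]

    # Between-patient mean square.
    msb = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    # Within-patient mean square (disagreement among clinicians).
    msw = sum(
        (x - m) ** 2 for row, m in zip(scores, row_means) for x in row
    ) / (n * (k - 1))

    return (msb - msw) / (msb + (k - 1) * msw)


# Hypothetical total scores for four patients, each rated by two clinicians.
example = [[80, 82], [60, 61], [90, 88], [70, 72]]
icc = icc_oneway(example)
```

The ICC approaches 1 when the spread between patients dominates the spread among clinicians rating the same patient. This is also why a pooled estimate can mislead when patient severity differs strongly among institutions.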

When interclinician and intraclinician reliability of the JSSF standard rating system and the JOA scale were investigated merely from the viewpoint of differences in the total scores between the first and second evaluations, the range of validity tended to increase for the hallux and RA compared to that for the ankle-hindfoot, for which the validity was already found to be relatively high. The RC, which was reflected by Cohen’s coefficient of agreement for each item, also showed high validity on the JSSF and JOA scales for evaluation of the ankle-hindfoot, with almost no difference observed between the two scales, whereas the validity of the JOA scale for the hallux was higher than that of the JSSF scale. Thus, there was a difference in validity between the two scales for some sites of the foot and ankle. There were also some items for which statistical analysis could not be conducted because of the small number of patients; but the validity of the JSSF standard rating system was evaluated as being high by the assessment of intraclinician agreement because the concept of each scale of the JSSF standard rating system is almost the same.
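The two item-level measures discussed here, the rate of complete agreement (RC) and Cohen's κ, can be computed as in the sketch below. It uses the standard unweighted definition of κ. Whether the original analysis used a weighted variant is not stated, and the function names and example item scores are hypothetical.

```python
from collections import Counter


def complete_agreement_rate(first, second):
    """Share of patients given the identical item score at both evaluations."""
    return sum(a == b for a, b in zip(first, second)) / len(first)


def cohen_kappa(first, second):
    """Unweighted Cohen's kappa between two sets of categorical item scores."""
    n = len(first)
    p_observed = complete_agreement_rate(first, second)
    counts_first, counts_second = Counter(first), Counter(second)
    categories = set(counts_first) | set(counts_second)
    # Agreement expected by chance from the marginal frequencies.
    p_expected = sum(
        counts_first[c] * counts_second[c] for c in categories
    ) / (n * n)
    if p_expected == 1:
        # Both evaluations used a single identical category throughout.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical pain-item scores for six patients at the two evaluations.
first_eval = [40, 40, 30, 20, 20, 0]
second_eval = [40, 30, 30, 20, 20, 0]
rc = complete_agreement_rate(first_eval, second_eval)
kappa = cohen_kappa(first_eval, second_eval)
```

RC counts raw matches only, whereas κ discounts the agreement expected by chance, so an item on which most patients receive the same score can show a high RC and a low κ at the same time.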

As for intraclinician agreement assessed according to the level of clinical experience, it is assumed that proficiency in evaluation is necessary to obtain high validity of the evaluation when investigated only from the distribution of differences in the total scores.
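The distribution of differences in total scores referred to above can be tabulated as in the following sketch, which reports the share of paired evaluations whose totals differ by no more than a given tolerance (±1 or ±5 in this study), overall and by experience level. The record layout, level labels, and function names are assumptions made for illustration.

```python
def share_within(first_totals, second_totals, tolerance):
    """Share of pairs whose second total is within +/- tolerance of the first."""
    differences = [b - a for a, b in zip(first_totals, second_totals)]
    return sum(abs(d) <= tolerance for d in differences) / len(differences)


def share_within_by_level(records, tolerance):
    """Same share, grouped by the clinician's experience level.

    `records` is an iterable of (level, first_total, second_total) tuples.
    """
    grouped = {}
    for level, first, second in records:
        grouped.setdefault(level, []).append(abs(second - first) <= tolerance)
    return {level: sum(flags) / len(flags) for level, flags in grouped.items()}


# Hypothetical paired total scores from four evaluations.
first = [80, 75, 90, 60]
second = [81, 70, 90, 68]
within_1 = share_within(first, second, 1)   # 2 of 4 pairs
within_5 = share_within(first, second, 5)   # 3 of 4 pairs

records = [
    ("specialist", 80, 81),
    ("specialist", 75, 70),
    ("resident", 90, 90),
    ("resident", 60, 68),
]
by_level = share_within_by_level(records, 1)
```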

“The degree of satisfaction” in the evaluation of treatment is related to psychological aspects on the part of patients and differs from the functional aspects evaluated by clinicians. Therefore, the correlation between the degree of satisfaction on the part of patients and functional assessment by clinicians is not necessarily high, but there was a tendency for the outcome to be correlated with patient satisfaction. Each item in the standard rating system was considered to be a reflection of a subjective evaluation on the part of the patients. Recently, results obtained with instruments that take into account the viewpoint of patients, such as visual analogue scales (VAS) for the severity of pain and questionnaires about the quality of life (QOL) such as the SF-36, have been shown to be as reproducible as results based on data from pathophysiologic evaluations by clinicians. In other words, therapeutic results are increasingly determined directly according to the patient’s own evaluation from the viewpoint of EBM because there is much room for bias in evaluations by clinicians; thus, instruments such as the VAS and SF-36 produce highly accurate information.12–18 Therefore, each standard rating scale for evaluation that was inspected in this study is assumed to be a reflection to some extent of the subjective evaluation on the part of patients, but a standard rating system that would allow evaluation of the symptomatic improvement and QOL of patients from different viewpoints needs to be established in the future.

The present study was conducted with the aim of evaluating the validity and reliability of the JSSF standard rating system and the JOA scale according to the site of involvement in the foot and ankle. Diagnostic workups of the same patients at multiple institutions are difficult. Therefore, we were obliged to limit our analysis of interclinician reliability to that from data compiled at individual institutions. To analyze interclinician reliability more precisely, a different study design from that employed in the present study may be required.

Based on intraclinician reliability and the results of analysis of the relation between patient satisfaction and outcome, however, the validity of the JSSF standard rating system and the JOA scale was high for the items evaluated. It can be considered that clinical evaluation of therapeutic results using these scales would be highly reliable.