Introduction

Patient-based outcome instruments, which are used to measure changes in health status over time, have become increasingly popular. The four basic types of patient-based outcome instruments are generic, disease-specific, region-specific, and patient-specific. A region-specific instrument contains items specific to only one body part and can be used with several different disease states affecting a specific region. The Japanese Society for Surgery of the Foot (JSSF) is developing a QOL questionnaire for use in individuals with pathological conditions related to the foot and ankle as a region-specific outcome instrument.

The questionnaire, named the Self-Administered Foot Evaluation Questionnaire (hereafter referred to as the “SAFE-Q”) version 1, was subjected to an initial field test [1], after which it was revised to a second version [2]. The main body of the SAFE-Q version 2 consists of 34 questionnaire items, providing five subscale scores (1: Pain and Pain-Related; 2: Physical Functioning and Daily Living; 3: Social Functioning; 4: Shoe-Related; and 5: General Health and Well-Being). In addition, the instrument has nine optional questionnaire items that provide a Sports Activity score.

The SAFE-Q version 2 was subjected to a limited field test. Tentative scores for the five subscales were compared to their corresponding scales in the Short Form 36 Health Survey, version 2.0 (SF-36) [3] and the JSSF Scale score [4, 5], and the results obtained were reasonable [2]. Therefore, based upon its favorable performance in the previous field test [2], the JSSF decided to evaluate the second version of the SAFE-Q further by applying it to a larger sample of patients with foot and ankle disorders as well as a control sample of healthy teenagers and adults.

Because the factor structure of the responses to the instrument was valid in the former study, the primary aim of the present field survey was to evaluate the test-retest reliability. A secondary aim was to test the influence of background factors such as region-specific classification, age group, and gender on the subscale scores. This report provides an analysis of the data gathered in this second field test of the second version of the SAFE-Q.

Patients and methods

Study group

In the present field survey, the SAFE-Q version 2 was administered to 876 patients with pathological conditions related to the foot and ankle. A total of 491 non-patients consisting of healthy teenagers and adult volunteers were also analyzed. Both patients and non-patients had been registered in a total of 99 institutions in Japan.

Although the SAFE-Q version 1 has already been presented in a previous article [1], we have provided the SAFE-Q version 2 in “Appendix 1” for the sake of reader convenience. In addition, the manual for the SAFE-Q is shown in “Appendix 2.”

Among the 876 patients, 131 with stable pathological symptoms participated in the test-retest reliability evaluation. These patients answered the same questionnaire form twice. The interval between the first and second tests was a minimum of eight weeks. At the first administration, the subjects also answered an SF-36 questionnaire form, and a physician recorded the JSSF Scale scoring form.

Ethical issue

This study was approved by the Life Ethics Committee of St. Marianna University School of Medicine in 2007 (no. 1192). The extension of the research period until 2014 was approved in 2012.

Statistical analysis

EFA and CFA

An exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) were performed. These were done to determine whether the factor structure was stable, given that the patient population in this field test comprised a wide variety of pathologies. Response data from the patients during the first administration (but not the retest) of this field test were subjected to the same EFA and CFA as used in the first field test of the second version.

Computation of subscale scores

Subscale scores were computed for each of the five subscales. For each subscale, the average of the non-missing values of the items contributing to that subscale was computed for each respondent. Prior to averaging, VAS items were rescaled to conform to the ranges of the categorical items. Averages were then rescaled so that the final subscale scores ranged from zero (least healthy) to 100 (healthiest), inclusive.
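As an illustration only, the scoring rule described above can be sketched as follows. The item range (0 to 4), the VAS range (0 to 100), and the function names are assumptions made for this sketch and are not taken from the SAFE-Q manual; items are assumed to be already oriented so that higher values indicate better health.

```python
from typing import Optional, Sequence


def rescale_vas(vas_value: float, vas_max: float = 100.0,
                item_min: float = 0.0, item_max: float = 4.0) -> float:
    """Map a VAS reading onto the range of the categorical items.

    The ranges are hypothetical defaults, not values from the SAFE-Q manual.
    """
    return item_min + (item_max - item_min) * vas_value / vas_max


def subscale_score(responses: Sequence[Optional[float]],
                   item_min: float = 0.0,
                   item_max: float = 4.0) -> Optional[float]:
    """Average the non-missing item values and rescale the mean to 0-100.

    Missing answers are represented by None. If every item is missing,
    no score can be computed and None is returned.
    """
    answered = [r for r in responses if r is not None]
    if not answered:
        return None
    mean_item = sum(answered) / len(answered)
    return 100.0 * (mean_item - item_min) / (item_max - item_min)


# Example: one VAS item (75 on a 0-100 line) plus three categorical items,
# one of which was left unanswered.
example = [rescale_vas(75.0), 4.0, None, 2.0]
score = subscale_score(example)
```

Averaging over answered items only means that a respondent who skips an item still receives a score on the same 0 to 100 scale as everyone else.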

Test-retest reliability

Each subscale’s scores were subjected to a random-effects linear regression with test-retest as a categorical predictor. The intraclass correlation coefficient (ICC) was computed as the index of reliability for each scale. Ninety-five percent confidence intervals (95 % CIs) for ICCs were computed by parametric bootstrapping [6] using 100 bootstrap samples of patients with scores for the scale for both test and retest administrations of the questionnaire.
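The reliability index can be illustrated with a simplified sketch. Note that this is not the procedure used in the study: the sketch substitutes a one-way random-effects ANOVA estimator of the ICC for the random-effects regression, and a percentile bootstrap that resamples patients with replacement for the bootstrapping method of [6]. The function names, the default number of resamples, and the example data are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[float, float]  # (test score, retest score) for one patient


def icc_oneway(pairs: Sequence[Pair]) -> float:
    """One-way random-effects ICC for two measurements per patient."""
    n, k = len(pairs), 2
    grand_mean = sum(a + b for a, b in pairs) / (n * k)
    subject_means = [(a + b) / k for a, b in pairs]
    # Between-patient and within-patient mean squares.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum((a - m) ** 2 + (b - m) ** 2
                    for (a, b), m in zip(pairs, subject_means)) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def bootstrap_ci(pairs: List[Pair], n_boot: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI obtained by resampling patients (assumption)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resample = [rng.choice(pairs) for _ in pairs]
        try:
            estimates.append(icc_oneway(resample))
        except ZeroDivisionError:
            # Degenerate resample with no variation at all; skip it.
            continue
    estimates.sort()
    lower = estimates[int(alpha / 2 * len(estimates))]
    upper = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lower, upper


# Hypothetical test and retest scores for six patients.
scores = [(60, 65), (80, 78), (30, 40), (90, 85), (50, 55), (70, 60)]
icc = icc_oneway(scores)
ci_low, ci_high = bootstrap_ci(scores)
```

Only patients with scores at both administrations enter the computation, which matches the restriction stated above.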

Comparison with JSSF Scale scores

Spearman’s rank correlation coefficients were computed between the scores for each of the five SAFE-Q subscales and the JSSF Scale scores (which were only taken from patient responders during the first administration of the questionnaire).
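For readers unfamiliar with the statistic, a minimal sketch of Spearman's rank correlation is given below. It ranks both variables (tied values receive the average of their ranks) and then computes the Pearson correlation of the ranks. The function names and the example values are illustrative assumptions, not study data.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical subscale scores and JSSF Scale scores for five patients.
subscale = [55.0, 70.0, 70.0, 90.0, 40.0]
jssf = [60.0, 72.0, 68.0, 95.0, 50.0]
rho = spearman(subscale, jssf)
```

Because only ranks are used, the coefficient is unaffected by the floor and ceiling effects that distort the spacing of raw scores.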

Comparison with SF-36 scores

Spearman’s rank correlation coefficients were computed between the scores for each of the five SAFE-Q subscales and those for each of the eight SF-36 subscales. Scores for each of the eight SF-36 subscales were computed using the Japanese norm-based scoring method as prescribed in the commercial instrument’s documentation [3]. Again, QOL scores were only taken from patient responders during the first administration of the questionnaire.

Comparison of scores for the Pain and Pain-Related subscale and the SF-36 Bodily Pain subscale

We compared the patients’ scores for the Pain and Pain-Related subscale with the scores for the SF-36 Bodily Pain subscale. For this purpose, we extracted the values for the Pain subscale scores from the JSSF Scale scores. On the JSSF Pain subscale, 0, 20, 30, or 40 points are assigned to patients with diseases of the ankle and hindfoot, midfoot, hallux, or lesser toe, and 0, 10, 20, or 30 points are assigned to patients with rheumatoid arthritis. We then computed Spearman’s rank correlation coefficient between the JSSF Pain score and either the Pain and Pain-Related score or the SF-36 Bodily Pain score for each of the patient groups.

Background factors

The following patient characteristics were assessed using scores from the first administration of the questionnaire: patient group in the JSSF Scale classification, age group, and gender. Patient groups in the JSSF Scale classification were as follows: ankle and hindfoot, midfoot, hallux, lesser toe, and rheumatoid arthritis. Respondents were grouped by age as follows: 16–39, 40–64, and 65 and older, inclusive. For the patient groups and patient-age groups, each of the five subscales was assessed by means of one-way analysis of variance (ANOVA). Gender comparisons were made by means of Student’s t test in each subgroup of patients classified by patient group and age group. Dunnett’s multiple comparisons test was performed afterward to compare patient groups. In order to stabilize the variances in the presence of floor and ceiling effects, the data were arcsine square-root transformed prior to performing ANOVA or other tests.
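The variance-stabilizing step and the ANOVA can be sketched as follows. This is a minimal illustration that assumes scores on the 0 to 100 scale are divided by 100 before the arcsine square-root transform; the exact scaling used in the study is not stated, and the function names and example data are hypothetical. The sketch returns the F statistic and its degrees of freedom only.

```python
import math
from typing import List, Sequence, Tuple


def arcsine_sqrt(score: float, max_score: float = 100.0) -> float:
    """Arcsine square-root transform of a score expressed as a proportion."""
    return math.asin(math.sqrt(score / max_score))


def one_way_anova_f(groups: Sequence[Sequence[float]]) -> Tuple[float, int, int]:
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical subscale scores for three age groups.
raw_groups: List[List[float]] = [
    [95.0, 88.0, 100.0, 76.0],
    [80.0, 72.0, 91.0, 65.0],
    [60.0, 55.0, 78.0, 49.0],
]
transformed = [[arcsine_sqrt(v) for v in g] for g in raw_groups]
f_stat, df_b, df_w = one_way_anova_f(transformed)
```

The transform stretches values near 0 and 100, where floor and ceiling effects compress the variance. In practice, a library routine such as `scipy.stats.f_oneway` would also return the p value for the F statistic.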

Patient versus non-patient comparison

Scores for each of the five subscales were compared between patients (first administration of the questionnaire) and non-patients by means of the Mann–Whitney test. This nonparametric test was used for this comparison due to concern over the large proportion of ceiling responses in the healthy group.
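A minimal sketch of the Mann–Whitney comparison is shown below. It computes the U statistic from pooled ranks and a two-sided p value from the normal approximation with a tie correction and no continuity correction; this choice of approximation is an assumption of the sketch, since the study does not state which variant was used. Function names and example data are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared_rank
        i = j + 1
    return ranks


def mann_whitney(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U for sample x, two-sided p from the normal approximation)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = average_ranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    # Tie correction: each group of t tied values contributes t^3 - t.
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        # All pooled values are identical, so there is no evidence of a shift.
        return u1, 1.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u1, p_value


# Hypothetical scores: patients versus non-patients with many ceiling values.
patients = [45.0, 60.0, 72.0, 80.0, 55.0, 66.0]
non_patients = [100.0, 100.0, 95.0, 100.0, 88.0, 100.0]
u_stat, p = mann_whitney(patients, non_patients)
```

The tie correction matters here because a large share of healthy respondents score exactly 100, which produces many tied ranks.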

Sports items

Sports-related questionnaire items were scored as above, taking into account the reversal of sense of the VAS item among them. EFA was applied to the responses of patients during the first administration of the questionnaire in order to confirm the unidimensionality of the scale. The test-retest reliability of the sum of these items’ scores was assessed as above.

Statistical probability

In the statistical comparisons, a p value of less than 0.05 was considered statistically significant. Below, for all p values less than 0.001, we simply state p < 0.001, even when the exact value was obtained from the computation.

Results

Patient and non-patient classification and age

The classification of the subjects enrolled in the present field study is summarized in Table 1. A total of 876 patients and 491 non-patients were registered. The majority of the patients had diseases of the ankle and hindfoot (469). Numbers of patients in the lesser toe (45) and midfoot (68) groups were less than 100. The JSSF region-specific classification was not reported for eight patients. The mean age of the patients in each group and that of the non-patients are also indicated in Table 1. As a whole, the mean ages of the patients and non-patients were 52.6 ± 18.0 (mean ± SD; n = 876) and 44.6 ± 16.6 (n = 491), respectively.

Table 1 Numbers of patients and non-patients

Factor analysis

The factor structure was remarkably stable, in that factor loadings and residual variances were essentially the same as those obtained in the previous field test of the SAFE-Q version 2 (data not shown). The factor correlation coefficients resulting from the CFA are summarized in Table 2. All of the correlations between different subscale factors were less than 0.9. The maximum coefficient was 0.841, for the correlation between the Physical Functioning and Daily Living subscale and the Social Functioning subscale.

Table 2 Factor correlation coefficients among five subscales resulting from confirmatory factor analysis

Test-retest reliability

The value of the ICC for each of the five subscales is listed in Table 3. The ICC was always larger than 0.7; even the minimum 95 % CI lower limit for the Social Functioning subscale was larger than 0.6. The ICC for the sum of the subscale scores was 0.85 (with a 95 % CI of 0.81 to 0.89), which is, as expected, higher than any of the individual components.

Table 3 Values of ICC observed for the five subscales

Distribution of subscale scores

The distributions of the subscale scores are illustrated in Fig. 1. The mean ± SD and median for the five subscales were as follows: Pain and Pain-Related: 66.0 ± 23.8, 70.1; Physical Functioning and Daily Living: 69.2 ± 26.2, 75.0; Social Functioning: 66.3 ± 32.4, 75.0; Shoe-Related: 62.7 ± 30.4, 66.7; General Health and Well-Being: 66.8 ± 29.7, 75.0. The width between the 25th percentile and the 75th percentile was broad in the Social Functioning, Shoe-Related, and General Health and Well-Being subscales, while smaller widths were observed in the Pain and Pain-Related and Physical Functioning and Daily Living subscales. The values of the means were very similar for the five subscales, ranging from 60 to 70.

Fig. 1 Subscale score distributions. Left and right rectangle edges indicate the 25th and 75th percentiles. Vertical lines within the rectangles show the medians. Bullet marks indicate the means. Left and right ends of the horizontal lines passing through the rectangles represent the 5th and 95th percentiles

Comparison with the JSSF Scale score

The distribution of the JSSF Scale score is illustrated in Fig. 2. The mean ± SD and median were 69.4 ± 20.9 and 72 (n = 864), respectively. The JSSF score was correlated with each of the present subscale scores. The Spearman’s rank correlation coefficients are summarized in Table 4, where the patients are classified into JSSF patient groups. The scores for the five subscales display statistically significant correlations (p < 0.001) with the JSSF Scale score, with rank correlation coefficients ranging from 0.51 to 0.61 (Table 4). This tendency was the same in each group of patients. However, slightly smaller correlation coefficients were observed in the lesser toe group, which contained 45 patients.

Fig. 2 Distribution of the JSSF Scale scores for the present patients

Table 4 Correlations with the JSSF score for the five patient groups

SF-36

The Spearman rank correlation coefficients between each of the five subscales of the SAFE-Q and each of the eight SF-36 subscales were all statistically significantly different from zero (p < 0.001), as summarized in Table 5. The correlation coefficient for the Pain and Pain-Related subscale was highest with Bodily Pain; the correlation coefficient for the Shoe-Related subscale was highest with Bodily Pain and Physical Functioning; the correlation coefficient for the Physical Functioning and Daily Living subscale was highest with Physical Functioning; the correlation coefficient for the Social Functioning subscale was highest with Role Physical and Bodily Pain (but nearly as high with Social Functioning and Physical Functioning); the correlation coefficient for the General Health and Well-Being subscale was highest with Bodily Pain. In these particular patients, the scores obtained with these two instruments appeared to be largely driven by pain and difficulty with mobility. The mean ± SD of each norm-based [3] SF-36 subscale score is also shown in Table 5. The mean of the norm-based SF-36 score ranged from 36 to 47 for these patients, indicating that the patients were somewhat below average in their health status.

Table 5 Comparison of scores for subscales of the SAFE-Q version 2 with SF-36 subscale scores

Comparison of scores from the SAFE-Q Pain and Pain-Related subscale and SF-36 Bodily Pain subscale scores

Results of comparisons of the Spearman rank correlation coefficients are summarized in Table 6. The Spearman’s rank correlation coefficients from the Pain and Pain-Related subscale were larger than those from the SF-36 Bodily Pain subscale in all groups of patients. Statistical significance (p < 0.05) was found in the ankle and hindfoot and the hallux groups.

Table 6 Comparisons of Spearman’s rank correlation coefficients between the present Pain and Pain-Related subscale score and the SF-36 Bodily Pain subscale score

Patient characteristics

Comparison among patient groups

A comparison of the mean subscale scores and SDs of the different JSSF patient groups is provided in Fig. 3. The scores for the five patient groups were statistically significantly different according to one-way ANOVA, for all subscales. The p values from ANOVA were smaller than 0.001 for the Physical Functioning and Daily Living and the Shoe-Related subscales, and were between 0.002 and 0.02 for the other subscales. Patients with rheumatoid arthritis showed the lowest mean values for the five subscales, and the differences between these mean values and those of other patient groups were sometimes found to be statistically significant upon performing Dunnett-type comparison tests, as shown in Fig. 3.

Fig. 3 Comparison of the means and SDs of the five subscale scores among the five JSSF patient groups: 1 ankle and hindfoot; 2 midfoot; 3 hallux; 4 lesser toe; and 5 rheumatoid arthritis. Asterisks (*p < 0.05; **p < 0.01) indicate p values from Dunnett-type comparisons with the rheumatoid arthritis group

Age and gender

The subscale scores for male and female patients were compared for three age groups (ages 16–39, ages 40–64, ages 65 and older, inclusive) in Fig. 4. The size of the sample analyzed in this work is large enough to allow subscale-specific comparisons of scores among age groups and genders. One-way ANOVA revealed that there were statistically significant differences (p < 0.001) among the age groups in all five subscales when only female patients were considered. When only male patients were considered, there were no statistically significant differences among the age groups for any of the subscales aside from the Physical Functioning and Daily Living subscale (p < 0.001). The scores of male and female patients are also compared in Fig. 4. In all subscales, the male scores were higher than the female scores, whichever age group was considered; the differences between the male and female scores were sometimes significant, as shown in Fig. 4.

Fig. 4 Mean subscale scores (and their SDs) for each age group (1 ages 16–39, 2 ages 40–64, 3 ages 65 and older, inclusive) and gender (open column male; closed column female). **p < 0.01 for comparisons between genders. Using ANOVA, the female-only scores were found to be significantly different among the age groups for all subscales (p < 0.001), but when only males were considered, only the Physical Functioning and Daily Living subscale scores were significantly different among the age groups

Comparison of the scores of patients and non-patients

As expected, patients scored lower (less healthy) on average than non-patients on each of the five subscales (Table 7). The p value from the Mann–Whitney test comparing patients and non-patients was less than 0.001 for all five subscales. The means and SDs of the five subscale scores for non-patients are summarized in Table 8. Older non-patients tended to present lower mean values than younger non-patients, and female non-patients tended to present lower mean values than male non-patients.

Table 7 Comparison of the subscale scores of patients and non-patients
Table 8 Means and SDs of the five subscale scores for non-patients, classified by age and gender

Sports items

Optional sports items were responded to by 275 patients and 197 non-patients. EFA of the resulting patient data showed that these items contributed to a single major factor, as seen before (data not shown). The test-retest reliability for sports items was similar to that observed for the other sets of items: ICC = 0.76, with a 95 % CI of 0.64–0.87. The mean ± SD of the Sports Activity score was 45.3 ± 34.2 in patients, and it was 95.7 ± 10.9 in non-patients. The difference in the mean scores of patients and non-patients was statistically significant (p < 0.001).

Discussion

Several patient-based, region-specific outcome instruments for patients with diseases or injuries of the foot and ankle region have been developed, such as the American Academy of Orthopaedic Surgeons lower limb outcomes assessment instruments (including the Foot and Ankle Module, AAOS-FA) [7], the Foot and Ankle Ability Measure (FAAM) [8], the Foot Health Status Questionnaire (FHSQ) [9], and the Foot Function Index [10]. Recently, a comparison of the responsiveness of the Manchester–Oxford foot questionnaire (MOXFQ) with those of the American Orthopaedic Foot and Ankle Society (AOFAS) [11], SF-36 [12], and EuroQol (EQ-5D) [13] assessments following foot or ankle surgery was published [14]. Although the MOXFQ is a patient-based outcome measure, it was originally developed based on interviews with patients who had foot surgery. In the interviews, however, the Manchester Foot Pain and Disability Questionnaire (MFPDQ) [15] had been utilized as a template. In addition, the measurement properties of the MOXFQ were initially assessed in a specific group of patients undergoing surgery for hallux valgus [16, 17]. In this context, there is no new and original patient-based outcome instrument focusing only on the foot and ankle that is similar to the various instruments that have already been verified to be valid, repeatable, and reliable.

There are potential advantages and disadvantages associated with each of these instruments [18], and there is an ongoing process whereby evidence is collected to support their use under various conditions. The usefulness of an outcome instrument is never completely established. There is currently an urgent need for scientific evaluation of foot and ankle surgery, which in turn requires the use of appropriate (patient-based) standard methods of outcome assessment. In this context, the Japanese Society for Surgery of the Foot (JSSF) is developing a QOL questionnaire for use in individuals with pathological conditions related to the foot and ankle as a region-specific outcome instrument.

The present field test of the second version of the SAFE-Q replicated the factor structure of the same version of the SAFE-Q in its first field test (which had a smaller patient sample). The test-retest reliability was high for each of the subscales and for the sum of all subscales. Gender-related differences, observed in particular for the Shoe-Related subscale and Physical Functioning and Daily Living subscale, might reflect the well-known foot-health consequences of women wearing high-heeled footwear and women’s more fashion-oriented attitude towards shoes. Age-related differences might reflect a general decline in overall health and physical vigor, as well as a generally reduced ability to recover quickly from health-related problems.

The differences between patient groups were also statistically significant according to ANOVA. In particular, patients with rheumatoid arthritis appeared to fare more poorly than patients in other region-specific categories (Fig. 3). Nevertheless, the averages for the patient groups fell in a relatively narrow range, indicating that the SAFE-Q subscales are sufficiently general to allow their use in all patient groups.

As expected, the SAFE-Q readily distinguished patients with foot and ankle disorders from non-patients. The mean scores on the subscales ranged between 60 and 70, which may lead to concern over the sensitivity or dynamic range of the QOL instrument in these patients. In this regard, the distribution of JSSF scores observed in the patients implies that most of the patients did not have severe symptoms (Fig. 2). This is a plausible reason for the narrow range of mean values observed in the present field test.

Given the large sample size, the coefficients for the correlation of the SAFE-Q subscale scores with the JSSF Scale score were all highly statistically significantly greater than zero. Likewise, the coefficients for the correlation of the SAFE-Q subscale scores with the SF-36 subscale scores were statistically significantly greater than zero for the same reason. Nevertheless, there was a qualitative alignment of the two QOL scales when the correlation coefficient values were examined. The lack of perfect alignment suggests that the SAFE-Q constructs measured in these patients differ somewhat from those measured by the corresponding subscales in the more general SF-36 instrument. It does appear, however, that the scores obtained using both instruments are largely driven by pain and difficulty with mobility in these patients.

The nine items of the Sports Activity subscale of the SAFE-Q consist of questions about very basic performance of sports activities [8, 19, 20]. Regarding the Sports Activity subscale, the unidimensionality of the items remained stable and the difference between patients and non-patients was apparent. In addition, the test-retest reliability was adequate. Therefore, we will add these nine items to the responsiveness analysis without changing them.

As reviewed by Martin and Irrgang [18], validity testing of QOL outcome instruments should include assessments of content validity, construct validity, test-retest reliability, and responsiveness. In our process, content validity was confirmed for the SAFE-Q version 1 [1] and version 2 [2] through the various Cronbach α metrics. Regarding construct validity, we ascertained convergence by comparing the SAFE-Q subscales with the JSSF scales and SF-36 subscales. We also studied the convergence and divergence [21] by evaluating the results from CFA. That is, we observed that the factor loading of each questionnaire item was large for the intended subscale and small for the other subscales in the previous field study, and similar results were seen in the present study.

As described above, we were able to verify that the test-retest reliability was high for each subscale. The comparison of Spearman rank correlation coefficients shown in Table 6 suggests that the Pain and Pain-Related subscale is more responsive than the SF-36 Bodily Pain subscale. However, there is no other clear standard that could be used to gauge the responsiveness of the other subscales. Additionally, the responsiveness should be evaluated by performing a longitudinal study. In the future, it will be beneficial to test the responsiveness of the present outcome instrument.