Background

It has been estimated that the prevalence of foot pain in community dwelling adults aged 65 years and over is between 20 and 42% [14] and foot pain is known to contribute to locomotor disability [19]. However, research has been hampered by the lack of an instrument with which to measure foot-related disability. The Manchester Foot Pain and Disability Index (FPDI) [10] could potentially fill this gap. The FPDI is a self-complete questionnaire consisting of 19-items, each of which has three possible response categories: "none of the time", "on some days" or "on most/every day(s)" [10]. These items were developed from interviews with people attending foot clinics for treatment who were asked open-ended questions about pain, disability, activity limitation and footwear [10]. In the development of the questionnaire, it was suggested that the two items relating to work and leisure be removed, as they might not relevant to all populations. Exploratory factor analysis then suggested that the remaining 17 items could be formed into four subscales: functional problems (10 items), two pain intensity constructs (2 items and 3 items) and personal appearance (2 items). The authors suggested that the two pain intensity subscales be combined to give 3 subscales in total (function, pain intensity, appearance) over the 17 items [10].

In the original development of the FPDI, Garrow et al [10] suggested that a simple score could be derived for each subscale. However, in their subsequent population survey, they defined disabling foot pain as present if at least one of the 17 pain intensity, function or appearance items occurred on at least "some days" in the past month [6]. Other authors have also used this approach [11, 12]. A further study by Cook et al used exploratory factor analysis to derive two subscales (foot and ankle function (9 items) and pain and appearance (7 items)) for the FPDI having deleted one item ("My feet are worse in the morning") because it did not load on to either of the factors [13]. These authors called this the Modified Manchester FPDI. However, a more recent study by Roddy et al [14] undertook confirmatory factor analysis to verify the original three subscales of Garrow et al in the 17 items (function (10 items), pain (5 items) and appearance (2 items)) [10] and demonstrated the validity and reliability of a new definition of disabling pain that required the occurrence of a problem on at least one of the ten items on the function subscale on "most/every day(s)" in the past month. In this latter study [14], the definition of disabling pain was modified, as using Garrow's definition [6], 98% of older adults with foot pain were classified as having disabling foot pain.

Each of the definitions described above produces a dichotomous evaluation of disabling foot pain, that is, disability is either present or absent. In reality, the disability caused by foot pain will be displayed along a continuum, with different people displaying differing degrees of disability. Garrow et al proposed that, using a simple scoring system, individual scores for each of the three subscales could be generated to produce an overall index of disability [10] and then, in a later study, suggested summating scores for each of the subscales expressed as a percentage ("none of the time" = 0, "on some days" = 1, "on most/every day(s)" = 2) [6]. This scoring system was used subsequently by Menz et al to produce a total FPDI score ranging from 0 to 34 in addition to subscale scores [12]. Other authors have used a different scoring system ("none of the time" = 1, "on some days" = 2, "on most/every day(s)" = 3) to produce a total score ranging from 0 to 51 and individual subscale scores [13, 15]. However, these summated totals were not suitable to correctly examine changes in score over time, or differences in scores between groups, because they were not shown to be unidimensional and were not of an interval-level, i.e. where a difference of, say, two points on the score is equivalent at all points along the continuum [16, 17].

The only way to derive interval-level scores from ordinal item responses such as those in the FPDI is through the use of the Rasch unidimensional measurement model [18, 19]. The objective of this study was to employ the Rasch model to assess the performance of the three FPDI subscales and to attempt to derive interval level subscale scores for each of the three factors of the FPDI [10, 14].

Methods

Study sample

Data for these analyses were collected in a pilot study for the North Staffordshire Osteoarthritis Project (NorStOP). The methodology for mailing Health Survey and Regional Pains Survey questionnaires in this pilot study replicated that used in the main survey, details of which have been published previously [20]. In summary, the design of the study was a two-stage cross-sectional postal survey of adults aged 50 years and over using self-complete questionnaires. A random sample of 1000 people was selected from a single general practice from the North Staffordshire General Practice Research Network. Stage 1 of the survey consisted of a Health Survey questionnaire. Responders to this questionnaire who reported foot pain in the last 12 months and gave consent to be contacted again were then sent Stage 2, a Regional Pain Survey questionnaire, which gathered more detailed information on their foot problems, including the Manchester Foot Pain and Disability Index [10].

The Rasch model

The Rasch model has been described in detail elsewhere [2123]. Briefly, a logistic function is used to relate the difficulty of an item to the ability of a person in order to obtain an interval-level score. Estimates of item difficulty and person ability are independent of each other [24], making the scale score relatively distribution-free [21]. The following sections describe characteristics explored within the Rasch model and how they are evaluated.

The model

The partial credit Rasch model [25] was used to create a separate score for each subscale of the FPDI (function, pain, appearance) using the RUMM2020 Rasch analysis package [26].

Threshold plots were inspected to ensure that response categories were ordered as would be expected (i.e. that respondents considered endorsing an item on "some days" to represent more disability than endorsing an item "none of the time", but less disability than endorsing it on "most/every day(s)").

Unidimensionality

It is essential that any scale is measuring only a single construct [27]. To ensure that the FPDI scales were unidimensional, a principal components analysis of the residuals was performed. The aim of this is to identify patterns of the residuals once the 'Rasch factor' has been extracted. This is important in order to identify any subsets of items that may be loading together, and therefore may represent a different construct. The absence of any meaningful pattern in the residuals is deemed to support the assumption of local independence of the items. In order to explore this, the two most different groups of items (i.e. those whose fit residuals load negatively and those that load positively onto the first component) were ascertained from the principal components analysis. These two sets of items produce the most different estimates of person location. Using these two sets of person locations, independent sample t-tests were conducted to assess the proportion of people in which there was a significant difference between the person locations based on the two groups of items. In order to accept that all of the items in a scale were measuring the same underlying construct, it was required that no more than 5% of these t-tests result in a p-value < 0.05 [27].

Response dependency

Response dependency occurs when the response to one item determines the response to another item [28]. For example, if a person can walk a mile, they must also be able to walk half a mile. Response dependency was assessed via the residual correlations between items, with a positive correlation noticeably higher than other correlations [29] taken to indicate dependency.

Item fit

Overall item fit was examined via the mean item fit residual. This value was expected to be approximately zero, with a standard deviation (SD) of one if the data fit the Rasch model.

The fit of individual items was examined in three different ways; the individual item fit residuals, a chi-square test and an F-test, giving three perspectives on the fit of the items [30]. The item fit residual was expected to be in the range -2.5 to +2.5 [31]. For the chi-square and F-tests, the null hypothesis was that the data were a good fit to the Rasch model. Therefore, p-values < 0.05 indicated poor fit of the item to the model. The F-test is generally more sensitive to departures from the Rasch model than the chi-square test [29]. Bonferroni adjustments [32] were made to the significance levels for the chi-square and F-tests, based on the number of items in the scale, to account for multiple testing. Therefore the critical values for each of the scales were: function 0.005, pain 0.01 and appearance 0.025.

Person fit

Overall fit of persons to the model was examined via the mean person fit residual. As with the item fit residual, if the data fit the Rasch model, the mean value was expected to be approximately zero with a standard deviation of one.

Individual person fit was assessed via the individual person fit residuals. A residual value less than -2.5 was considered indicative of a purer Guttman response pattern [33] than expected by the probabilistic Rasch model and was not regarded as problematic. A residual value greater than +2.5 was considered to be indicative of an unexpected response pattern under the Rasch model and was further investigated with a view to removing such persons from the sample [30].

Overall fit to the Rasch model

The item-trait interaction statistic is a measure of the overall fit of the data to the Rasch model. A statistically significant result on this chi-square test indicated that the hierarchical ordering of the items was not constant along the latent trait [34] and hence an interval level score has not been created.

Differential item functioning

Differential item functioning (DIF) occurs when different groups of respondents (e.g. males and females) respond differently to an individual item, despite having the same level of the underlying trait [30]. This is important because DIF can be considered a breach of unidimensionality and so items displaying substantial DIF were considered for removal from the scale [31].

In these analyses, DIF was assessed by means of a 2-way analysis of variance (ANOVA) for gender and age group (50 to 59 years, 60 to 69 years, 70 years and over) separately. A significant main effect for gender (age group) would indicate uniform DIF, i.e. males and females (different age groups) responded systematically differently to the item in question along the latent trait. A significant interaction effect between gender (age group) and the trait would indicate the presence of non-uniform DIF on this item, i.e. males and females (different age groups) responded differently to the item in question and this difference varied along the continuum of the latent trait. As for the analysis of item fit, the critical values for each of the scales were: function 0.005, pain 0.01 and appearance 0.025 after applying the Bonferroni correction [32].

Targeting of the scale

The targeting of the items and persons was assessed by comparing the mean person location to the mean item location (constrained to be zero). A negative mean person location indicates that the average item difficultly is above the average disability of the sample. A positive mean person location indicates that the average item difficulty is above the average disability of the sample. A mean person location of zero indicates that the items and the sample are perfectly targeted.

The Person Separation Index (PSI) was considered as a measure of the ability of the scale to differentiate between people. A value of 0.7 was considered suitable for group comparisons [30].

Results

Study sample

Of the 1000 Health Survey questionnaires mailed, 745 completed questionnaires were returned (adjusted response rate 77.3%). Two hundred and seventy-five respondents reported that they had experienced foot pain in the previous year. Two hundred and twenty-three of these provided consent for further contact and were mailed a Regional Pains Survey questionnaire. One hundred and ninety-seven completed questionnaires were received. The initial sample for this study then consisted of 149 people (63% female, mean (SD) age 66.1 (9.5) years) who reported foot pain on both the Health Survey and Regional Pains Survey questionnaires and had answered at least some of the FPDI items. Although a Rasch score can be estimated for those people with extreme scores (i.e. responded "none of the time " or "on most/every day(s)" to all items within a subscale), these people cannot be used in the estimation of model parameters. Hence, having removed those with extreme scores, 131 people were available for the derivation of the function subscale score, 133 for the pain subscale and 36 for the appearance subscale. This sample size for the appearance subscale was considered to be too small to allow assessment of the subscale's properties, and so further analyses of the two appearance items were not undertaken.

Fit of the data to the Rasch model

Thresholds for all items in the function and pain subscales were ordered as expected.

Unidimensionality

Independent t-tests showed the function and pain subscales of the FPDI to be unidimensional with less than five percent of people having different locations at the five percent level (function: 4.6% (95% CI 0.8%, 8.3%); pain: 0.8% (-3.0%, 4.5%)).

Response dependency

There were no positive residual correlations noticeably larger than the other correlations in any of the subscales. Correlations were in the range -0.28 to +0.09 for the function subscale and -0.36 to -0.10 for the pain subscale. Hence there was no evidence of response dependency in any of the subscale items.

Item fit

Item locations and their standard errors are shown in Table 1. These locations allow the ordering of the items in terms of the difficulty of the tasks to which they pertain. The first item in the function scale is Item 6 (avoid walking on hard or rough surfaces) with a location on the foot function scale of -1.339 logits, i.e. the analysis indicates that walking on rough or hard surfaces is the most difficult task on the scale for people with foot pain to perform and, hence, is avoided by those with even the mildest level of disability, as measured by the FPDI. Item 1 is the last item with a location of +2.166 logits, i.e. the analysis indicates that walking outside is the least difficult task on the scale and, hence, is avoided by only those with very poor function.

Table 1 Item locations and fit statistics for the 15 items of the FPDI function and pain subscales

Overall item fit as described by the mean (SD) item fit residual was good for the function and pain subscales (function: -0.217 (1.233); pain: 0.308 (1.187)). Table 1 shows the fit of the individual items. There was no misfit as measured by the item residuals or the chi-square fit statistic in either of the subscales, after applying the Bonferroni correction. In the pain scale, there was misfit on the F-test after Bonferroni correction (p = 0.0030) on the item relating to having constant pain. Figure 1 shows that this item is slightly over discriminating.

Figure 1
figure 1

Item characteristic curve for Item 14 (constant pain in feet).

Person fit

Overall person fit as described by the mean person fit (SD) residual was reasonable in both subscales (function: -0.312 (0.944); pain: -0.216 (0.999)).

In the function scale, three individuals had a person fit residual outside the range -2.5 to +2.5. In the pain scale, one person had a residual outside this range. With one exception, the residuals outside the acceptable range were negative and hence indicative of a purer Guttman pattern than expected by the Rasch model. In the function scale, one person had a residual greater than +2.5 because of a response pattern that was unexpected under the Rasch model. This person was removed from the analysis, but this did not change the overall fit of the data to the Rasch model. Hence it was decided to retain this person in the sample.

Overall model fit

The assumption of invariance along the latent trait held in both of the subscales, as evidenced by the item-trait interaction statistics (function: Χ2 = 23.543, df = 20, p = 0.2629; pain: Χ2 = 17.318, df = 10, p = 0.0676).

Differential Item Functioning

There was no DIF by gender on either of the subscales after Bonferroni correction (Table 2).

Table 2 Differential item functioning by gender and age for the 15 items of the FPDI pain and function subscales

The age groups used in the DIF analysis were of similar sizes (50 to 59 years, n = 46; 60 to 69 years, n = 47; 70 years and over, n = 56). There was no DIF by age group on the pain subscale as all p-values were greater than 0.01. On the function subscale, there was uniform DIF by age group (p = 0.0014) with those aged 60 years and over more likely to endorse the Item 6 (avoid rough or hard surfaces) than those aged 59 years and under (Figure 2). Attempts were made to correct for this DIF by treating this item separately for those aged 50 to 59 years and those aged 60 years and over. The subscale was also assessed with this item deleted. Neither of these strategies improved overall model fit and so it was decided to retain this item in the functioning subscale in its original form.

Figure 2
figure 2

Differential item functioning for age group in the functioning scale (Item 6, avoid walking on rough or hard surfaces).

Targeting

Figure 3 shows that although there are ceiling and floor effects in both the function and pain subscales, the item thresholds are generally spread along the continuum of the traits displayed by the sample. The mean (SD) person locations for the subscales were function: -0.965 (2.136) and pain: -0.522 (1.415). Both subscales have a negative person location, indicating that, the average item difficulty is higher than the average person disability. The pain subscale is better targeted than the function subscale.

Figure 3
figure 3

Person-threshold distribution maps. A Function subscale. B Pain subscale.

The Person Separation Index was acceptable for both subscales (function: 0.915; pain: 0.718), showing a good ability to distinguish between people along the latent traits [30].

Discussion

The FPDI is a measure of disability arising as a result of foot-pain that has been used in recent epidemiological studies and clinical trials [6, 1215]. In epidemiological studies, the FPDI has been used to produce a dichotomised measure of disability, that is, disability is either present or absent. Recent clinimetric studies and a clinical trial summated the seventeen ordinal items to produce a foot disability score ranging from 0 to 34 [12] or 17 to 51 [13, 15]. In the current study, we used the Rasch unidimensional measurement model [19] to obtain interval-level scores for the FPDI pain and function sub-scales.

These analyses have shown that the function and pain subscales of the FPDI are unidimensional and that interval level scores can be obtained from the items of these subscales. It was not possible to assess the measurement properties of the appearance subscale due to the small number of people without extreme responses on this subscale. This is perhaps not surprising, as the appearance subscale consists of only two items, making scoring problematic.

There was some evidence of differential item functioning (DIF) by age on the item relating to avoiding rough and hard surfaces on the function subscale, which could indicate a lack of unidimensionality in this subscale [31]. Attempts were made to correct for this by estimating the item location separately for the younger and older age groups [30]. However, this did not improve the model overall and made the scoring of the subscale more complicated, so this was not carried forward. The item could have been deleted, but this would have changed the subscale from its original form, which was not thought to be desirable. Instead, the item was retained. Furthermore, the original t-test of unidimensionality [27] and the residual correlations between items did not suggest that the function subscale breached unidimensionality. It could be that this item displays DIF because younger people, who are generally still employed, cannot avoid such surfaces or this DIF could have arisen as a result of the small sample size. However, the presence of this DIF and potential reasons for it should be confirmed in an independent sample. It seems likely though that this is a Type I statistical error.

There was also evidence of misfit, from the F-test, for the item relating to having constant pain in the feet but it was not considered necessary to attempt to correct this misfit because of the good fit on the residual and chi-square statistics. It is also known that the F-statistic is very sensitive to departures from fit to the Rasch model [29].

Although this study has investigated the Rasch measurement properties of the FPDI items for the first time, there are several limitations that deserve consideration. The moderate sample size used in this study may have reduced the ability of the analyses to detect misfit to the Rasch model. However, all categories of all items in the pain scale were endorsed by at least 10 people, as were 8 of the items in the function scale (Item 1: 5 people endorse most/every day(s), Item 11: 9 people endorsed most/every day(s)), generally meeting the minimum sample size requirement suggested by Linacre [35]. Although the sample size was only moderate, it had enough statistical power to detect the DIF displayed by Item 6 in the function subscale with respect to age group. Also, in this subscale, the p-value for the overall fit to the Rasch model, described by the item-person interaction chi-square statistic far exceeded the value of 0.05 required in order to find no evidence against the overall fit to the Rasch model.

A further caveat is that this analysis was undertaken in a population of adults aged 50 years and over from a relatively limited geographical area of the UK, and the sample was almost entirely from a white British background. Although Rasch analysis allows a score to be calibrated independently of the distribution of item responses in the sample [21], further analyses should be carried out in younger or more ethnically diverse populations before applying the scoring mechanism more widely. It may also be possible to use the Rasch-scored FPDI in a patient population, where disability would be expected to be more severe, as the population sample in this study had a much lower level of disability than the FPDI subscales were able to measure. Again, further analyses are needed before the FPDI subscales are used in this context and the Foot Impact Scale [36] has already been developing using Rasch analysis for use in populations with rheumatoid arthritis.

In order to be fully useful in clinical practice and research, the score needs to be transferable between populations. There are two main ways in which this could be carried out: the repeated use of the Rasch model or a conversion table. If the Rasch model were to be used in every dataset, a slightly different score range would result on each occasion, but this would allow people to gain a score even if they did not complete all of the items. This option also requires that the clinician or researcher have access to Rasch analysis software. The alternative option is to use a conversion table between a simple sum score of a person's responses (0, 1, 2 for each item) and the Rasch score. This type of table would be simpler, but would mean that those people who do not complete all of the items in the subscale cannot get a score. There is currently little guidance on in the literature on how to transfer a Rasch score between populations, and the final decision on how to do this should be made by the context of each individual study.

The availability of these interval-level subscale scores for function and pain in those with foot pain will allow the severity of disability to be more finely defined than has previously been possible with the dichotomisation of these subscales [6, 12, 14]. Whilst not necessarily replacing the dichotomous scoring methods suggested by Garrow et al [10] and Roddy et al [14], this interval-level scoring will allow more detailed research, for example looking at progression of disability, than is allowed for by the simple dichotomous measure. Interval-level scores will also allow the use of the FPDI in studies where the aim is to assess change in foot pain and disability severity over time or differences between groups. The interval-level nature of the Rasch person location estimates allows for the sensible investigation of change scores over time and between groups [16, 17].

However, with a continuum of disability, it is useful to have a definition of when a score is high enough to classify the individual person as being 'disabled', or when a change in the score over time is clinically significant. Hence, further work is needed to define clinically important changes on these subscales, such that they can be used more meaningfully in longitudinal research into foot disability.

Conclusion

The FPDI has been confirmed to have two unidimensional subscales in a general population of older adults in the UK: function and pain. These subscales appear to fit the Rasch measurement model and so an interval-level score can be produced for each subscale. Further work is needed to determine this fit in more general populations and to obtain a minimal clinically important change score for the subscales in order to make them more useful in practice. It may also be useful to further examine the two-item appearance subscale of the FPDI, although this may not be worthwhile due to the small number of items in this subscale.