Background

The measurement of functional ability is important in many contexts. While there often seems to be agreement as to the content of instruments for evaluation of function, there is relatively less consensus about the scaling of items. Item scaling varies in the number of response categories, the wording of category options and the use of all-point (where all categories are defined) or end-point (where only end-points are defined) scales [1, 2]. The majority of health status and patient-reported outcome measures use all-point defined scales with between two and seven categories, the most popular being five-point scales including the agree/disagree Likert format. The generic Short Form 36-item (SF-36) Health Survey [3] uses five-point scales for seven of the eight health scales it includes. Other generic instruments such as the Nottingham Health Profile (NHP) [4] and EuroQol EQ-5D [5] use two- and three-point scales respectively. In the WHO Health and Work Performance Questionnaire, functional status is reported using different scales with between four and 11 points [6].

It has been argued that seven is the maximum number of response categories that individuals are able to process [7] and some authors have advocated the use of seven-point scales [8]. However, such scales are not widely used, possibly because of the difficulty of finding suitable adjectives when all seven points are defined. Seven categories are also harder to fit across a page of A4 with a reasonably sized typeface. On the other hand, if the number of alternatives is less than the rater's ability to discriminate, the result may be a loss of information [2, 9]. There is evidence that the reduction in reliability from ten to seven categories is quite small, but the use of five categories reduces the reliability by about 12 percent [2]. Hence it is argued that the minimum number of categories should be in the region of five to seven [2]. One review concluded that seven plus or minus two appears to be a reasonable range for the optimal number of response alternatives [9]. More recently, it was found that respondents' preferences were highest for a ten-point scale followed by seven-point and nine-point scales [10]. The respondents rated scales with five, seven and ten response categories as relatively easy to use. Scales with two, three or four response categories were rated as relatively quick to use, but were unfavourable in terms of the extent to which they allowed the respondents to express their feelings adequately. If a scale does not allow respondents to express themselves, they may become frustrated or demotivated and the quality of their responses may decrease [10].

Previous research has shown that the greater the number of response options, the more reliable the scale is likely to be [11]. Simulations of categorization error have consistently shown that the correlation between true values and scale scores increases with the number of response options [12]. Scales with relatively few response alternatives tend to generate scores with comparatively little variance, thereby limiting the magnitude of correlations with other scales [13, 14]. The reduction in reliability is most severe for scales with four categories or fewer, but tends to level off once seven or more options are available. However, there is often a trade-off between scale reliability and ease of administration [11]. One study using the NHP indicated that the psychometric performance and patient acceptability were improved by using a five-point scale instead of the original shorter response format [15].

Following a recent systematic review, it was recommended that future research designs should allocate respondents to different versions of a questionnaire to compare approaches to item scaling [1]. Our study considered two different all-point defined scales using four and five response alternatives. The Norwegian Function Assessment Scale (NFAS) was included in a large Norwegian population study on musculoskeletal pain, The Ullensaker Study 2004, to obtain self-reported levels of functional ability. Eligible persons were randomised to receive the NFAS with the original four-point scale or a five-point scale.

The aim of this study was to compare the original four-point scale version with the new five-point scale version by evaluating the validity of the NFAS in a general population. This will determine which version should be used in future applications.

Methods

Study setting and sample

Ullensaker is a rural community which had 23,700 inhabitants in 2004. There are no major differences between the population of Ullensaker and the general population of Norway with respect to demographic characteristics [16]. In 2004, postal questionnaires, which included the NFAS along with questions relating to musculoskeletal pain, were sent to all 6108 inhabitants in Ullensaker municipality in the birth cohorts 1918–20, 1928–30, 1938–40, 1948–50, 1958–60, 1968–70 and 1978–80. Reminders were sent at eight weeks.

The sample was computer-randomised by an external company to either the four-point or the five-point scale version, herein referred to as the NFAS-4 and the NFAS-5. The Ullensaker Study questionnaire also included the Dartmouth COOP Functional Health Assessment Charts/WONCA (COOP/WONCA), the General Health Questionnaire-20 (GHQ-20), the Standardized Nordic Questionnaire, and questions on work ability, sickness absenteeism and occupation.

The Regional Committee for Medical Research Ethics and The Norwegian Data Inspectorate approved the study.

The Norwegian Function Assessment Scale (NFAS)

The Norwegian Function Assessment Scale (NFAS) is a self-report instrument developed by an expert group in social insurance in 2000. It is designed to assess the need for rehabilitation and adjustment of work demands among sick-listed persons, as well as the right to social security benefits [17]. The scale comprises 39 items derived directly from the activities/participation dimension in the International Classification of Functioning, Disability and Health (ICF) [18]. The items are relevant for assessing physical and mental functioning in working life, some relating to activities of daily living. The NFAS starts with the question "Have you had difficulty doing the following activities during the last week?" and respondents rate 39 activities using a four-point scale: no difficulty, some difficulty, much difficulty, could not do it. The five-point all-point defined scale was developed to be more congruent with the qualifiers in the activities/participation dimension of ICF [19]: no difficulty, mild difficulty, moderate difficulty, much difficulty and could not do it.

Based on the results of principal component analysis from the previous study with sick-listed persons [17], the items form seven domains: Walking/standing (7 items), Holding/picking up things (8 items), Lifting/carrying (6 items), Sitting (3 items), Managing (7 items), Cooperation/communication (6 items), Senses (2 items). These domains have evidence for validity in sick listed persons [17]. The main application of the NFAS is likely to be social insurance. Hence it was decided to keep the domains from the earlier study with sick-listed persons [17]. It should, however, be anticipated that principal component analysis based on data from the general population in Ullensaker will yield somewhat different results. The first four and the last three domains are intuitively grouped into physical and mental domains respectively. Domain scores are calculated by adding the item scores and dividing by the number of items completed. NFAS total scores are calculated by adding all 39 item scores and dividing by the number of items completed. Low scores indicate good functional ability.
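The scoring rule described above can be illustrated with a minimal sketch. The function name, the use of None for a missing response and the numeric coding of the response categories (1 for "no difficulty" upwards) are assumptions made for illustration; the text specifies only that item scores are summed and divided by the number of items completed.

```python
from typing import Optional, Sequence


def mean_score(item_scores: Sequence[Optional[int]]) -> Optional[float]:
    """Sum the completed item scores and divide by the number completed.

    None marks a missing response (an assumed convention).
    """
    completed = [s for s in item_scores if s is not None]
    if not completed:
        return None  # no items answered, so no score can be computed
    return sum(completed) / len(completed)


# Hypothetical responses to the three Sitting items (coding 1-4 is assumed).
sitting = [1, 2, None]
print(mean_score(sitting))  # 1.5
```

The same function would give a total score when passed all 39 item responses. Because the divisor is the number of completed items, a respondent who skips an item still receives a score on the same scale as one who answers every item.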

COOP/WONCA

COOP/WONCA [20] is a generic health status measure, where functional status is self-reported with a time frame of the previous two weeks. It comprises six charts: Physical fitness, Feelings, Daily activities, Social activities, Overall health and Change in health. Each chart has five response alternatives with pictorial representations. The present study used an optional Pain chart in place of the Change in health chart.

General Health Questionnaire (GHQ-20)

Psychological distress during the last two weeks was measured by the GHQ-20 [21], a widely used screening instrument for measuring non-psychotic psychiatric illness in a general population. Items are scored as the original GHQ score in a bi-modal fashion (0-0-1-1) [22].
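As an illustration of bi-modal (0-0-1-1) scoring, the sketch below assumes that each GHQ-20 response is stored as an index from 0 to 3 giving the position of the chosen option, and that missing responses are stored as None; both conventions and the function name are hypothetical. The two least symptomatic options score 0 and the two most symptomatic options score 1.

```python
def ghq_bimodal_score(responses):
    """Score GHQ items 0-0-1-1 and return the sum over answered items.

    responses: option indices 0..3 (assumed coding); None = missing.
    """
    return sum(1 for r in responses if r is not None and r >= 2)


# Hypothetical answers to the 20 items.
answers = [0, 1, 2, 3, 0, 0, 1, 2, 0, 0, 1, 1, 3, 0, 0, 0, 1, 0, 2, 0]
score = ghq_bimodal_score(answers)
print(score)  # 5
print(score >= 4)  # True: at or above the cut-off for mental distress used in this study
```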

Work ability was assessed by one question "To what degree is your ability to perform your ordinary work reduced today: hardly reduced at all, not much reduced, moderately reduced, much reduced and very much reduced" [23]. Respondents were asked to report whether they had experienced any pain or discomfort in ten different body regions during the previous week [24]. Sickness absenteeism was assessed by asking the respondents if they had been sick-listed during the previous year: no, less than 1 week, between 1–8 weeks, more than 8 weeks. Occupation was assessed with the categories: employed, housekeeping/full-time household work, unemployed, medical rehabilitation, disability pension, retired or student.

Statistical analyses

Data quality

The two versions of the NFAS were compared for levels of missing data, and floor and ceiling effects, which were expressed as percentages.
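A minimal sketch of how these percentages might be computed for a single item is given below. It assumes that end effects are the shares of answered responses falling in the lowest and the highest category, and that missing data are expressed as a share of all returned questionnaires; the denominators, the function name and the example data are assumptions, not details taken from the study.

```python
def item_data_quality(responses, lowest, highest):
    """Return (missing %, lowest-category %, highest-category %) for one item.

    responses: one entry per respondent; None = missing (assumed convention).
    """
    n = len(responses)
    answered = [r for r in responses if r is not None]
    missing_pct = 100 * (n - len(answered)) / n
    if not answered:
        return missing_pct, None, None
    lowest_pct = 100 * sum(r == lowest for r in answered) / len(answered)
    highest_pct = 100 * sum(r == highest for r in answered) / len(answered)
    return missing_pct, lowest_pct, highest_pct


# Hypothetical four-point item, coded 1 (no difficulty) to 4 (could not do it).
item = [1, 1, 1, 2, 4, None, 1, 3, 1, None]
print(item_data_quality(item, lowest=1, highest=4))  # (20.0, 62.5, 12.5)
```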

Tests of scaling assumptions

Internal consistency was assessed by item-total correlation and Cronbach's alpha. Item-total correlation coefficients should meet the 0.40 standard. Cronbach's alpha was considered acceptable for group comparisons when the coefficient exceeded 0.70 [25]. Item discriminant validity was assessed by comparing the correlation between each item and its own domain (item-total) with the correlations between that item and the other domains (item-other). A definite scaling success was recorded when the former was at least two standard errors higher than the latter [26].
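The criteria above can be sketched as follows. Several choices here are assumptions rather than details given in the text: the item-total correlation is computed against the sum of the remaining items in the domain (a correction for overlap), Pearson correlation is used, only complete cases are included, and the standard error of a correlation is approximated by 1/sqrt(n). All function names and data are hypothetical.

```python
import math
from statistics import variance


def pearson(x, y):
    """Pearson correlation between two equally long numeric lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cronbach_alpha(rows):
    """rows: one list of item scores per respondent (complete cases only)."""
    k = len(rows[0])
    item_vars = [variance([r[i] for r in rows]) for i in range(k)]
    total_var = variance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def item_rest_correlation(rows, i):
    """Correlate item i with the sum of the other items in the same domain."""
    item = [r[i] for r in rows]
    rest = [sum(r) - r[i] for r in rows]
    return pearson(item, rest)


def definite_scaling_success(r_own, r_other, n):
    """True if r_own exceeds r_other by at least two standard errors.

    The standard error is approximated as 1 / sqrt(n) (an assumption).
    """
    return r_own - r_other >= 2 / math.sqrt(n)


# Hypothetical three-item domain answered by five respondents.
domain = [[1, 1, 2], [2, 2, 2], [3, 3, 4], [4, 3, 4], [2, 1, 1]]
print(round(cronbach_alpha(domain), 3))
print(round(item_rest_correlation(domain, 0), 3))
print(definite_scaling_success(r_own=0.65, r_other=0.40, n=1600))
```

In this sketch an item meets the internal consistency standard if its item-rest correlation is at least 0.40, and a domain is acceptable for group comparisons if alpha exceeds 0.70.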

Construct validity

We hypothesised that scores from conceptually related domains of NFAS would correlate higher than scores of unrelated domains. We also hypothesised that NFAS scores would correlate higher with conceptually corresponding aspects of the COOP/WONCA, GHQ and Work Ability than with non-corresponding aspects. Correlation coefficients among measures of the same attribute should fall in the midrange of 0.40 – 0.80 [2].

It was hypothesised that those having a disability pension or rehabilitation benefit due to disease, and those reporting being sick-listed during the previous year, would report lower functional ability. We also compared domain scores between those reporting musculoskeletal pain during the last week without mental distress (original GHQ score < 4) and those with mental distress (original GHQ score ≥ 4) but no musculoskeletal pain. It was hypothesised that females, older persons and persons with shorter education would report lower functional ability than males, younger persons and persons with longer education. Since the data are categorical, non-parametric tests for independent samples were used to compare subgroups.
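The text does not name the specific non-parametric tests used. As one common choice for comparing two independent groups, the sketch below implements the Mann-Whitney U test with midranks for ties and a tie-corrected normal approximation, without continuity correction. The choice of test, the function name and the example scores are all assumptions made for illustration.

```python
import math
from collections import Counter


def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided p-value).

    Uses midranks for tied values and a tie-corrected normal approximation,
    which suits large samples with few distinct response categories.
    """
    n1, n2 = len(x), len(y)
    pooled = sorted(list(x) + list(y))
    # Assign each distinct value the average of the ranks it occupies.
    midrank = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        midrank[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    rank_sum_x = sum(midrank[v] for v in x)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    n = n1 + n2
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # all values identical: no evidence of a difference
    z = (u_x - n1 * n2 / 2) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return u_x, p


# Hypothetical item scores for two subgroups (higher = more difficulty).
not_sick_listed = [1, 1, 2, 2, 1, 1]
sick_listed = [3, 4, 3, 2, 4, 3]
u, p = mann_whitney_u(not_sick_listed, sick_listed)
print(u, p < 0.05)  # 1.0 True
```

With samples as small as in this example an exact test would normally be preferred; the normal approximation is shown because it is the form appropriate for subgroup sizes like those in a population survey.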

Results

Sample characteristics

Of the 6108 questionnaires posted, 3325 (54.4%) were returned. The response rate was lower for males (p < 0.001) and for young or very old persons (p < 0.001) (Table 1). The response rates for the two versions were 54.0% for the NFAS-4 and 54.8% for the NFAS-5. Fifty-five participants in birth cohort 1968–70 who were randomised to the NFAS-4 were erroneously mailed the NFAS-5 version. Hence, the subsamples differed significantly regarding age (p < 0.05), but not on any other background variables. Excluding the birth cohort 1968–70 did not affect the results.

Table 1 Response rates by age and gender for the NFAS-4 and the NFAS-5 (N = 3325)

Data quality

For respondents to the NFAS-4 and NFAS-5, there were no missing data for 78.5% and 82.4% respectively. All items had more missing data for the NFAS-4 than NFAS-5 (Table 2). The mean levels of missing data for individual items in the NFAS-4 and NFAS-5 were 3.3% and 2.6% respectively, which was statistically significant (p < 0.01). The same items within both versions had the highest percentage of missing values.

Table 2 Missing data, means and end effects for NFAS-4 and NFAS-5 items (N = 3325)

Item responses were skewed towards no difficulty for both versions (Table 2). The percentage of respondents reporting no difficulty for all 39 items was 33.1% in the NFAS-4 and 30.6% in the NFAS-5. In general, the NFAS-4 items had larger floor and ceiling effects than the NFAS-5 items; some differences were statistically significant (p < 0.05) (Table 2). The third response alternative in the NFAS-4 and the fourth in the NFAS-5 had exactly the same wording, "much difficulty", but the percentage of responses in this category was lower in the NFAS-5 than in the NFAS-4 for 24 items.

Scaling assumptions

All items in both versions met the 0.40 criterion for item-total correlation with the exception of the two items in the "senses" domain in NFAS-4 (Table 3). In all domains, item-total correlation coefficients were higher within the NFAS-5 than within NFAS-4, and this difference was significant for 35 items.

Table 3 Mean item-total correlation and Cronbach's alpha for domain scores in the NFAS-4 and the NFAS-5 (N = 3325)

All items, except four in the NFAS-4 and one in the NFAS-5, met the item-discriminant validity criterion. Cronbach's alpha for two of the NFAS-4 and one of the NFAS-5 domains just failed to meet the 0.70 criterion (Table 3). Cronbach's alphas were significantly higher for NFAS-5 across the first six domains and the total score.

Construct validity

For both versions, scores from conceptually related domains of NFAS correlated higher than scores of unrelated domains (Table 4). The NFAS-5 produced the largest correlations between domains and between domains and total scores, which was significant (p < 0.05) for 15 items and four domains.

Table 4 Correlation between NFAS, COOP/WONCA, GHQ-20 and Work ability for the NFAS-4 and the NFAS-5 (N = 3325)

NFAS scores correlated higher with conceptually corresponding aspects of the COOP/WONCA, GHQ and Work Ability than with non-corresponding aspects for both versions (Table 4). The Sitting and Senses domains had relatively low correlations with these items or scales. The correlation coefficients were similar for the two versions. With only one exception, all the correlations hypothesised as being high were over 0.40, indicating that the same construct was being measured by the NFAS and the external standard.

Both versions discriminated between persons anticipated to report different levels of functional ability, including persons with disability pension or medical rehabilitation, persons reporting sickness absence, and persons with physical versus mental symptoms (Table 5).

Table 5 Domain scores for different groups of the study population for the NFAS-4 and the NFAS-5 (N = 3325)

For both versions, a decline in physical functional ability was significantly associated with increasing age (p < 0.05). With one exception, males reported significantly better functional ability (p < 0.001) for both versions. With the exception of the Senses domain for the NFAS-4, a significant education gradient was found for both versions (p < 0.001).

Applying age-stratified analyses, the results for data quality, scaling assumptions and construct validity remained stable.

Discussion

Both versions demonstrated low levels of missing data and skewed response distribution, but the NFAS-4 had more missing values and larger end effects than NFAS-5. The NFAS-5 demonstrated better internal consistency and item-discriminant validity than the NFAS-4, although the results were acceptable for both versions. All a priori hypotheses were met, which strongly supports the construct validity of the scale for both versions. Both versions discriminated similarly well between groups with different levels of health status and between known groups in the population.

Data quality

The response rates and the low levels of missing data show that both versions of the NFAS are acceptable to the population. A few items had a high percentage of missing values, which is probably because there was no "not applicable" option. The significantly lower level of missing data for the NFAS-5 than for the NFAS-4 is some indication that the respondents found it easier to choose a suitable response from the five-point scale. This finding is supported by Nagata et al. [27], who compared the feasibility of health measurement response scales using four, five and seven categories and a visual analog scale. The level of missing data was lowest, and the responder preference was highest, for the five-point scale version.

Since the NFAS data are skewed towards higher levels of functioning, the larger end effects for NFAS-4 have to be considered when the instrument is used to discriminate between different levels of functioning or to assess changes in functioning over time. It is likely that NFAS-4 will not be as responsive to changes in functioning, simply because it has fewer response options that individuals can use to indicate that their functioning has changed.

It might be anticipated that the response alternative, "much difficulty", along with the two end categories would show similar percentages in the two versions. This was not found. Hence, the responses did not seem to be affected by the wording or anchoring of the response alternatives.

Internal consistency and validity

The internal consistency values were similar to those of widely used instruments including the SF-36 [28, 29–33] and the NHP [15]. Our item-other domain correlation coefficients were comparable with results obtained using the SF-36 in a study including rheumatoid arthritis patients [34] and in a population study [29].

Regarding construct validity, the different time perspectives used in the different scales could influence possible associations, since Work Ability concerns today, the NFAS the last week, and the COOP/WONCA and GHQ the last two weeks. However, all correlation coefficients for the a priori hypotheses met the 0.4 – 0.8 standard. Other studies have obtained similar correlation coefficients between NHP and SF-36 scales [15, 34] or between SF-36 scale scores and comparable item or domain scores from other questionnaires [32, 35]. Regarding the ability to discriminate between groups with different levels of health status, comparable results were found for the SF-36 [30–33, 35]. A gender difference was found in several studies [28, 30–32, 35–37], but not all [33, 38]. The finding of a physical age gradient is supported by several studies [28, 32, 33, 35–38], and an education gradient has also been found in previous research [28, 30, 31, 35, 38].

The NFAS-5 demonstrated somewhat higher internal consistency and item-discriminant validity values compared to the NFAS-4. The majority of this difference could probably be attributed to the fact that the correlation between true values and scale scores increases with the number of response options [12], but it is not known whether this explains the whole difference in correlation coefficient values.

Future applications of the NFAS

The items in the NFAS are derived directly from the activities/participation dimension in the ICF. The ICF uses a five-point scale for its qualifiers and clinical checklists. This supports the use of the NFAS-5. The NFAS-5 had lower levels of missing data than the NFAS-4, which may indicate higher responder acceptability. The NFAS-5 generally performed better than the NFAS-4 in relation to the psychometric tests. Therefore the five-point scale is recommended in future applications of the NFAS. The main drawback in changing to a new response format is that it precludes direct comparisons between previous and new research. However, following our study results, we believe that the evidence supports changing the NFAS response format to a five-point scale.

Strengths and limitations

This study's strengths include the randomised design, the large study sample, the good data quality and the thorough testing of validity against other standards. The moderate response rate and the fact that all data are self-reported represent study limitations. An external, unrelated variable would have strengthened the validity assessment. With the present study design it was not possible to ask the respondents about their preferences [10] or to determine sensitivity to change (the responsiveness of the scale). However, the low mean levels of missing values may indicate acceptability among respondents.

Conclusion

The data quality of the NFAS is high, with acceptable internal consistency and good construct validity. In choosing between the four-point and the five-point scale, it should be noted that while construct validity and discriminative ability are comparable, data quality, internal consistency and item-discriminant validity all suggest that the five-point scale is to be preferred in future applications of the NFAS.