Background

Neck pain is a notable social burden with a high point prevalence (33%) in the adult population, and nearly 70% of people will experience neck pain at some point during their lifetime [4, 7, 8, 12, 16]. Clinical decision-making requires monitoring the treatment effect (improvement or deterioration) from both clinician and patient perspectives. The first patient-reported outcome measure (PROM) to assess pain and disability in people with neck pain, the 10-item version of the Neck Disability Index (NDI-10), was published in 1991 [22]. The NDI-10 is the most studied neck-related PROM and has been cited and applied in more than 300 publications [21]. It has been used widely in surgical treatment, injection therapies, and physical therapy, as well as in exercise and research contexts [15, 16, 21]. Both a systematic review [16] and an overview [3] have examined a large volume of psychometric evidence on the NDI; most studies suggest that the NDI-10 has excellent classical psychometric properties, while a few have raised concerns about its factor structure, item relevance, or scaling. The original version of the NDI-10 has been translated into 22 languages [9, 21].

The NDI-10 was developed as a unidimensional instrument assessing neck disability; unidimensionality is a fundamental requirement for using a single summary score [18, 19, 20]. The NDI-10 contains 10 items: pain intensity, personal care, lifting, reading, headaches, concentration, work, driving, sleeping, and recreation. Each item has 6 response options scored from 0 to 5, where 0 represents the best situation and 5 the worst. Item scores are summed to derive a total score from 0 to 50, with higher scores indicating more severe disability. Multiple items ask about pain and function together, which we consider to be more representative of the construct of pain-related functional interference. Using the problem elicitation technique (PET), others have concluded that the NDI-10 is a multidimensional scale that measures symptoms, impairments, and disabilities (work, recreation) [13].
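The traditional sum scoring described above can be sketched in a few lines. This is a minimal illustration only: the function name, the item labels, and the choice to reject incomplete or out-of-range responses are assumptions made for the example, not rules taken from the NDI manual.

```python
from typing import Sequence

# Item labels follow the order listed in the text; the labels themselves are illustrative.
NDI10_ITEMS = (
    "pain intensity", "personal care", "lifting", "reading", "headaches",
    "concentration", "work", "driving", "sleeping", "recreation",
)


def ndi10_total(responses: Sequence[int]) -> int:
    """Sum ten item responses (each scored 0-5) into the traditional 0-50 total."""
    if len(responses) != len(NDI10_ITEMS):
        # Handling of missing items is not covered here; this sketch simply rejects them.
        raise ValueError("expected exactly 10 item responses")
    for r in responses:
        if r not in range(6):
            raise ValueError(f"item response out of range 0-5: {r}")
    return sum(responses)


example = [2, 1, 3, 2, 4, 1, 2, 3, 2, 1]
print(ndi10_total(example))  # 21
```

Because the total is a sum of ordinal item codes, equal steps in this score do not necessarily correspond to equal steps in disability.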

Previous researchers have examined the NDI-10 using factor analysis, qualitative interviews, and construct analysis under classical test theory (CTT) [14]. Gabel et al. [10] concluded from a confirmatory factor analysis in a homogeneous population with neck pain that the NDI-10 fits a one-factor model. However, others identified 2 factors using principal component analysis [25].

Rasch analysis, which is based on item response theory (IRT), enables examination of unidimensionality and interval-level scaling, and can yield a transformation that converts an ordinal score to an interval scale, which can validate the use of a total sum score [5]. Outcome measures that were not developed using Rasch modelling can be evaluated retrospectively for fit to the Rasch model, which often results in suggested modifications needed to obtain fit. Several studies have inspected the NDI-10 using Rasch analysis and found violations of basic Rasch assumptions [10, 20, 24]. They offered solutions that included exclusion of misfitting items and new coding algorithms. Although modified versions of the NDI have been constructed that are conceptually and statistically sound, uptake has been limited and the traditional NDI-10 is still commonly used. Studies to date have focused on defining modified versions with better measurement properties but have not defined the extent to which scores from these new versions differ from traditional NDI-10 scores outside the development data set. Examining the agreement between traditional and Rasch-based versions of the NDI using Bland-Altman (B&A) plots will inform our understanding of how these scores might differ [1, 2, 17].

Therefore, the objective of the current study was to describe the extent of agreement between different versions of the NDI in a sample of patients attending community clinics for neck pain.

Methods

Study design

The current study was a secondary data analysis in which the study data were compiled from two prospectively collected data sources. Both studies received ethical approval (McMaster Research Ethics Board (MREB) #03–145 and Hamilton Integrated Research Ethics Board (HiREB) #13–300) and all participants provided written, signed consent. Participants presenting with neck pain were recruited from community clinics in Hamilton, ON, Canada, through paper-based and online surveys.

Information source

We performed a comprehensive literature search to identify Rasch analyses of the NDI in four databases: Embase, Medline, PubMed, and Google Scholar. The search keywords were neck disability index, NDI, Rasch analysis, structural validity, and construct validity. The search was limited to publications up to January 2020. Details of the search strategies are presented in Appendix 1.

Study selection

An independent reviewer (ZL) performed the systematic electronic searches in all the databases. ZL also identified and removed duplicate studies, screened the titles and abstracts, and identified the full-text articles. One author (JMacD) randomly reviewed 50% of the articles and discussed any disagreement with the first author to determine final article eligibility.

Acceptable Rasch solutions

We included studies that applied the Rasch model to evaluate the structural validity of the NDI. The score transformation algorithm was obtained if the revised version achieved an acceptable level of model fit as identified by the eligibility criteria. In line with the assumptions of Rasch theory, we defined acceptable fit to the Rasch model as follows:

  1. Unidimensionality was confirmed. For example, in studies using the Rasch analysis software RUMM2030 (Rumm Laboratory, Australia), we used the common criterion that unidimensionality was acceptable if the number of significant tests was less than 5% of the overall paired-sample t-tests [19].

  2. The overall test-fit statistic was examined by the Chi-square test; a non-significant p-value was acceptable.

  3. Where response categories had disordered thresholds, strategies such as collapsing adjacent response options were used as corrective actions, and the rescoring structure was reported and used to calculate revised NDI scores.

  4. There was no differential item functioning (DIF), either uniform or non-uniform, in the revised version.

  5. Local dependency was assessed, and scale amendments were made where appropriate.

  6. An appropriate level of the person separation index was demonstrated (e.g. PSI > 0.7).

Statistical procedures

The scores of the alternate versions were computed. Sample characteristics, including age, sex, and the total score of each included NDI version, were described by mean, standard deviation (SD), median, interquartile range, and minimum and maximum values. We used the Wilcoxon signed rank test for non-parametric comparisons between NDI scores, since the total score of the NDI-10 is computed from an ordinal scale.

Agreement of Rasch solutions

The normality of the differences in all three comparisons was inspected using histograms. Using B&A plots, we summarized the individual agreement between each pair of identified NDI versions by the mean difference and the 95% limits of agreement (LoA; mean difference ± 1.96 times the standard deviation of the differences).
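As a rough illustration of the B&A summary just described, the sketch below computes the mean difference and the 95% LoA from two lists of paired scores. The function name and the example scores are hypothetical; the study itself used SPSS, so this is only a restatement of the arithmetic under the assumption that the sample (n − 1) standard deviation is used.

```python
from statistics import mean, stdev


def bland_altman(a, b):
    """Return (mean difference, lower LoA, upper LoA) for paired scores a and b.

    Differences are taken as a - b, so a negative mean difference means that
    method b scores higher on average than method a.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample SD of the differences (n - 1 denominator)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical paired scores, for illustration only.
reference = [10, 20, 30, 40]
comparison = [12, 21, 33, 44]
bias, lower, upper = bland_altman(reference, comparison)
print(bias)  # -2.5
```

The sign of the mean difference depends only on the subtraction order, so the reference method should be fixed before interpreting it.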

To test the average agreement between each pair of NDI scores, we examined the mean differences with a one-sample t-test [11]. We reported the sample size for each comparison, the degrees of freedom, the mean difference with its p-value and 95% confidence interval (CI), and the standard error of the differences (SE).
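The quantities reported for the one-sample t-test can be derived from the paired differences as sketched below. The function name and the input values are hypothetical, and the sketch stops at the t statistic: the p-value and the 95% CI need a t-distribution, which the Python standard library does not provide, so a statistics package would be required for those.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(diffs):
    """Test statistic for H0: the mean of the paired differences is zero."""
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least 2 differences")
    se = stdev(diffs) / sqrt(n)  # standard error of the mean difference
    if se == 0:
        raise ValueError("differences have zero variance; t is undefined")
    m = mean(diffs)
    return {"n": n, "df": n - 1, "mean_diff": m, "se": se, "t": m / se}


# Hypothetical differences, for illustration only.
result = one_sample_t([-2, -1, -3, -4])
print(result["df"], round(result["t"], 3))  # 3 -3.873
```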

Transformations, including logarithmic and linear transformations, were applied where the bias showed a non-uniform pattern on the plot. For instance, when the B&A plot shows a linear relationship between the differences and the means (the differences start negative and become positive as the mean increases), the differences between the methods (D) can be regressed on the average of the two methods (A) as D = b1 × A + b0. The 95% LoA for the regression are then built on the SD of the residuals (SDres) from the fitted model (predicted D ± 1.96 times SDres) [1].
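The regression-based limits can be sketched as follows. This is a minimal ordinary least squares fit of D on A written with the standard library only; the function name is hypothetical, and the use of n − 2 in the residual SD is an assumption, since the text does not state which denominator was used.

```python
from math import sqrt
from statistics import mean


def regression_loa(a, b):
    """Fit D = b1 * A + b0 and return (b0, b1, sd_res, limits).

    D is the paired difference a - b and A is the pair average.
    limits(avg) returns the 95% LoA at a given average score.
    """
    if len(a) != len(b) or len(a) < 3:
        raise ValueError("need two equally long lists with at least 3 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    avgs = [(x + y) / 2 for x, y in zip(a, b)]
    n = len(diffs)
    a_bar, d_bar = mean(avgs), mean(diffs)
    sxx = sum((v - a_bar) ** 2 for v in avgs)
    if sxx == 0:
        raise ValueError("all averages are identical; slope is undefined")
    b1 = sum((v - a_bar) * (d - d_bar) for v, d in zip(avgs, diffs)) / sxx
    b0 = d_bar - b1 * a_bar
    residuals = [d - (b0 + b1 * v) for v, d in zip(avgs, diffs)]
    # Residual SD with an n - 2 denominator (assumed, not stated in the text).
    sd_res = sqrt(sum(r ** 2 for r in residuals) / (n - 2))

    def limits(avg):
        centre = b0 + b1 * avg
        return centre - 1.96 * sd_res, centre + 1.96 * sd_res

    return b0, b1, sd_res, limits
```

Unlike the conventional LoA, these limits are two sloped lines, so the expected difference between methods changes with the score level.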

All analyses were performed using IBM SPSS Statistics, Version 25.0 (IBM Corporation, Armonk, NY). We considered p ≤ 0.05 to be statistically significant.

Results

Study selection and NDI version identification

Initially, our search yielded 303 publications. After removing duplicates, 296 articles remained. Six studies were selected for full-text review after title and abstract screening. Of these, two Rasch solutions that met the study criteria were identified from two individual studies: an 8-item version of the NDI (NDI-8) developed by Van Der Velde and colleagues [20] based on Rasch criteria, and a 5-item version (NDI-5) developed by Walton and MacDermid [24] based on conceptual and Rasch criteria. This allowed three B&A comparisons (NDI-10 vs. NDI-8, NDI-10 vs. NDI-5, and NDI-8 vs. NDI-5). The flow of studies through the selection process is displayed in Fig. 1.

Fig. 1 Flow diagram of study selection results based on PRISMA guideline

Ordinal score transformation

Three NDI scores were calculated for each participant. The first NDI score was derived from the original ordinal scale (maximum of 50) [21]. We calculated a second set of NDI scores according to the 8-item Rasch solution provided by Van Der Velde and colleagues [20], in which 2 items (headache and lifting) were removed and the ordinal scores were then transformed to a linear score with a maximum value of 50. For the third score, two steps were taken to derive the total score, as recommended in a study that considered both conceptual issues and Rasch findings [24]. First, 5 functional items (personal care, concentration, working, driving, and recreation) were retained for the total score calculation. A rescoring strategy was then used to remedy the disordered thresholds of the driving item [24]. The original response scores (012345) were re-coded by collapsing the fourth and fifth options (012334), while the original structure (012345) was retained for the other 4 items. Therefore, the maximum total score of the 5-item version was 24 on the ordinal scale. This score was transformed into an equivalent score ranging from 0 to 50 to enable direct comparisons [24]. Please see Appendix 2 for a summary of the transformations.
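The NDI-5 rescoring step can be illustrated with a short sketch. It covers only the ordinal part (item selection and the 012345 to 012334 recode of the driving item); the subsequent Rasch-based conversion to the 0–50 metric is summarized in Appendix 2 and is not reproduced here. The dictionary keys and function name are hypothetical labels chosen for the example.

```python
# The five retained items; the key names are illustrative labels.
NDI5_ITEMS = ("personal care", "concentration", "work", "driving", "recreation")

# Driving item only: the fourth and fifth options are collapsed (012345 -> 012334).
DRIVING_RECODE = {0: 0, 1: 1, 2: 2, 3: 3, 4: 3, 5: 4}


def ndi5_raw_total(responses):
    """Ordinal NDI-5 total (0-24) from a mapping of item label to a 0-5 response."""
    total = 0
    for item in NDI5_ITEMS:
        score = responses[item]
        if score not in DRIVING_RECODE:  # the keys double as the valid 0-5 range
            raise ValueError(f"{item}: response out of range 0-5: {score}")
        total += DRIVING_RECODE[score] if item == "driving" else score
    return total


worst = {item: 5 for item in NDI5_ITEMS}
print(ndi5_raw_total(worst))  # 24
```

Four items contribute up to 5 points each and the recoded driving item up to 4, which gives the ordinal maximum of 24 stated in the text.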

Sample

Table 1 describes the demographic information, including age, pain intensity, and total scores of the NDI-10, NDI-8, and NDI-5, stratified by sex. Thirty-one subjects experienced injury- or trauma-related neck pain, including car accidents, sports injuries, and falls. Other conditions leading to neck pain were arthritis, pinched nerves, and disc problems. The normal distribution of the differences in each comparison was confirmed by inspecting the histograms (Figs. 2, 3, and 4). The Wilcoxon signed rank test revealed statistically significant differences between the total scores of each pair of NDI versions (NDI-10 vs. NDI-8, NDI-10 vs. NDI-5, and NDI-8 vs. NDI-5). See Table 2.

Table 1 Demographic characteristic of the sample

Fig. 2 Histogram of the difference comparing NDI 10-item total score with NDI 8-item total score. NDI: neck disability index

Fig. 3 Histogram of the difference comparing NDI 10-item total score with NDI 5-item total score. NDI: neck disability index

Fig. 4 Histogram of the difference comparing NDI 8-item total score with NDI 5-item total score. NDI: neck disability index

Table 2 Bland-Altman statistics and non-parametric comparisons by Wilcoxon signed rank test

Agreement of Rasch solutions

Table 2 presents both the average and individual agreement results for all three comparisons.

Through pairwise comparisons, we identified that the mean difference between the NDI-10 and NDI-5 (−4.6 points) was approximately 10% of the total score, whereas the NDI-10 versus NDI-8 and NDI-8 versus NDI-5 comparisons had similar mean differences of about half that size (−2.3 points). We treated the NDI-10 as the reference method, so the negative mean differences indicate that both the NDI-8 and NDI-5 systematically scored higher than the standard NDI-10. The B&A plots displayed wider 95% LoA for the agreement of the NDI-10 with the NDI-8 (−12.0, 7.4) and with the NDI-5 (−14.9, 5.8) than for the agreement between the NDI-8 and NDI-5 (−7.8, 3.3).

Through visual inspection of the Bland-Altman plot, the bias between the NDI-10 and NDI-8 tended to run in opposite directions at different points in the scale range: negative differences predominated at the lower end (scores below 20) and positive differences predominated at the higher end (scores between 20 and 40). A similar trend was identified in the comparison between the NDI-10 and NDI-5. However, such patterns were not present in the plot comparing the NDI-8 with the NDI-5. Please see Figs. 5, 6, and 7.

Fig. 5 Bland–Altman plots displaying 95% LoA in pair-wise comparison between NDI 10-item with NDI 8-item version. LoA: limits of agreement. NDI: neck disability index

Fig. 6 Bland–Altman plots displaying 95% LoA in pair-wise comparison between NDI 10-item with NDI 5-item version. LoA: limits of agreement. NDI: neck disability index

Fig. 7 Bland–Altman plots displaying 95% LoA in pair-wise comparison between NDI 8-item with NDI 5-item version. LoA: limits of agreement. NDI: neck disability index

The linear relationship on the B&A plot comparing the NDI-8 with the NDI-5 was confirmed by the simple linear regression equation D = −0.2 × A + 2.2, with a significant p-value for the overall model and the regression coefficient (p < 0.001) [1]. We then plotted the 95% LoA based on the SDres from the regression model, which was equal to 2.4. The new upper and lower limits were constructed as D = −0.2 × A + 2.189 ± 1.96 × 2.4. See Fig. 8.
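To make the regression-based limits concrete, the sketch below evaluates the reported equation at a chosen average score. The function name is hypothetical, and because the slope is reported to only one decimal place in the text, the output is approximate.

```python
def ndi8_ndi5_limits(average):
    """95% LoA for the NDI-8 vs. NDI-5 difference at a given average score.

    Uses the coefficients reported in the text: D = -0.2 * A + 2.189 and SDres = 2.4.
    """
    centre = -0.2 * average + 2.189
    half_width = 1.96 * 2.4
    return centre - half_width, centre + half_width


lower, upper = ndi8_ndi5_limits(20)
print(round(lower, 1), round(upper, 1))  # -6.5 2.9
```

At an average score of 20, the predicted difference is about −1.8 points, and the limits span roughly 4.7 points on either side of it.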

Fig. 8 Bland–Altman plots displaying 95% LoA in regression between NDI 8-item with NDI 5-item version as this varies across the range of the scores. LoA: limits of agreement. NDI: neck disability index

Discussion

We identified two Rasch-approved versions of the NDI (NDI-8 and NDI-5) through a comprehensive literature review and, using B&A plot analysis, revealed disagreement between the scores of the traditional NDI-10 and those of the NDI-8 and NDI-5 [11, 20, 24]. Significant differences between versions were also identified in the non-parametric group comparisons. The wide 95% LoA surrounding the point estimate of agreement threaten the interchangeable use of the different versions. When the traditional NDI-10 was compared with the 8-item Rasch-approved version, the differences ranged from −12.0 to 7.4 units, or nearly 15 to 25% of the total score. This is important for a 50-unit measure, since a change of 9 units would significantly influence the classification of the disability level [21]. For example, a participant who scored 20 on the traditional NDI-10 would be considered to have a moderate level of neck disability. However, with differences ranging from −12.0 to 7.4 units, the LoA suggest that the same participant's score on a Rasch version might fall within the mild or severe level. This reflects the extent of misclassification that might occur on the basis of scoring alone. The discrepancy was even larger, about 30% of the total score (lower limit −14.9), when the NDI-10 was compared with the NDI-5. The differences between the NDI-8 and NDI-5 were uniform after linear transformation and were smaller than the discordance between the traditional and Rasch-scored versions, with a mean variation of 4.7 units (10% of the total score). This smaller difference likely reflects some benefits of a Rasch approach, but also some differences related to the number of items included. Even this smaller error suggests that these measures cannot be used interchangeably. An advantage of the NDI-8 is that its 8 items may exhibit more range or stability than a 5-item version. Conversely, the NDI-5 is more focused conceptually, since it concentrates on function, and it reduces respondent burden. Head-to-head comparisons of how these two versions perform in measuring clinical outcomes over time are needed to evaluate their relative utility.

The unstable variance in the error patterns on the B&A plots was problematic for comparisons across versions, even though the mean differences were small (−2.3 and −4.6). On visual inspection, the direction of bias reversed as the scores approached 20 points, approximately mid-range. Attempts with both logarithmic and linear transformations failed to normalize the bias pattern. The more extreme bias displayed at the upper and lower ends of the scale reflects the ordinal nature of the original 0–50 score, whereas the NDI-5 and NDI-8 have been linearly converted through the Rasch analytic process. This may explain why similar patterns were observed for the NDI-10 vs. NDI-8 and NDI-10 vs. NDI-5 comparisons, but a different pattern was shown for the NDI-8 vs. NDI-5 comparison. Our data further illustrate that the original ordinal scale ranging from 0 to 50 should not be used in parametric statistical analyses, because it violates interval-level scaling.

The differences between the NDI-8 and NDI-5 could be due to variations in the retained items, both in their content and in the associated ‘difficulty’ level of the items. Firstly, fewer items are likely to cover a narrower measurement range, and therefore the scale may be ‘stretched out’ when converted back to a 0–50 score. Secondly, the differences between the NDI-8 and NDI-5 may have been driven by methodologic differences in how the analyses were performed. In the NDI-8, two items (headache and lifting) were deleted based on Rasch findings, driven by the goal of achieving optimal model fit [20]. For the 5-item version, the authors conducted a 2-stage process, first deleting items for conceptual reasons and then proceeding to a Rasch analysis. The conceptual framework of the International Classification of Functioning, Disability and Health (ICF) was used to refine the item pool to those items that fit within the disability construct; symptom-based items such as pain intensity were removed at this stage [24]. The retention of symptoms in the NDI-8 and their exclusion from the NDI-5 might explain the small systematic errors between the two Rasch-based versions. Researchers might select between these two versions based on these conceptual issues. For example, the NDI-8 evaluates neck disability with regard to pain intensity, sleeping, and reading. Conversely, the NDI-5 focuses on function and would require that pain be measured with a separate standardized measure, since pain is clearly an important issue for people suffering from neck pain. The NDI-5 might allow a clearer distinction between pain and function constructs, but the point at which measures become too short is not clear. Our qualitative work with patients with neck pain suggested that patients want comprehensive consideration of the broad array of life impacts that result from neck pain [23].

Finally, the criteria for an acceptable level of local independence have been updated over time, which may lead to variation in the construction of Rasch-approved models, since the examination of local independence is an important test of assumptions under Rasch modelling. Van Der Velde et al. [20] treated a residual correlation coefficient larger than 0.3 as confirming the presence of local dependency (LD), whereas Walton and MacDermid [24] adopted the criterion of a residual correlation 0.2 above the average residual correlation, rather than the fixed cut-off of 0.3 [6, 20, 24]. These methodologic differences may have affected the final versions defined by the authors.

Despite the differences between versions of the NDI and the concerns about the scoring of the full NDI, a benefit of the complete 10-item version is that its score can be transformed into either modified version, whereas this is not possible if only the 5-item or 8-item version is administered [20, 24].

Strengths & limitations

The literature review only examined studies published in the English language, which may limit the identification of other potential Rasch solutions of NDI. The study sample was recruited from community clinics in a single city in Canada which restricts the generalizability of study findings.

Implications

Rasch-based scoring may improve the validity and interpretability of the NDI. Future studies should examine other clinical measurement properties, particularly responsiveness, in a head-to-head comparison of the NDI-8 and NDI-5 to help users select between the two versions.

Conclusion

The traditional NDI-10 should not be used interchangeably with either of the two Rasch-approved shorter versions. The conceptual difference between the NDI-5 and NDI-8 should be considered when choosing between them.