Validation of the work ability index—single item and the pain disability index—work item in patients with chronic low back pain

A cross-sectional and longitudinal study was conducted to analyse construct validity, responsiveness, and Minimal Clinically Important Change (MCIC) of the Work Ability Score (WAS) and Pain Disability Index Work item (PDI-W) in patients with Chronic Low Back Pain (CLBP). Construct validity was assessed by testing predefined hypotheses. Responsiveness and MCIC were measured with an anchor-based method. The Area Under the receiver operating characteristic Curve (AUC) and the optimal cut-off point were calculated. Smallest Detectable Change (SDC) was calculated to determine measurement error. In total, 1502 patients (age 18–65 years) with CLBP were included. For validity of the WAS and PDI-W, respectively, seven and six out of 10 hypotheses were not rejected. The WAS (n = 355) was responsive to change with an AUC of 0.70. MCIC was 1.5 points, SDC individual 4.9, and SDC group 0.3. MCICs were 4.5, 1.5, and −0.5 points for, respectively, low, middle, and high scoring baseline groups. The PDI-W (n = 297) was responsive to change with an AUC of 0.80. MCIC was −2.5 points, SDC individual 5.2, and SDC group 0.3. MCICs were −0.5, −2.5, and −4.5 points for, respectively, low, middle, and high scoring baseline groups. Construct validity of the WAS and PDI-W was insufficient in this patient sample. The WAS and PDI-W are responsive to change. On average, improvements of 1.5 points (WAS) and −2.5 points (PDI-W) were interpreted as clinically important. However, MCICs are also baseline dependent. Because of the risk of measurement error, change scores at the individual level should be interpreted with caution.


Introduction
The prevalence of disability due to Low Back Pain (LBP) increases from the third decade of life onwards, peaking between the ages of 35 and 55 years [1]. LBP causes a large amount of absenteeism and lost work productivity [2]. This makes LBP the most common health problem in the European workforce. In the Netherlands, costs of LBP have been estimated at 1.7% of its Gross National Product [3].
The greatest potential for cost reduction is decreasing work absenteeism and disability due to LBP [4]. Absenteeism and disability at work are influenced by the work ability of a person [5]. Higher work ability is associated with less disability and pain, and higher quality of life [6]. The Work Ability Index (WAI) was developed as a measure of self-reported work ability. The Work Ability Score (WAS) is an item of the WAI and compares current work ability with lifetime best [7]. It is an acceptable brief alternative for the WAI in determining work ability [8]. Convergent validity between the WAI and WAS is sufficient [9]. Measurement properties are sufficient in a secondary vocational rehabilitation setting [10], but have not been analysed in secondary and tertiary spine care. The interference of chronic pain with daily activities can be assessed by the Pain Disability Index (PDI). The PDI has been validated in patients with chronic pain [11]. The PDI Work item (PDI-W) measures interference of chronic pain with the ability to engage in occupational activities. The PDI-W has not yet been validated either.
The WAS and PDI-W are Patient Reported Outcome Measures (PROMs). PROMs are highly recommended in clinical guidelines to assess the quality of care, treatment effects and change in health status from the patient's perspective. Selection of PROMs should be based on the strength of relevant measurement characteristics (i.e. validity, responsiveness) [12]. To meet conditions for construct validity, a measurement instrument should be consistent with hypotheses regarding relationships with other measures. The ability to detect changes in health status within individuals over time (responsiveness) and interpretation of change scores are important characteristics of PROMs [13]. Minimal Clinically Important Change (MCIC) and measurement error (Smallest Detectable Change, SDC) can be used to interpret change scores. The MCIC is useful as this change score is perceived as beneficial and meaningful to patients [14].
Despite the usefulness of PROMs, these measurements can be a burden for patients and caregivers. Length of time to fill out the questionnaires, difficulty in completing them independently, and length of time to analyse the results were the most frequently mentioned reasons for not using the measurements [15]. Therefore, if measurement characteristics are sufficient, the WAS and PDI-W single items may be used in routine care instead of lengthy questionnaires. The aim of the present study was to assess construct validity, responsiveness, and MCIC of the WAS and PDI-W in patients with CLBP.

Study design
Data collection took place in the Groningen Spine Center, a university-based multidisciplinary secondary and tertiary care center in the Netherlands, as part of a research project called Groningen Spine Cohort (GSC) [16]. Patients were included from July 2015 to September 2018. Patients digitally filled out a set of questionnaires at baseline (T0) and 6 months follow-up (T1), including the WAS, PDI-W, a modified version of the short Copenhagen Psychosocial Questionnaire (COPSOQ II), the EuroQol-5D (EQ5D), the National Institutes of Health (NIH) Impact Stratification, and a Global Perceived Effect (GPE) scale. The Medical Ethical Committee of the University Medical Center Groningen provided a waiver (M15.169472) for the data collection of the GSC with respect to medical ethical permission. Informed consent was signed after being informed on the purpose of the study. Handling of the data was done in accordance with the guideline for Good Research Practice [17].

Patients and setting
Patients between 18 and 65 years old who were admitted to the Groningen Spine Center and had experienced lower back and/or leg pain for more than 12 weeks were included. Patients with insufficient understanding of the Dutch language or no Internet access were excluded. Patients who did not respond to follow-up questionnaires were excluded from the assessment of longitudinal validity. Retired and permanently disabled patients, and stay-at-home parents, were also excluded from the analyses, because improvement of work ability was not a treatment goal for these patients. Care as usual was provided between baseline and follow-up. Treatment options were minimal intervention (pain education and information), referral for primary care treatment (e.g. physiotherapy), multidisciplinary rehabilitation, surgery, pain anaesthesiology treatment, and/or other if needed (e.g. referral to an additional medical specialist) [16].

Primary measures
The WAS measures current work ability compared to lifetime best. Scores range from 0 (completely unable to work) to 10 (work ability at best period). Scores are classified as poor (0-5 points), moderate (6-7 points), good (8-9 points), and excellent (10 points) work ability. The WAS has been demonstrated to be a good brief alternative for the lengthy WAI in determining work ability [8]. Convergent validity is acceptable between the WAI and WAS [9]. Measurement properties in a secondary rehabilitation setting were sufficient [10]. The PDI-W measures self-reported work disability, including paying and non-paying jobs. The wording of PDI-W is "This category refers to activities that are part of or directly related to one's job. This includes non-paying jobs as well, such as that of a housewife or volunteer". Patients rate their level of pain interference on each item on a scale ranging from 0 (no pain interference) to 10 (total pain interference). Patients can also choose the option 'not applicable'. The PDI-W is an item of the PDI, which consists of 7 items. The PDI is internally consistent (Cronbach's alpha = 0.87) with good convergent validity. Test-retest reliability is good (Intraclass Correlation Coefficient [ICC] = 0.76) [11]. Measurement properties of the PDI-W are unknown.

Secondary measures
Absenteeism was measured for employed patients by calculating the percentage of work hours missed in the past month (0 to 100%). Missing data on weekly work hours (in 14% of employed patients) and weekly work days (in 7% of employed patients) were handled with multiple imputation. Constraints were set on the minimum and maximum number of work hours (between 2 and 40 h) and work days (between 1 and 5) per week.
Quality of life was measured with the EQ5D, consisting of five dimensions: mobility, self-care, activities of daily living, pain/discomfort, and anxiety/depression. These dimensions are measured on three levels (no problems, some problems, extreme problems) [18]. EQ5D scores can be transformed into a utility value by means of a validated algorithm, ranging from -0.33 to 1.00 [19]. Reliability and validity of the EQ5D are acceptable [20].
Psychosocial work environment was measured with the COPSOQ II, which consists of 25 items covering the domains work demands, work organization and job contents, leadership and interpersonal relationships, work-individual interface, and values at the workplace. Test-retest reliability is adequate to good [21]. The scale considering work demands was used in the present study (6 questions, 0-4, 4 being highest self-reported work demands), consisting of items concerning work pace, quantitative and emotional demands. The NIH Impact Stratification measures self-reported impact of pain on functioning. The Impact Stratification total score is calculated from nine items: a Numeric Rating Scale (NRS) of pain intensity (range 0-10), four items on physical functioning (range 1-5), and four items on pain interference (range 1-5). The total score ranges from 8 (least impact) to 50 (most impact). The impact is classified as mild (8-27 points), moderate (28-34 points), or severe (≥ 35 points) [22]. Test-retest reliability is sufficient and longitudinal validity is adequate [23,24]. The NRS pain score was also used as a single item in the present study.
For work status, patients answered the question whether they were currently employed. If yes, patients were asked about the status of the employment (working, sick leave, partial sick leave). In addition, an item of the NIH minimal dataset concerning permanent disablement was used [22].
The GPE was used as external criterion. The GPE measures patient-rated assessment of treatment outcome [25]. The question "How much did your treated complaints change compared with pre-treatment level?" was answered at 6 months follow-up (T1). Responses range from 0 to 6 on a 7-point Likert scale: 0, extremely worsened; 1, much worsened; 2, little worsened; 3, unchanged; 4, little improved; 5, much improved; and 6, completely improved. Strong correlations are reported between GPE scores and changes in disability and pain [26]. There is limited evidence for validity of the GPE scale [27]. Nevertheless, the GPE is recommended and considered a suitable anchor [28].

Data analyses
Patient characteristics were reported using descriptive statistics. Continuous data are presented as means and standard deviations or medians and Interquartile Range (IQR), depending on data distribution. Categorical data are reported as frequencies with percentages. P < 0.05 was considered statistically significant. Statistics were performed using IBM SPSS for Windows (version 24.0, Armonk, NY: IBM Corp.).
Construct validity was assessed using hypothesis testing by calculating Pearson or Spearman's rank correlation coefficients between the WAS and PDI-W on the one hand, and absenteeism, EQ5D, COPSOQ II work demands, NIH Impact Stratification, NRS pain score, and items concerning work status on the other hand. The strength of the association is expressed as the absolute value of the correlation coefficient. Correlation coefficients were interpreted as follows: 0.00-0.25 little if any, 0.26-0.49 weak, 0.50-0.69 moderate, 0.70-0.89 strong, and 0.90-1.00 very strong correlation [29]. Hypotheses were formulated based on previous studies. Weak correlations of the WAS and PDI-W with work demands [30], partial sick leave and permanent disablement [31,32] were expected. We expected weak to moderate correlations of the WAS and PDI-W with absenteeism [5], NRS pain score [32], working [33], and sick leave [31,32]. Finally, we expected moderate correlations of the WAS and PDI-W with quality of life [34] and impact of pain on functioning [35]. Correlations of the WAS and PDI-W with absenteeism and work demands were calculated for the 'employed' patient group (n = 901). Because a floor or ceiling effect of maximally 15% is considered acceptable [36], the hypothesis on floor and ceiling effects was not rejected if < 15% of patients achieved the lowest or highest possible score (0 or 10, respectively). Construct validity was considered supported when ≥ 80% of the predefined hypotheses were not rejected [36].
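The decision rules in this paragraph can be illustrated with a minimal Python sketch. The function names are hypothetical and are not taken from the study, which used SPSS. The sketch assumes that correlation coefficients are rounded to two decimals before being assigned to a band, and that a floor or ceiling effect is present when at least 15% of patients score the lowest or highest value.

```python
def correlation_strength(r):
    """Map a correlation coefficient to the descriptive bands quoted in the text."""
    a = round(abs(r), 2)  # assumption: bands are applied to two-decimal values
    if a > 1.0:
        raise ValueError("correlation must lie between -1 and 1")
    if a <= 0.25:
        return "little if any"
    if a <= 0.49:
        return "weak"
    if a <= 0.69:
        return "moderate"
    if a <= 0.89:
        return "strong"
    return "very strong"


def floor_or_ceiling_effect(scores, lowest=0, highest=10, limit=0.15):
    """True if at least `limit` of patients score the lowest or the highest value."""
    n = len(scores)
    floor_share = sum(1 for s in scores if s == lowest) / n
    ceiling_share = sum(1 for s in scores if s == highest) / n
    return floor_share >= limit or ceiling_share >= limit


def construct_validity_supported(n_not_rejected, n_hypotheses, threshold=0.80):
    """True if the share of hypotheses that were not rejected reaches the threshold."""
    return n_not_rejected / n_hypotheses >= threshold
```

Under these rules, 7 of 10 or 6 of 10 non-rejected hypotheses do not reach the 80% threshold, whereas 8 of 10 would.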
For hypothesis testing, we used the COSMIN Study Design checklist for Patient reported outcome measurement instruments [37]. The section 'Hypotheses testing for construct validity' includes the following requirement: "Perform the analysis in a sample with an appropriate number of patients (taking into account expected number of missing values)". A patient sample of ≥ 100 patients is considered "very good". The patient samples for hypothesis testing in the present study varied from 884 to 1502 patients. Therefore, we did not include a sample size and power calculation.
Responsiveness and MCIC were calculated in accordance with the Consensus-based Standards for the Selection of Health Measurement Instruments (COSMIN) criteria [13,37]. Two categories were defined: improved (i.e. much improved and completely improved) and unimproved (all other scores). The area under the Receiver Operating Characteristics (ROC) curve (AUC) was calculated with a 95% Confidence Interval (CI). AUCs were interpreted as excellent 1.0-0.90, good 0.90-0.80, fair 0.80-0.70, poor 0.70-0.60, and failed 0.60-0.50 [38]. Therefore, an AUC ≥ 0.70 was considered responsive. The Optimal Cut-off Point (OCP) of the ROC curve was determined to calculate the MCIC. This is the point closest to the top-left corner of the ROC curve, which maximizes the proportion of correctly classified patients according to the GPE. The sum of squares of 1-sensitivity and 1-specificity is minimal at this point [39]. Cut-off values depend on the baseline level of a measurement [40]. To take the effect of baseline scores into account, secondary analyses were performed measuring responsiveness and MCICs for different baseline-score groups. Three subgroups were formed based on tertiles of baseline scores on the WAS and PDI-W. In addition, a percentage change score was calculated for the PDI-W to take the effect of baseline scores into account. For the WAS, an inverted percentage change score was calculated, since calculation of a percentage change score resulted in a high frequency of missing values due to division by zero. The inverted percentage change score was calculated by the following equation:

$$\text{Inverted percentage change score} = \frac{\text{change score}}{\text{maximum score} - \text{baseline score}} \times 100 \tag{1}$$

The Standard Error of Measurement (SEM) and Smallest Detectable Change (SDC) were calculated to determine measurement error. The SEM represents the standard deviation of repeated measurements in one patient. The SEM was calculated as the square root of the variability between baseline and 6 months follow-up added to the variability due to random error (Eq. 2) in patients reporting unimproved scores on the GPE [41]:

$$\mathrm{SEM} = \sqrt{\sigma^2_{\text{time}} + \sigma^2_{\text{error}}} \tag{2}$$

The SPSS VARCOMP analysis command was used to retrieve the variance component values.
The SDC represents the minimal change in score that a patient has to show to ensure that the observed change is real and not caused by measurement error. To distinguish clinically important change from measurement error, the SDC should be smaller than the MCIC. The SDC can be calculated by the following equations on individual (Eq. 3) and group level (Eq. 4), in which 1.96 refers to a 95% CI, √2 is a correction for repeated measurements, and n is the number of patients [41,42]:

$$\mathrm{SDC}_{\text{individual}} = 1.96 \times \sqrt{2} \times \mathrm{SEM} \tag{3}$$

$$\mathrm{SDC}_{\text{group}} = \frac{\mathrm{SDC}_{\text{individual}}}{\sqrt{n}} \tag{4}$$
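The anchor-based procedure described above (dichotomized GPE, AUC, and the cut-off closest to the top-left corner of the ROC curve) can be sketched in Python as follows. This is an illustration only, not the SPSS procedure used in the study. The function names and the toy data are hypothetical, candidate cut-offs are assumed to lie midway between adjacent observed change scores, and a higher change score is assumed to indicate improvement (for a scale such as the PDI-W, where lower is better, the change scores would be negated first).

```python
def auc(changes, improved):
    """AUC as the probability that an improved patient has a larger change
    score than an unimproved patient (ties count as one half)."""
    pos = [c for c, i in zip(changes, improved) if i]
    neg = [c for c, i in zip(changes, improved) if not i]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def optimal_cutoff(changes, improved):
    """Return (cut-off, sensitivity, specificity) for the candidate cut-off
    that minimizes (1 - sensitivity)^2 + (1 - specificity)^2."""
    values = sorted(set(changes))
    # Assumption: candidate cut-offs are midpoints between observed scores.
    candidates = [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]
    n_pos = sum(1 for i in improved if i)
    n_neg = len(improved) - n_pos
    best = None
    for cut in candidates:
        tp = sum(1 for c, i in zip(changes, improved) if i and c > cut)
        tn = sum(1 for c, i in zip(changes, improved) if not i and c < cut)
        sens, spec = tp / n_pos, tn / n_neg
        dist = (1 - sens) ** 2 + (1 - spec) ** 2
        if best is None or dist < best[0]:
            best = (dist, cut, sens, spec)
    return best[1:]


# Hypothetical change scores (T1 - T0) and dichotomized GPE anchor.
changes = [-1, 0, 0, 1, 1, 2, 2, 3, 3, 4]
improved = [False, False, False, False, True, False, True, True, True, True]

print(auc(changes, improved))
print(optimal_cutoff(changes, improved))
```

With integer change scores, midpoint candidates yield half-point cut-offs, which is one possible reason why MCIC values are reported on a 0.5-point grid.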

Patients
Baseline data were available for 1502 patients. Demographic and clinical variables are presented in Table 1. The median of the WAS was 4.0 (IQR 1.0-6.0) points, and the median of the PDI-W was 7.0 (IQR 5.0-8.0) points.

Hypothesis testing
For the WAS, 7 of 10 hypotheses were not rejected (70%) (Table 2). For the PDI-W, 6 of 10 hypotheses were not rejected (60%). Therefore, construct validity of the WAS and PDI-W was not supported.

Responsiveness and minimal clinically important change
The WAS showed fair responsiveness with an AUC of 0.70 (0.63-0.76). The PDI-W showed good responsiveness with an AUC of 0.80 (0.74-0.87) (Table 3; Fig. 1). The MCIC was 1.5 points for the WAS with a sensitivity of 0.67 and specificity of 0.61. The MCIC was −2.5 points for the PDI-W with a sensitivity of 0.74 and specificity of 0.75.

Baseline-score groups
For the PDI-W, MCICs were −0.5, −2.5, and −4.5 points for, respectively, low, middle, and high scoring baseline groups (Table 4). The MCIC of the percentage change score was 56%. For the WAS, MCICs were 4.5, 1.5, and −0.5 points for, respectively, low, middle, and high scoring baseline groups (Table 5). The MCIC of the inverted percentage change score was 38%.
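The two relative change scores can be illustrated with a short Python sketch. The inverted percentage change follows the definition given with Table 5, (change score / (max score - baseline score)) × 100. The ordinary percentage change is assumed here to be the change score divided by the baseline score, which is the usual definition but is not spelled out in the text. Function names are hypothetical, and the sign convention (follow-up minus baseline) is likewise an assumption.

```python
def percentage_change(baseline, follow_up):
    """Change relative to the baseline score; undefined when baseline is 0."""
    if baseline == 0:
        return None  # division by zero, e.g. a WAS baseline score of 0
    return (follow_up - baseline) / baseline * 100


def inverted_percentage_change(baseline, follow_up, max_score=10):
    """Change relative to the remaining room for improvement up to max_score;
    undefined when the baseline is already at the maximum."""
    room = max_score - baseline
    if room == 0:
        return None
    return (follow_up - baseline) / room * 100


# A WAS baseline of 0 has no percentage change but a valid inverted score.
print(percentage_change(0, 3))           # None
print(inverted_percentage_change(0, 3))  # 3 of 10 possible points gained
print(inverted_percentage_change(4, 7))  # 3 of 6 possible points gained
```

The sketch shows why the inverted form suits the WAS: patients who start at 0 would drop out of an ordinary percentage change calculation, but remain in the inverted one.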

Measurement error
The SEM was 1.9 points for the PDI-W with an SDC individual of 5.2 and an SDC group of 0.3 points. The SEM for the WAS was 1.8 points with an SDC individual of 4.9 and an SDC group of 0.3 points.
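As a rough check on these figures, the relationship between the SEM and the SDC can be sketched in Python. The individual-level formula uses the factors named in the Data analyses section (1.96 for a 95% CI and √2 for repeated measurements). The group-level formula, which divides by √n, is the form commonly used in the measurement literature and is an assumption here, as are the function names and the sample size in the example.

```python
import math


def sem_from_variance_components(var_time, var_error):
    """SEM as the square root of the summed variance components."""
    return math.sqrt(var_time + var_error)


def sdc_individual(sem):
    """Smallest detectable change for an individual patient (95% level)."""
    return 1.96 * math.sqrt(2) * sem


def sdc_group(sem, n):
    """Smallest detectable change for a group of n patients (assumed formula)."""
    return sdc_individual(sem) / math.sqrt(n)


def change_exceeds_measurement_error(change, sem):
    """True if an individual change score is larger than the SDC individual."""
    return abs(change) > sdc_individual(sem)


print(round(sdc_individual(1.8), 2))  # WAS, using the rounded SEM
print(round(sdc_individual(1.9), 2))  # PDI-W, using the rounded SEM
print(change_exceeds_measurement_error(1.5, 1.8))
```

Using the rounded SEM values gives results slightly above the reported SDC individual values, presumably because the reported values were derived from unrounded SEMs. Either way, a change equal to the overall MCIC (1.5 or −2.5 points) stays below the SDC individual.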

Discussion
The aim of this study was to assess construct validity, responsiveness, and MCIC of the WAS and PDI-W in patients with CLBP. For the WAS and PDI-W, respectively, 70% and 60% of predefined hypotheses were not rejected, which is lower than the threshold that was set at ≥ 80%. Therefore, construct validity was not supported. The WAS and PDI-W are responsive to change. MCICs of 1.5 points (WAS) and −2.5 points (PDI-W) were found. Nevertheless, clinically important change could not be distinguished from measurement error, since MCICs were smaller than SDC values. Individual change scores up to 5 points should be interpreted with caution.
For construct validity, the rejection of more hypotheses than expected may have several explanations. For the WAS, measurement scales of reference instruments might have contributed to the rejection of hypotheses. The WAS asks patients to compare current work ability to lifetime best, whereas reference instruments only ask for current functioning. Consequently, loss of functioning might have been scored differently, resulting in lower correlations. Additionally, the work demands hypothesis focused on work pace, emotional and quantitative demands (i.e. evaluation of time available to finish work), because these questions were classified as work demands by the COPSOQ II questionnaire. In retrospect, this construct could have been expanded by inclusion of physical demands, commitment to work, and job satisfaction. These factors are considered important in predicting work (dis)ability [30]. Hypotheses on partial sick leave were rejected. Scores on the WAS and PDI-W for patients on sick leave or fully working were heavily skewed to the lower or higher end of the scale, whereas scores for patients on partial sick leave were normally distributed with a high variance. Therefore, we observed lower correlations than hypothesized a priori. Finally, in hindsight, permanent disablement was an insufficient reference test. The majority of permanently disabled patients scored the PDI-W as 'not applicable', because this item was irrelevant to these patients.
Floor (WAS; 25%) and ceiling (PDI-W; 15%) effects were also observed, both indicating the most severe interference of LBP. These effects might have affected correlations with reference tests. Data were collected from patients receiving secondary and tertiary multispecialty care. Consumption of medical care and the influence of LBP on work ability are higher in this patient sample, compared to patients receiving primary-level care [16]. Therefore, the WAS and PDI-W might not be adequate instruments for distinguishing work ability levels in patients with severe CLBP. Further research should investigate the validity of these items in patients receiving primary-level care.
Regarding longitudinal validity, measurement error should be considered in decision-making in individual patients. The SDC individual for the WAS (4.9 points) and PDI-W (5.2 points) both exceeded the MCIC values (respectively, 1.5 and −2.5 points). This corresponds with results of previous research on PROMs in back pain [11,43]. Individual change scores larger than the MCIC but smaller than the SDC individual should be interpreted with caution. These scores fall within the measurement error, which results in the risk of incorrect classification of patients as improved. The WAS and PDI-W are better at detecting changes at a group level, as the SDC group was smaller than the SDC individual. Results of the present study indicate that small changes in work ability can be considered important by CLBP patients. Because CLBP is very disabling [7,8], a small improvement can have a meaningful effect on the wellbeing of patients. For interpretation of individual change scores, the effect of baseline scores should be taken into account [40]. Higher (PDI-W) or lower (WAS) baseline values (both indicating worse work ability) require higher MCIC values, because there is a greater potential for improvement [44]. The results of the present study confirm that MCICs for the WAS and PDI-W are baseline dependent. This is supported by the (inverted) percentage change scores, which were 39% (WAS) and 56% (PDI-W).
Patient burden is an important consideration in selecting measurement instruments. If patient burden is decreased by using single items instead of lengthy questionnaires, then slightly less sufficient measurement characteristics might be acceptable. This may be the case, for example, when patients have to fill out multiple questionnaires, or when work ability trends are assessed in frequent evaluations (e.g. daily or weekly). In addition, the WAS can be considered for use at group level and in large-scale surveys [45]. The WAS is also suitable for systematic application during medical examinations in occupational health care or in public health surveys [9].
A methodological consideration is the dichotomization of the external criterion into improved and unimproved patient groups. The improved group consisted of patients reporting to be much improved and completely improved. Previous research has also assigned little improved patients to the improved group [46]. However, other research stated that little improvement is in the range of natural fluctuation [47]. When little improved patients are considered improved, accuracy to differentiate between improved and unimproved patients decreases [44]. In order to better reflect the concept of meaningful improvement, little improved patients were not classified as improved.

Table 2 Correlations between the WAS (n = 1502) and PDI-W (n = 1381) and reference tests, and floor and ceiling effects
r = correlation coefficient, absolute value. The direction of the association depends on the scoring of the reference test. WAS, work ability score; n, number of patients; PDI-W, pain disability index work item; EQ5D, EuroQol-5D; COPSOQ II, Copenhagen Psychosocial Questionnaire modified version; NIH, National Institutes of Health; NRS, numeric rating scale. * p = .04. All other correlations were significant (p < .01). a n = 884-901

In addition, the patient sample was chosen based on relevance, because increasing work ability is not a treatment goal for all patients. Therefore, retired and permanently disabled patients, and stay at home parents were excluded. We included these patients in a sensitivity analysis to test the accuracy of the applied relevance criterion. Results reported the same MCICs for total group and baseline score groups as reported for the patient sample selected on relevance. Only the PDI-W percentage change score differed, which was 41% instead of 56%. The PDI-W takes unpaid work into account, including that of housework or volunteer. This is also carried out by the excluded patients. It is possible that small improvements in interference of pain with unpaid work are considered important, resulting in a lower percentage change score.
The effect of treatment should also be included in future research. For measurement of responsiveness, it is required to ensure that a proportion of patients is likely to change [37]. Because it was not known how many patients received treatment, when, and of what type, it would normally be difficult to predict whether a proportion of patients was likely to change within the utilized time interval of one year between baseline and follow-up. However, previous studies on patients from the GSC have shown that approximately a third shows clinically relevant improvement on measures of disability and impact of LBP one year after baseline measurement [16,48]. Therefore, we expected a similar proportion of our patient sample to improve on work ability during follow-up.

Conclusion
Based on predefined hypotheses, construct validity of the WAS and PDI-W was not supported in this patient sample. The WAS and PDI-W are responsive to change in CLBP patients.

Table 5 Responsiveness and Minimal Clinically Important Change of the WAS (n = 335) for baseline-score groups
*Significant change between T0 and T1 (p < .001). **Inverted percentage change score: (change score / (max score - baseline score)) × 100. a Agreement with GPE: percentage of patients with 1) a GPE 'improved' in combination with a clinically important change in score, or 2) a GPE 'improved' in combination with no clinically important change in score. WAS, work ability score; n, number of patients; SD, standard deviation; min, minimum; max, maximum; CI, confidence interval; AUC, area under the curve; GPE, global perceived effect; MCIC, minimal clinically important change; OCP, optimal cut-off point

Authors' contributions
All authors contributed to the study conception and design. The literature search was performed by I Boekel, and data analysis was performed by I Boekel and AL Dutmer. The first draft of the manuscript was written by I Boekel, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Funding
No funding was received for conducting this study.

Conflicts of interest
The authors have no relevant financial or nonfinancial interests to disclose.

Ethics approval
The Medical Ethical Committee of the University Medical Center Groningen provided a waiver (M15.169472) for the data collection of the GSC with respect to medical ethical permission.

Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Informed consent
Informed consent was obtained from all individual participants included in the study.

Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.