Introduction

The prevalence of disability due to Low Back Pain (LBP) increases from the third decade of life onwards, peaking between the ages of 35 and 55 years [1]. LBP causes a large amount of absenteeism and lost work productivity [2]. This makes LBP the most common health problem in the European workforce. In the Netherlands, costs of LBP have been estimated at 1.7% of its Gross National Product [3].

The greatest potential for cost reduction is decreasing work absenteeism and disability due to LBP [4]. Absenteeism and disability at work are influenced by the work ability of a person [5]. Higher work ability is associated with less disability and pain, and higher quality of life [6]. The Work Ability Index (WAI) was developed as a measure for self-reported work ability. The Work Ability Score (WAS) is an item of the WAI and compares current work ability with lifetime best [7]. It is an acceptable brief alternative for the WAI in determining work ability [8]. Convergent validity between the WAI and WAS is sufficient [9]. Measurement properties are sufficient in a secondary vocational rehabilitation setting [10], but have not been analysed in secondary and tertiary spine care. The interference of chronic pain with daily activities can be assessed by the Pain Disability Index (PDI). The PDI has been validated in patients with chronic pain [11]. The PDI Work item (PDI-W) measures interference of chronic pain with the ability to engage in occupational activities. The PDI-W has not yet been validated either.

The WAS and PDI-W are Patient Reported Outcome Measures (PROMs). PROMs are highly recommended in clinical guidelines to assess the quality of care, treatment effects and change in health status from the patient’s perspective. Selection of PROMs should be based on the strength of relevant measurement characteristics (i.e. validity, responsiveness) [12]. To meet conditions for construct validity, a measurement instrument should be consistent with hypotheses regarding relationships with other measures. The ability to detect changes in health status within individuals over time (responsiveness) and interpretation of change scores are important characteristics of PROMs [13]. Minimal Clinically Important Change (MCIC) and measurement error (Smallest Detectable Change, SDC) can be used to interpret change scores. The MCIC is useful as this change score is perceived as beneficial and meaningful to patients [14].

Despite the usefulness of PROMs, these measurements can be a burden for patients and caregivers. Length of time to fill out the questionnaires, difficulty in completing them independently, and length of time to analyse the results were the most frequently mentioned reasons for not using the measurements [15]. Therefore, if measurement characteristics are sufficient, the WAS and PDI-W single items may be used in routine care instead of lengthy questionnaires. The aim of the present study was to assess construct validity, responsiveness, and MCIC of the WAS and PDI-W in patients with chronic LBP (CLBP).

Materials and methods

Study design

Data collection took place in the Groningen Spine Center, a university-based multidisciplinary secondary and tertiary care center in the Netherlands, as part of a research project called Groningen Spine Cohort (GSC) [16]. Patients were included from July 2015 to September 2018. Patients digitally filled out a set of questionnaires at baseline (T0) and 6 months follow-up (T1), including the WAS, PDI-W, a modified version of the short Copenhagen Psychosocial Questionnaire (COPSOQ II), the EuroQol-5D (EQ5D), the National Institutes of Health (NIH) Impact Stratification, and a Global Perceived Effect (GPE) scale. The Medical Ethical Committee of the University Medical Center Groningen provided a waiver (M15.169472) for the data collection of the GSC with respect to medical ethical permission. Patients signed informed consent after being informed of the purpose of the study. Data were handled in accordance with the guideline for Good Research Practice [17].

Patients and setting

Patients between 18 and 65 years old who were admitted to the Groningen Spine Center and had experienced lower back and/or leg pain for more than 12 weeks were included. Patients with insufficient understanding of the Dutch language or no Internet access were excluded. Patients who did not respond to follow-up questionnaires were excluded from the assessment of longitudinal validity. Retired and permanently disabled patients, and stay-at-home parents were also excluded from the analyses, because improvement of work ability was not a treatment goal for these patients. Care as usual was provided between baseline and follow-up. Treatment options were minimal intervention (pain education and information), referral for primary care treatment (e.g. physiotherapy), multidisciplinary rehabilitation, surgery, pain anaesthesiology treatment, and/or other if needed (e.g. referral to an additional medical specialist) [16].

Measurements

Primary measures

The WAS measures current work ability compared to lifetime best. Scores range from 0 (completely unable to work) to 10 (work ability at best period). Scores are classified as poor (0–5 points), moderate (6–7 points), good (8–9 points), and excellent (10 points) work ability. The WAS has been demonstrated to be a good brief alternative for the lengthy WAI in determining work ability [8]. Convergent validity is acceptable between the WAI and WAS [9]. Measurement properties in a secondary rehabilitation setting were sufficient [10].
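The WAS categories above can be expressed as a simple lookup. The following Python sketch is illustrative only; the function name `classify_was` is hypothetical and was not part of the study's analysis.

```python
def classify_was(score: int) -> str:
    """Map a Work Ability Score (0-10) to the category described in the text."""
    if not 0 <= score <= 10:
        raise ValueError("WAS must lie between 0 and 10")
    if score <= 5:
        return "poor"        # 0-5 points
    if score <= 7:
        return "moderate"    # 6-7 points
    if score <= 9:
        return "good"        # 8-9 points
    return "excellent"       # 10 points
```

For example, a score of 4 points is classified as poor work ability.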

The PDI-W measures self-reported work disability, including paying and non-paying jobs. The wording of PDI-W is “This category refers to activities that are part of or directly related to one’s job. This includes non-paying jobs as well, such as that of a housewife or volunteer”. Patients rate their level of pain interference on each item on a scale ranging from 0 (no pain interference) to 10 (total pain interference). Patients can also choose the option ‘not applicable’. The PDI-W is an item of the PDI, which consists of 7 items. The PDI is internally consistent (Cronbach’s alpha = 0.87) with good convergent validity. Test–retest reliability is good (Intraclass Correlation Coefficient [ICC] = 0.76) [11]. Measurement properties of the PDI-W are unknown.

Secondary measures

Absenteeism was measured for employed patients by calculating the percentage missed hours of work in the past month (0 to 100%). Missing data on weekly work hours (in 14% of employed patients) and weekly work days (in 7% of employed patients) were handled with multiple imputation. Constraints were set on the minimum and maximum number of work hours (between 2 and 40 h) and work days (between 1 and 5) per week.

Quality of life was measured with the EQ5D consisting of five dimensions: mobility, self-care, activities of daily living, pain/discomfort, and anxiety/depression. These dimensions are measured on three levels (no problems, some problems, extreme problems) [18]. EQ5D scores can be transformed into a utility value by means of a validated algorithm, ranging from -0.33 to 1.00 [19]. Reliability and validity of the EQ5D are acceptable [20].

Psychosocial work environment was measured with the COPSOQ II, which consists of 25 items covering the domains work demands, work organization and job contents, leadership and interpersonal relationships, work-individual interface, and values at the workplace. Test–retest reliability is adequate to good [21]. The scale considering work demands was used in the present study (6 questions, 0–4, 4 being highest self-reported work demands), consisting of items concerning work pace, quantitative demands, and emotional demands.

The NIH Impact Stratification measures self-reported impact of pain on functioning. The Impact Stratification total score is the sum of nine items: a Numeric Rating Scale (NRS) of pain intensity (range 0–10), four items on physical functioning (range 1–5), and four items on pain interference (range 1–5). The total score ranges from 8 (least impact) to 50 (most impact). The impact is classified as mild (8–27 points), moderate (28–34 points), or severe (≥ 35 points) [22]. Test–retest reliability is sufficient and longitudinal validity is adequate [23, 24]. The NRS pain score was also used as a single item in the present study.
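The scoring described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the scoring routine used in the study: the function names are hypothetical, and a total of exactly 35 points is assigned to the severe band.

```python
def impact_total(pain_nrs, physical_items, interference_items):
    """Sum the nine items: one NRS (0-10) plus two sets of four items (1-5)."""
    if not 0 <= pain_nrs <= 10:
        raise ValueError("pain NRS must lie between 0 and 10")
    for items in (physical_items, interference_items):
        if len(items) != 4 or any(not 1 <= i <= 5 for i in items):
            raise ValueError("expected four items scored 1-5")
    return pain_nrs + sum(physical_items) + sum(interference_items)


def impact_class(total):
    """Classify the total score (8-50) as mild, moderate, or severe."""
    if not 8 <= total <= 50:
        raise ValueError("total must lie between 8 and 50")
    if total <= 27:
        return "mild"
    if total < 35:
        return "moderate"
    return "severe"  # assumption: 35 points counts as severe
```

The lowest possible total is 0 + 4 + 4 = 8 and the highest is 10 + 20 + 20 = 50, which matches the stated range.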

For work status, patients answered the question whether they were currently employed. If yes, patients were asked about the status of the employment (working, sick leave, partial sick leave). In addition, an item of the NIH minimal dataset concerning permanent disablement was used [22].

The GPE was used as external criterion. The GPE measures patient-rated assessment of treatment outcome [25]. The question “How much did your treated complaints change compared with pre-treatment level?” was answered at 6 months follow-up (T1). Responses range from 0 to 6 on a 7-point Likert scale: 0, extremely worsened; 1, much worsened; 2, little worsened; 3, unchanged; 4, little improved; 5, much improved; and 6, completely improved. Strong correlations are reported between GPE scores and changes in disability and pain [26]. There is limited evidence for validity of the GPE scale [27]. Nevertheless, the GPE is recommended and considered a suitable anchor [28].

Data analyses

Patient characteristics were reported using descriptive statistics. Continuous data are presented as means and standard deviations or medians and Interquartile Range (IQR), depending on data distribution. Categorical data are reported as frequencies with percentages. P < 0.05 was considered statistically significant. Statistics were performed using IBM SPSS for Windows (version 24.0, Armonk, NY: IBM Corp.).

Construct validity was assessed using elaborate hypothesis testing by calculating Pearson or Spearman’s rank correlation coefficients between the WAS and PDI-W on the one hand, and absenteeism, EQ5D, COPSOQ II work demands, NIH Impact Stratification, NRS pain score, and items concerning work status on the other hand. The strength of the association is expressed as the absolute value of the correlation coefficient. Correlation coefficients were interpreted as follows: 0.00–0.25 little if any, 0.26–0.49 weak, 0.50–0.69 moderate, 0.70–0.89 strong, and 0.90–1.00 very strong correlation [29]. Hypotheses were formulated based on previous studies. Weak correlations of the WAS and PDI-W with work demands [30], partial sick leave and permanent disablement [31, 32] were expected. We expected weak to moderate correlations of the WAS and PDI-W with absenteeism [5], NRS pain score [32], working [33], and sick leave [31, 32]. Finally, we expected moderate correlations of the WAS and PDI-W with quality of life [34] and impact of pain on functioning [35]. Correlations of the WAS and PDI-W with absenteeism and work demands were calculated for the ‘employed’ patient group (n = 901). Because a floor or ceiling effect of maximally 15% is considered acceptable [36], the hypothesis on floor and ceiling effects was not rejected if < 15% of patients achieved the lowest or highest possible score (0 or 10, respectively). Construct validity was considered supported when ≥ 80% of the predefined hypotheses were not rejected [36].
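As an illustration of the decision rules in this paragraph, the following Python sketch labels a correlation coefficient and applies the 80% criterion. The function names are hypothetical, and rounding the coefficient to two decimals before applying the bands is an assumption made here to avoid gaps between bands.

```python
def correlation_strength(r: float) -> str:
    """Label a correlation using the absolute value and the bands from the text."""
    a = round(abs(r), 2)  # assumption: bands apply to two-decimal values
    if a > 1:
        raise ValueError("correlation must lie between -1 and 1")
    if a <= 0.25:
        return "little if any"
    if a <= 0.49:
        return "weak"
    if a <= 0.69:
        return "moderate"
    if a <= 0.89:
        return "strong"
    return "very strong"


def construct_validity_supported(not_rejected: int, total: int,
                                 threshold: float = 0.80) -> bool:
    """True when the share of hypotheses not rejected reaches the threshold."""
    return not_rejected / total >= threshold
```

With ten predefined hypotheses, at least eight must remain unrejected for construct validity to be considered supported.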

For hypothesis testing, we used the COSMIN Study Design checklist for Patient reported outcome measurement instruments [37]. The section 'Hypotheses testing for construct validity' includes the following requirement: "Perform the analysis in a sample with an appropriate number of patients (taking into account expected number of missing values)". A patient sample of ≥ 100 patients is considered “very good”. The patient samples for hypothesis testing in the present study varied from 884 to 1502 patients. Therefore, we did not include a sample size and power calculation.

Responsiveness and MCIC were calculated in accordance with the Consensus-based Standards for the Selection of Health Measurement Instruments (COSMIN) criteria [13, 37]. For this purpose, GPE scores were dichotomized into two categories: improved (i.e. much improved and completely improved) and unimproved (all other scores). The area under the Receiver Operating Characteristic (ROC) curve (AUC) was calculated with a 95% Confidence Interval (CI). AUCs were interpreted as excellent 1.0–0.90, good 0.90–0.80, fair 0.80–0.70, poor 0.70–0.60, and failed 0.60–0.50 [38]. Therefore, an AUC ≥ 0.70 was considered responsive. The Optimal Cut-off Point (OCP) of the ROC curve was determined to calculate the MCIC. This is the point closest to the top-left corner of the ROC curve, which maximizes the proportion of correctly classified patients according to the GPE. The sum of squares of 1-sensitivity and 1-specificity is minimal at this point [39]. Cut-off values depend on the baseline level of a measurement [40]. To take the effect of baseline scores into account, secondary analyses were performed measuring responsiveness and MCICs for different baseline-score groups. Three subgroups were formed based on tertiles of baseline scores on the WAS and PDI-W. In addition, a percentage change score was calculated for the PDI-W to take the effect of baseline scores into account. For the WAS, an inverted percentage change score was calculated since calculation of a percentage change score resulted in a high frequency of missing values due to division by zero. The inverted percentage change score was calculated by the following equation:

$$\text{Inverted percentage change score} = \frac{\text{change score}}{\text{maximum score} - \text{baseline score}} \times 100$$
(1)
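Equation 1 can be illustrated with a short Python sketch. The function name is hypothetical, and returning `None` for patients who already score the maximum at baseline is an assumption; the text does not state how such cases were handled.

```python
def inverted_percentage_change(baseline, follow_up, maximum=10):
    """Change score as a percentage of the room left for improvement (Eq. 1)."""
    room = maximum - baseline
    if room == 0:
        # Assumption: undefined when the baseline already equals the maximum.
        return None
    return (follow_up - baseline) / room * 100
```

A patient improving from 4 to 7 points gains 3 of the 6 points still available, giving 50%. A patient with a baseline of 0 who improves to 5 points also scores 50%, whereas an ordinary percentage change would require division by the baseline of zero.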

The Standard Error of Measurement (SEM) and Smallest Detectable Change (SDC) were calculated to determine measurement error. The SEM represents the standard deviation of repeated measurements in one patient. The SEM was calculated as the square root of the sum of the variance between baseline and 6 months follow-up and the variance due to random error (Eq. 2) in patients reporting unimproved scores on the GPE [41]. The SPSS VARCOMP analysis command was used to retrieve the variance component values.

$$\text{SEM} = \sqrt{\sigma_{\text{time}}^{2} + \sigma_{\text{error}}^{2}}$$
(2)

The SDC represents the minimal change in score that a patient has to show to ensure that the observed change is real and not caused by measurement error. To distinguish clinically important change from measurement error, SDC should be smaller than the MCIC. The SDC can be calculated by the following equations on individual (Eq. 3) and group level (Eq. 4), in which 1.96 refers to a 95% CI and \(\sqrt 2\) is a correction for repeated measurements [41, 42]:

$$\text{SDC}_{\text{individual}} = 1.96 \times \sqrt{2} \times \text{SEM}$$
(3)
$${\text{SDC}}_{{{\text{group}}}} = \frac{{{\text{SDC}}_{{{\text{individual}}}} }}{\sqrt n }$$
(4)
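Equations 2 to 4 can be chained as in the Python sketch below. It is a minimal illustration, not the SPSS procedure used in the study; the function names are hypothetical, and the variance components are assumed to be estimated elsewhere.

```python
import math


def sem(var_time: float, var_error: float) -> float:
    """Standard Error of Measurement from two variance components (Eq. 2)."""
    return math.sqrt(var_time + var_error)


def sdc_individual(sem_value: float) -> float:
    """Smallest Detectable Change for an individual patient (Eq. 3)."""
    return 1.96 * math.sqrt(2) * sem_value


def sdc_group(sdc_ind: float, n: int) -> float:
    """Smallest Detectable Change at group level for n patients (Eq. 4)."""
    return sdc_ind / math.sqrt(n)
```

Because 1.96 × √2 is approximately 2.77, the individual SDC is always a little under three times the SEM, while the group-level SDC shrinks with the square root of the sample size.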

Results

Patients

Baseline data were available for 1502 patients. Demographic and clinical variables are presented in Table 1. The median of the WAS was 4.0 (IQR 1.0–6.0) points, and the median of the PDI-W was 7.0 (IQR 5.0–8.0) points.

Table 1 Patient characteristics at baseline (n = 1502)

Hypothesis testing

For the WAS, 7 of 10 hypotheses were not rejected (70%) (Table 2). For the PDI-W, 6 of 10 hypotheses were not rejected (60%). Therefore, construct validity of the WAS and PDI-W was not supported.

Table 2 Correlations between the WAS (n = 1502) and PDI-W (n = 1381) and reference tests, and floor and ceiling effects

Responsiveness and minimal clinically important change

The WAS showed fair responsiveness with an AUC of 0.70 (0.63–0.76). The PDI-W showed good responsiveness with an AUC of 0.80 (0.74–0.87) (Table 3; Fig. 1). The MCIC was 1.5 points for the WAS with a sensitivity of 0.67 and specificity of 0.61. The MCIC was −2.5 points for the PDI-W with a sensitivity of 0.74 and specificity of 0.75.
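The cut-off selection behind these MCIC values (the ROC point closest to the top-left corner) can be sketched in plain Python as follows. This is an illustration with invented toy data, not the study's SPSS analysis. The function names are hypothetical, candidate cut-offs are assumed to lie midway between adjacent observed change scores, and higher change scores are assumed to indicate improvement, so PDI-W change scores would need to be negated first.

```python
def roc_cutoff(changes, improved):
    """Return (cutoff, sensitivity, specificity) for the ROC point that
    minimizes (1 - sensitivity)^2 + (1 - specificity)^2."""
    pos = [c for c, y in zip(changes, improved) if y]
    neg = [c for c, y in zip(changes, improved) if not y]
    values = sorted(set(changes))
    # Assumption: cut-offs are midpoints between adjacent observed scores.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for cut in candidates:
        sens = sum(c > cut for c in pos) / len(pos)
        spec = sum(c < cut for c in neg) / len(neg)
        dist = (1 - sens) ** 2 + (1 - spec) ** 2
        if best is None or dist < best[0]:
            best = (dist, cut, sens, spec)
    return best[1:]


def auc(changes, improved):
    """AUC as the probability that an improved patient has a larger change
    score than an unimproved patient (ties count as one half)."""
    pos = [c for c, y in zip(changes, improved) if y]
    neg = [c for c, y in zip(changes, improved) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: change scores and the dichotomized anchor (True = improved).
toy_changes = [-1, 0, 0, 1, 1, 2, 3, 4]
toy_improved = [False, False, False, False, True, False, True, True]
toy_cut, toy_sens, toy_spec = roc_cutoff(toy_changes, toy_improved)
```

Under the midpoint assumption, cut-offs for an item scored in whole points fall on half points, which would be consistent with MCIC values such as 1.5 and 2.5.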

Table 3 Responsiveness and Minimal Clinically Important Change in the WAS (n = 335) and PDI-W (n = 297)
Fig. 1 Receiver Operating Characteristic (ROC) curves of the Work Ability Score (WAS) and Pain Disability Index – Work item (PDI-W). AUC, area under the curve. A. ROC curve of WAS (n = 335). B. ROC curve of PDI-W (n = 297)

Baseline-score groups

For the PDI-W, MCICs were −0.5, −2.5, and −4.5 points for, respectively, low, middle, and high scoring baseline groups (Table 4). The MCIC of the percentage change score was 56%. For the WAS, MCICs were 4.5, 1.5, and −0.5 points for, respectively, low, middle, and high scoring baseline groups (Table 5). The MCIC of the inverted percentage change score was 38%.

Table 4 Responsiveness and Minimal Clinically Important Change in the PDI-W (n = 297) for baseline-score groups
Table 5 Responsiveness and Minimal Clinically Important Change of the WAS (n = 335) for baseline-score groups

Measurement error

The SEM was 1.9 points for the PDI-W with an SDCindividual of 5.2 and an SDCgroup of 0.3 points. The SEM for the WAS was 1.8 points with an SDCindividual of 4.9 and an SDCgroup of 0.3 points.

Discussion

The aim of this study was to assess construct validity, responsiveness, and MCIC of the WAS and PDI-W in patients with CLBP. For the WAS and PDI-W, respectively, 70% and 60% of predefined hypotheses were not rejected, which is lower than the threshold that was set at ≥ 80%. Therefore, construct validity was not supported. The WAS and PDI-W are responsive to change. MCICs of 1.5 points (WAS) and −2.5 points (PDI-W) were found. Nevertheless, clinically important change could not be distinguished from measurement error, since MCICs were smaller than SDC values. Individual change scores up to 5 points should be interpreted with caution.

For construct validity, the rejection of more hypotheses than expected can be explained by several reasons. For the WAS, measurement scales of reference instruments might have contributed to the rejection of hypotheses. The WAS asks to compare current work ability to lifetime best, whereas reference instruments only ask for current functioning. Consequently, loss of functioning might have been scored differently, resulting in lower correlations. Additionally, the work demands hypothesis focused on work pace, emotional and quantitative demands (i.e. evaluation of time available to finish work), because these questions were classified as work demands by the COPSOQ II questionnaire. In retrospect, this construct could have been expanded by inclusion of physical demands, commitment to work, and job satisfaction. These factors are considered important in predicting work (dis)ability [30]. Hypotheses on partial sick leave were rejected. Scores on the WAS and PDI-W for patients on sick leave or fully working were heavily skewed to the lower or higher end of the scale, whereas scores for patients on partial sick leave were normally distributed with a high variance. Therefore, we observed lower correlations than hypothesized a priori. Finally, in hindsight, permanent disablement was an insufficient reference test. The majority of permanently disabled patients scored the PDI-W as ‘not applicable’, because this item was irrelevant to these patients.

Floor (WAS; 25%) and ceiling (PDI-W; 15%) effects were also observed, both indicating the most severe interference of LBP. These effects might have affected correlations with reference tests. Data were collected from patients receiving secondary and tertiary multispecialty care. Consumption of medical care and the influence of LBP on work ability are higher in this patient sample, compared to patients receiving primary-level care [16]. Therefore, the WAS and PDI-W might not be adequate instruments for distinguishing work ability levels in patients with severe CLBP. Further research should investigate the validity of these items in patients receiving primary-level care.

Regarding longitudinal validity, measurement error should be considered in decision-making in individual patients. The SDCindividual for the WAS (4.9 points) and PDI-W (5.2 points) both exceeded the MCIC values (respectively, 1.5 and −2.5 points). This corresponds with results of previous research on PROMs in back pain [11, 43]. Individual change scores larger than the MCIC but smaller than the SDCindividual should be interpreted with caution. These scores fall within the measurement error, which results in the risk of incorrect classification of patients as improved. The WAS and PDI-W are better at detecting changes at a group level, as the SDCgroup was smaller than the SDCindividual. Results of the present study indicate that small changes in work ability can be considered important by CLBP patients. Because CLBP is very disabling [7, 8], a small improvement can have a meaningful effect on the well-being of patients.

For interpretation of individual change scores, the effect of baseline scores should be taken into account [40]. Higher (PDI-W) or lower (WAS) baseline values (both indicating worse work ability) require higher MCIC values, because there is a greater potential for improvement [44]. The results of the present study confirm that MCICs for the WAS and PDI-W are baseline dependent. This is supported by the (inverted) percentage change scores, which were 39% (WAS) and 56% (PDI-W).

Patient burden is an important consideration in selecting measurement instruments. If patient burden is decreased by using single items instead of lengthy questionnaires, then slightly less sufficient measurement characteristics might be acceptable. This may be the case, for example, when patients have to fill out multiple questionnaires, or when work ability trends are assessed in frequent evaluations (e.g. daily or weekly). In addition, the WAS can be considered for use at group level and in large-scale surveys [45]. The WAS is also suitable for systematic application during medical examinations in occupational health care or in public health surveys [9].

A methodological consideration is the dichotomization of the external criterion into improved and unimproved patient groups. The improved group consisted of patients reporting to be much improved and completely improved. Only 20% of patients were classified as improved. Previous research stated that little improved patients can be added to the improved group [46]. However, other research stated that little improvement is in the range of natural fluctuation [47]. When little improved patients are considered improved, accuracy to differentiate between improved and unimproved patients decreases [44]. In order to better reflect the concept of meaningful improvement, little improved patients were not classified as improved.

In addition, the patient sample was chosen based on relevance, because increasing work ability is not a treatment goal for all patients. Therefore, retired and permanently disabled patients, and stay-at-home parents were excluded. We included these patients in a sensitivity analysis to test the accuracy of the applied relevance criterion. This analysis yielded the same MCICs for the total group and baseline-score groups as those found for the patient sample selected on relevance. Only the PDI-W percentage change score differed, which was 41% instead of 56%. The PDI-W takes unpaid work into account, such as housework or volunteer work. Such work is also carried out by the excluded patients. It is possible that small improvements in interference of pain with unpaid work are considered important, resulting in a lower percentage change score.

The effect of treatment should also be included in future research. For measurement of responsiveness, it is required to ensure that a proportion of patients is likely to change [37]. Because it was unknown how many patients received treatment, when, and of what type, it would normally be difficult to predict whether a proportion of patients is likely to change within the utilized time interval of one year between baseline and follow-up. However, previous studies on patients from the GSC have shown that approximately a third show clinically relevant improvement on measures of disability and impact of LBP one year after baseline measurement [16, 48]. Therefore, we expected a similar proportion of our patient sample to improve on work ability during follow-up.

Conclusion

Based on predefined hypotheses, construct validity of the WAS and PDI-W was not supported in this patient sample. The WAS and PDI-W are responsive to change in CLBP patients. Overall, a change of 1.5 points (WAS) and −2.5 points (PDI-W) can be interpreted as clinically important change. However, MCIC values are baseline dependent. Clinically important change cannot be distinguished from measurement error, since MCICs were smaller than SDC values. Therefore, cautious interpretation is necessary for individual change scores up to 5 points.