Introduction

The assessment of whether an employee is able to participate in work is complex (Slebus et al. 2007). According to the World Health Organization’s International Classification of Functioning, Disability, and Health (ICF), participation depends on the following five components: disease and disorder, functions and structures, activities, environmental factors, and personal factors (WHO 2001). In case of a disease or disorder, the assessment of whether or not a patient is able to work is often performed by physicians and is traditionally based on legislation, administrative rules, and the physicians’ expertise (De Boer et al. 2009). These assessments are performed for return-to-work decisions and for disability claim assessments. For most physicians, these assessments consist of a comparison between the work ability of a patient and the required demands of a job (Söderberg and Alexanderson 2005; Slebus et al. 2007). Where the work ability matches the job demands, a person is considered to be able to participate in work. Since there are few instruments available to support physicians in these assessments, it is not surprising that the reliability (a major indicator of an instrument’s measurement quality) of these assessments performed by physicians specifically trained for these tasks varied between “poor” and “good” (Brouwer et al. 2003; Spanjer et al. 2010; Slebus et al. 2010).

For the assessment of work ability in patients with musculoskeletal disorders (MSDs), reliable questionnaires and performance-based measures are available (Wind et al. 2005). A theoretical advantage of the use of performance-based measures over questionnaires might be that the face validity is higher: after all, a client performs work-related activities in a specific environmental context (Soer et al. 2008). In line with this assumption, Wind et al. (2009a) showed that performance-based information was found to have complementary value in the assessment of the physical work ability of claimants with MSDs according to 68% of the physicians. In addition, these same physicians changed their judgment of the physical work ability of claimants with MSDs in the context of disability claim procedures more often when performance-based outcomes were provided versus traditional information obtained from anamnesis and the medical file (Wind et al. 2009b). Despite these supportive findings for the use of performance-based measures in the assessment for work participation in patients with MSDs, a recent Cochrane review concluded that there is no evidence available for or against the effectiveness of performance-based measures compared with no assessment as intervention for preventing occupational re-injuries in workers with MSDs (Mahmud et al. 2010). The predictive validity of these measures for work participation, however, was not studied. Until now, it is only known that the assessment of work ability in patients with MSDs using a patient questionnaire, a clinical examination by a physician, or performance-based measures resulted in large differences regarding the estimated work ability (Brouwer et al. 2005). The questionnaire resulted in the highest number of work limitations and the performance-based measures in the lowest.
Therefore, to shed more light on the predictive validity of performance-based measures for the participation in work, a systematic review was performed to answer the following question: “How well do performance-based measures predict work participation in patients with MSDs?” As far as we know, this review is the first on the predictive validity of performance-based tests for work participation since the review of Innes and Straker (1999). Their review demonstrated a paucity of studies focussing on predictive validity. The answer to the research question is relevant because few instruments are available to support physicians in work ability assessments and performance-based measures are not often used (De Boer et al. 2009; Wind et al. 2006), probably partly due to their unknown value for work participation.

Methods

A systematic review of the literature was performed. The following five sources of information were used to retrieve relevant studies: PubMed (until October 21, 2010), Embase (until October 21, 2010), the reference list of Chapter 21 of the American Medical Association Guide to the Evaluation of Functional Ability (Genovese and Galper 2009), the reference lists of the included papers, and relevant papers suggested by the authors based on their expertise and their personal files. The search terms for PubMed and Embase are listed in "Appendix A" and were based on the PubMed prognosis filter and the search terms for work as suggested by Schaafsma et al. (2006).

After checking for duplicates, the following inclusion criteria were applied to the title and abstract by two reviewers (PK and VG or MFD):

  • The paper is a primary study;

  • The population of interest consists of employees with MSDs;

  • The study design is a prospective or retrospective cohort study or an intervention study (in the latter case, the data of the group tested with a performance-based measure were used);

  • The paper describes a reliable physical test of performance;

  • The outcome measure is work participation, such as return to work or being employed, or a surrogate like the termination of a disability claim;

  • The result of a physical test of performance is statistically related to the outcome measure;

  • The paper is written in English, Dutch, German, French, or Italian.

If title and abstract did not provide enough information to decide whether the inclusion criteria were met, the full paper was checked. Next, the inclusion criteria were applied to the full paper. When doubts existed about whether a paper fulfilled the inclusion criteria, one other researcher (VG or MFD) was consulted and a decision was made based on consensus. Finally, the references of the included papers were also checked for other potentially relevant papers.

Quality description

The quality description of the selected studies was based on an established criteria list for assessing the validity of prognostic studies, as recommended by Altman (2001) and modified by Scholten-Peeters et al. (2003) and Cornelius et al. (2010). This list consisted of 16 items, each having yes/no/don’t know answer options. This modified criteria list is presented in "Appendix B". The quality of all included studies was independently scored by two reviewers (PK, VG). If the study complied with the criterion, the item was rated with one point. If the study did not comply with the criterion or when the information was not described or unclear, then the item was rated with zero points. In case of disagreement, the two reviewers came to a decision through mutual agreement. For the total quality score, all points of each study were added together (maximum score is 16 points). Studies achieving a score of at least 13 points (≥81%) were considered to be of good quality, at least 9 (56%) and a maximum of 12 points (75%) of moderate quality, and those with 8 points (50%) or less of low quality.
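The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the reviewers' actual procedure: the function names `total_score` and `quality_label` and the answer strings are assumptions made for the example; only the 16-item maximum and the cut-off points come from the text.

```python
# Hypothetical helpers illustrating the quality scoring described in the text.
MAX_SCORE = 16  # the criteria list has 16 items


def total_score(answers):
    """Sum the points for one study: 'yes' scores 1, anything else scores 0.

    `answers` is a list of 16 strings ("yes", "no", or "don't know").
    """
    if len(answers) != MAX_SCORE:
        raise ValueError("expected one answer per criterion (16 items)")
    return sum(1 for a in answers if a == "yes")


def quality_label(score):
    """Map a total score to the quality category used in the review."""
    if not 0 <= score <= MAX_SCORE:
        raise ValueError("score must lie between 0 and 16")
    if score >= 13:       # at least 13 points (>= 81%)
        return "good"
    if score >= 9:        # 9 to 12 points (56% to 75%)
        return "moderate"
    return "low"          # 8 points (50%) or less


example = ["yes"] * 12 + ["no"] * 3 + ["don't know"]
label = quality_label(total_score(example))  # 12 points, moderate quality
```

Unclear or undescribed items are scored as zero here, in line with the rule that only demonstrated compliance earns a point.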

Data extraction

Data were extracted by the first author (PK) using a standardized form. The following information was extracted: primary author, year of publication, country, study design (cohort (retrospective or prospective) or intervention), characteristics of the population (i.e., number of employees, age, and type of MSD), description of the treatment, description of the reliable performance-based test, the confounders taken into account, the main result of the study regarding the performance-based test and work participation, and a summary of whether the test was significantly related to work participation (yes, no). A distinction was made between studies with good, moderate, and poor quality based on the quality description.

Evidence synthesis

For the best evidence synthesis, we used the following rules adapted from Van Tulder et al. (2003) and De Croon et al. (2004): (1) if there are four or more studies, the statistically significant findings of 75% or more of the studies in the same direction were taken into account; (2) if there are three studies, the statistically significant findings of at least two studies in the same direction were taken into account; (3) if there are two studies, the statistically significant findings of both studies in the same direction were taken into account; (4) if there is one study, the statistically significant finding was taken into account. Otherwise, the evidence is “conflicting” regarding the relation between a performance-based measure and work participation. In addition, using the methodological quality scores, the corresponding level of evidence was scored as strong where the result is based on at least two good-quality studies, moderate in case of one good-quality study, and limited in all other cases.
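The decision rules above can be expressed as a short Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `Study`, `required_agreeing`, and `synthesize` are hypothetical, a study without a statistically significant finding is represented by `direction=None`, and the number of good-quality studies is counted among the studies that share the consistent direction (the text does not spell out this last point).

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Study:
    quality: str               # "good", "moderate", or "low"
    direction: Optional[str]   # direction of a significant finding, or None


def required_agreeing(n: int) -> int:
    """Minimum number of studies with significant findings in one direction."""
    if n < 1:
        raise ValueError("at least one study is required")
    if n >= 4:
        return (3 * n + 3) // 4   # 75% of n, rounded up
    if n == 3:
        return 2
    return n                      # one study: 1; two studies: both


def synthesize(studies) -> Tuple[str, Optional[str]]:
    """Return (level of evidence, direction), or ("conflicting", None)."""
    needed = required_agreeing(len(studies))
    counts = Counter(s.direction for s in studies if s.direction is not None)
    if not counts:
        return ("conflicting", None)
    direction, agreeing = counts.most_common(1)[0]
    if agreeing < needed:
        return ("conflicting", None)
    # Assumption: only good-quality studies supporting the direction count.
    good = sum(1 for s in studies
               if s.quality == "good" and s.direction == direction)
    if good >= 2:
        return ("strong", direction)
    if good == 1:
        return ("moderate", direction)
    return ("limited", direction)


# Illustrative input: four good-quality and thirteen moderate-quality studies
# with a significant finding in the same direction, one good-quality study without.
studies = ([Study("good", "predictive")] * 4
           + [Study("good", None)]
           + [Study("moderate", "predictive")] * 13)
result = synthesize(studies)
```

Because the level of evidence depends only on good-quality studies, any number of moderate- or low-quality studies on their own yields "limited" at most.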

Results

Search strategy

The search strategy resulted in 588 studies in PubMed and 642 studies in Embase. A total of 167 duplicate studies were found in these two databases. After applying the inclusion criteria to the remaining 1,063 studies, 17 studies remained. Chapter 21 “The scientific status of functional capacity evaluation” of the American Medical Association Guide to the Evaluation of Functional Ability did not result in an additional study. Neither did the experts suggest any additional studies that fulfilled the inclusion criteria. Finally, checking the references of the included studies resulted in one more study, making a total of 18 studies from eight countries: Canada, China, Germany, the Netherlands, Norway, Switzerland, and the United States of America.

Quality of the studies

The two raters agreed on a total of 261 of the 288 items (91%) for the 18 studies, with a mean difference of 1.5 per paper (SD 1.7, range 0–4). After reaching consensus, five (28%) of the 18 studies were of good quality and the remaining thirteen (72%) of moderate quality (Table 1). The mean quality score was 12 (SD = 2, range 9–14). The four quality criteria that received the fewest points across all studies were as follows: (1) the participants were not recruited during the same uniform period in time after, for instance, sick leave (1 out of 18 points), (2) no description of the relevant characteristics of the completers and the drop outs (8 out of 18 points), (3) no multivariate analysis was performed taking into account possible confounders (9 out of 18 points), and (4) the treatment was not described and/or standardized (9 out of 18 points).
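The agreement figures above follow from simple arithmetic, shown here as a small Python check. The variable names are illustrative, and reading the "mean difference per paper" as the mean number of items on which the raters disagreed is an assumption.

```python
n_studies = 18   # included studies
n_items = 16     # criteria per study
agreed = 261     # items on which both raters gave the same rating

total_items = n_studies * n_items                            # 288 rated items
percent_agreement = 100 * agreed / total_items               # about 90.6, reported as 91%
mean_disagreements = (total_items - agreed) / n_studies      # 27 / 18 = 1.5 per paper
```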

Table 1 Quality description of the included studies according to the criteria of study population (inception cohort, description of source population, description of inclusion/exclusion criteria), follow-up (at least 12 months, drop outs, description of completers and drop outs, design of the study), treatment (standardized), prognostic factors (relevant, valid, presented), outcome (relevant, valid, presented), analysis (univariate, multivariate), and the quality score (good ≥ 13, 9 ≤ moderate ≤ 12) (See also "Appendix B" for definitions)

Characteristics of the studies

The 18 studies reported on 4,113 participants (median = 147, IQR = 152, range 30–650) (Table 2). Ten studies reported on patients with low back pain, six on patients with MSDs in general, and one on patients with upper extremity disorders. In one study, the type or region of the MSDs was not specified. In at least 78% of the studies (14/18), the MSDs were described as chronic. Seventeen of the 18 studies took place in a rehabilitation setting and one in an occupational setting. The median follow-up period of the studies was 12 months (IQR = 3, range 3–30 months). Type of treatment was described in 50% (9/18) of the studies. The other studies only described the care provider or gave no description. In 67% of the studies (12/18), confounders were taken into account to establish the relation between performance-based measures and work participation. The median number of confounders taken into account was 3 (SD = 5, range 0–14). The confounders varied between disease characteristics like pain intensity, pain-related disability or depression, personal characteristics like age, work-related recovery expectations, or being a breadwinner, and work characteristics like physical work demand level, pre-injury annual salary, or organizational policies and practices.

Table 2 The extracted data from the included studies: primary author, year of publication, country, study design (cohort (retrospective or prospective) or intervention), characteristics of the population (i.e., number of employees, age, and type of MSD), the treatment given, description of the reliable performance test, the confounders taken into account, the main outcome for work participation, and a summary of whether the test protocol is significantly related to work participation (yes, no, unclear)

Performance-based tests and work participation

Thirteen out of the 18 studies used a so-called functional capacity evaluation (FCE): nine studies used the Workwell System (formerly Isernhagen Work Systems), one used the BT Work Simulator, one the ErgoKit, one the Dictionary of Occupational Titles residual FCE, and one the Physical Work Performance Evaluation (Table 2). In five of these thirteen studies, a limited number of tests of the total FCE were used. The other five studies used tests or combinations of tests such as a step test, a lift test, or a trunk strength tester. Two studies combined the results of the performance-based test with non-performance-based outcomes like pain and Waddell signs (Bachmann et al. 2003; Kool et al. 2002).

Four of the five good-quality studies (80%) reported that a better result on a performance-based measure was predictive of work participation: one study on return to work and three studies on suspension of benefits and claim closure (Table 2). Three of these good-quality studies found no effect on sustained return to work. One good-quality study found no effect on work participation in terms of sustained return to work. All thirteen studies (100%) of moderate quality reported that performance-based measures were predictive of work participation: seven studies in terms of being employed, or (sustainable) return to work, four studies on being unemployed or non-return to work, and two studies on days to benefit suspension or claim closure.

Discussion

Methodological considerations

Selection bias and publication bias are two concerns worthy of attention when performing a systematic review. To overcome selection bias, we used five sources of information: two databases, the American Medical Association Guide to the Evaluation of Functional Ability (Genovese and Galper 2009), references of the included papers, and relevant papers suggested by the authors. The sensitivity of our search strategy for the databases was supported by the fact that checking the references of the included studies for other potentially relevant papers resulted in only one extra study. Moreover, the authors, who have published several papers on performance-based measures, could not add other studies. Regarding publication bias, this review found three studies (Gross and Battié 2004, 2005, 2006) that reported that performance-based measures of the Workwell System were not predictive of sustained return to work in patients with chronic low back pain and with upper extremity disorders. However, more studies using the same performance-based measures (Workwell System) in similar and different patient populations also reported a significant predictive value for work participation in terms of return to work (Matheson et al. 2002; Vowles et al. 2004; Streibelt et al. 2009) and in terms of temporary disability suspension and claim closure (Gross et al. 2004, 2006; Gross and Battié 2005; Branton et al. 2010). Therefore, there appears to be no publication bias regarding the most described performance-based measure. To prevent publication bias resulting in a higher level of evidence due to studies of less than good quality, the evidence synthesis was formulated in such a way that regardless of the number of studies of moderate or poor quality, the qualification remained “limited”.
This stringent evidence synthesis was also used to do justice to the heterogeneity of the included studies regarding not only the different performance-based tests and outcome measures for work participation but also for differences regarding chronic and non-chronic patients with MSDs in different body regions, rehabilitation and occupational setting, and treatment and non-treatment studies.

Performance-based tests can be performed in patients with severe MSDs (pain intensity 7 out of 10 or higher). Patients with severe MSDs were indeed included in the studies. Of course, regardless of pain intensity, if a person is not willing to participate, then the reliability and the validity of the results should be reconsidered. In the included studies, participants were able to perform the tests and no comments were made about unwillingness to perform a test. In test practice, however, patients’ willingness to perform to full capacity is seldom a matter of 100 or 0% but almost always somewhere in between. None of the studies reported having controlled for level of effort. When looking at these tests as measures of behavior, it is plausible that physically submaximal effort has occurred, which is consistent with the definition of FCE and also observed in a systematic review by van Abbema et al. (2011).

Performance-based measures and work participation

The use of performance-based measures to guide decisions on work participation (pre- and periodic work screens, return-to-work, and disability claim assessments) is still under debate, at least in the Netherlands (Wind et al. 2006). This is not only due to the time-consuming nature of some of these assessments but also to the perceived limited evidence for their predictive value regarding work participation. Regarding the time-consuming nature, this study also showed that a number of tests were predictive of work participation: lifting tests (Gross et al. 2004; Gross and Battié 2005, 2006; Gouttebarge et al. 2009a; Hazard et al. 1991; Matheson et al. 2002; Strand et al. 2001; Vowles et al. 2004), a 3-min step test and a lifting test (Bachmann et al. 2003; Kool et al. 2002), a short-form FCE consisting of tests specific for the region of complaints (Gross and Battié 2006; Branton et al. 2010), and a trunk strength test (Mayer et al. 1986). A performance-based lifting test was most often used and appeared to be predictive of work participation in 13 of these 14 studies, especially a lifting test from floor-to-waist level in patients with chronic low back pain. An explanation might be that lifting reflects a large number of physically strenuous activities such as gripping, holding, bending, and of course lifting and lowering. Besides, van Abbema et al. (2011) showed that a “low lifting test” was not related to pain duration and showed conflicting evidence for associations with pain intensity, fear of movement/(re)injury, depression, gender, and age. Thereby, these lifting tests assess more than “just” physical components. Moreover, lifting is an important predictor of work ability in patients with MSDs (Martimo et al. 2007; Van Abbema et al. 2011). Additionally, it is plausible that “shared behaviors” occur between the tests, in which case the added value of extra tests decreases.
The selection of the lifting tests appears in line with the three-step model as suggested by Gouttebarge et al. (2010) to assess physical work ability in workers with MSDs more efficiently using a limited number of tests.

Regarding their predictive value, this study showed that strong evidence exists that a number of performance-based measures are predictive of work participation for patients with chronic MSDs, irrespective of whether it concerns complaints of the upper extremity, lower extremity, or low back. All patients in the included studies were considered able to perform these reliable tests, and no comments were made that patients were unwilling to perform these tests. Of course, one has to bear in mind that the results of the performance-based measures are often used in clinical decision making regarding work participation. Moreover, patients are often not blinded to the outcome of the test itself (Reneman and Soer 2010). Gross and Battié (2004, 2006) and Gross et al. (2004) adjusted their outcome for the recommendation of the physician and Streibelt et al. (2009) for the expectation of the patient. Nevertheless, they still found that a number of performance-based tests were predictive of work participation. It seems worthwhile to establish how physicians and patients take into account the results of the performance-based tests and other instruments in their decision making regarding work participation.

Finally, the studies in this review used outcome measures in terms of future work participation and/or future non-work participation. Although not all studies presented relevant statistics, it seemed that the predictive strength of performance-based measures is higher for non-work participation than for work participation. For instance, for non-work participation, the predictive quality varied between poor (Vowles et al. 2004; Streibelt et al. 2009), moderate (Bachmann et al. 2003; Streibelt et al. 2009), and good (Kool et al. 2002). For work participation, the predictive quality was mostly poor (Gross et al. 2004, 2006; Gross and Battié 2006; Gouttebarge et al. 2009a).

Future directions

A number of performance-based measures are predictive of work participation. Moreover, these measures differ from other relevant constructs such as pain intensity (Gross and Battié 2005; Gouttebarge et al. 2009b), self-efficacy (Reneman et al. 2008), self-reported disability (Brouwer et al. 2005; Gross and Battié 2005; Schiphorst Preuper et al. 2008; Gouttebarge et al. 2009b), and self-reported work status (Gross and Battié 2005). Also, the present study showed that potential confounders like pain intensity, work-related recovery expectations, and organizational policies and practices did not diminish the predictive validity of performance-based measures on work participation (see Table 2 “Confounders”). However, the predictive strength of performance-based measures is in general modest. Work participation is a multidimensional construct according to the ICF (WHO 2001). One cannot expect that a single instrument is able to assess such a multidimensional construct. Seen in this perspective, the conclusion of this review that the predictive validity of performance-based measures for work participation is “modest” may not be unexpected.

One way to improve the predictive strength might be combining performance- and non-performance-based measures that assess different constructs of work participation. Bachmann et al. (2003) and Kool et al. (2002) combined performance-based measures with high pain scores (9 or 10 on a scale from 0 to 10) or having more than 3 Waddell signs. Vowles et al. (2004) reported that patient age and level of depression were the factors best able to predict work participation. This suggests that a combination of reliable and valid measures of different constructs might improve the ability to predict work participation. Another strategy might be to make the tests more job-specific. Seventeen of the 18 studies took place in a rehabilitation setting. Generally speaking, this means that the performance-based measures are not specific for the physical demands of the future work of a patient. One study described performance-based measures resembling the physically demanding job of construction workers (Gouttebarge et al. 2009a). One study used a job demands analysis to establish a job-specific FCE (Cheng and Cheng 2010). By doing this, the minimal performance criterion that is required to perform the job is also specified. This might overcome the misconception that a better performance is always a better predictor for work participation. This information might especially be relevant for decisions regarding work participation in patients with MSDs working in physically demanding jobs (blue collar work) (Bos et al. 2002).