Construct Validity of Functional Capacity Evaluation in Patients with Whiplash-Associated Disorders

Purpose The construct validity of functional capacity evaluations (FCE) in whiplash-associated disorders (WAD) is unknown. The aim of this study was to analyse the validity of FCE in patients with WAD with cultural differences within a workers’ compensation setting. Methods 314 participants (42 % females, mean age 36.7 years) with WAD (grade I and II) were referred for an interdisciplinary assessment that included FCE tests. Four FCE tests (hand grip strength, lifting waist to overhead, overhead working, and repetitive reaching) and a number of concurrent variables such as self-reported pain, capacity, disability, and psychological distress were measured. To test construct validity, 29 a priori formulated hypotheses were tested: 4 related to gender differences, 20 related to associations with other constructs, and 5 related to cultural differences. Results Men had significantly more hand grip strength (+17.5 kg) and lifted more weight (+3.7 kg): two out of four gender-related hypotheses were confirmed. Correlations between FCE and pain ranged from −0.39 to 0.31; FCE and self-reported capacity from −0.42 to 0.61; FCE and disability from −0.45 to 0.34; FCE and anxiety from −0.36 to 0.27; and FCE and depression from −0.41 to 0.34: 16 of 20 hypotheses regarding FCE and other constructs were confirmed. FCE test results differed significantly between the cultural groups (4 hypotheses confirmed) and the effect sizes (ES) of the differences between correlations were small (1 hypothesis confirmed). In total, 23 out of 29 hypotheses were confirmed (79 %). Conclusions The construct validity for testing functional capacity was confirmed for the majority of FCE tests in patients with WAD with cultural differences and in a workers’ compensation setting. Additional validation studies in other settings are needed for verification.


Introduction
The term whiplash-associated disorders (WAD) has been coined for symptoms related to acceleration-deceleration injuries usually associated with motor vehicle accidents [1]. These symptoms include neck pain, headache, arm pain, and other complaints [1]. The aetiology of WAD likely combines physical and psychological factors; nevertheless, the pathophysiology is not understood [2]. Although the prognosis of WAD is generally favourable, with a recovery rate of 40-60 % within the first 12 months, a considerable number of individuals with WAD still report symptoms and disability 1 year after the injury [3,4]. Delayed recovery from WAD causes a substantial burden for the individual and society due to long-term sickness absence and work disability [5].
According to the guidelines of the International Labor Organization, diseased or disabled persons should be assessed comprehensively to avoid an over-or underestimation of safe work (dis)ability [6]. Functional capacity evaluation (FCE) can be one of the tools included in such an assessment. FCE consists of standardised batteries of functional capacity tests that aim to measure the ability to engage in work-related functioning [7]. When discrepancies between FCE outcomes and the physical workload indicate that capacity is not large enough for the required work load, this capacity may be addressed in rehabilitation programmes to reduce these discrepancies [8,9]. Moreover, FCEs are used to determine fitness-for-work, and may facilitate the return-to-work process or prelude case closure [10,11].
Functional capacity (FC) has been defined as the highest probable level of function that a person may reach in a domain at a given moment in a standardised environment [8]. Functional capacity is a multidimensional, bio-psychosocial construct, which means that FC is the result of biological and psychological abilities, positively or negatively influenced by personal and external (social) factors (e.g., test environment, education, family) [8,9]. No gold standard exists for the measurement of FC; therefore, validity must be determined by means of construct validity. Construct validity is the degree to which a particular measure relates to other measures in a way one would expect, i.e., in accordance with predefined hypotheses about the correlations or differences between the measures [10]. From a biological perspective, within the bio-psychosocial construct of FC, it can be expected that males are stronger than females and score higher on material handling and grip strength tests, and score similarly in postural tolerance and repetitive work tests [11,12]. From a psychological viewpoint, it can be hypothesized that in patients with WAD, FC correlates with self-reported pain and mental distress to a larger extent than in healthy workers [4,13]. However, the correlation between FC and mental distress is expected to be smaller than the correlation between FCE tests and other measures of functional ability and disability [9,14]. Additionally, the socio-cultural context may influence FC due to different cultural representations and expectations [15]. A study comparing FCE test results of patients with chronic low back pain (CLBP) in three different countries showed substantial differences between the study samples [16]. People from different ethnic backgrounds living in the same country reported musculoskeletal pain differently [17][18][19]. One can assume that FCE tests may result in differences between groups with different cultural backgrounds. However, this has not yet been studied.
For both clinicians and researchers, it is important to know how other measures are related to FC in order to understand what is measured by FCE tests. Because clinical decision-making is based on the results of FCE tests, sound clinimetric properties of FCE tests are required [20].
During the past decades, reliability and, to a lesser extent, validity and safety of FCEs have been studied predominantly in patients with CLBP [10,21] and in one study in healthy persons [13]. FCE validity research should also be conducted in other chronic health conditions such as patients suffering from WAD, because clinimetric properties may not be generalisable across health conditions [22] and cultural settings [23]. Many studies on the construct validity of FCE tests did not meet the requested quality criteria such as formulating an a priori hypothesis for the strength of correlation and adequate sample size [9]. Moreover, few FCE tests were able to demonstrate adequate validity in more than one study and more than one health condition area [24].
Hence, the aim of this study was to analyse the construct validity of the FCE test for a large sample of patients with WAD, from various cultural backgrounds, who did not return to work after injury onset and who received workers' compensation, using a priori defined hypotheses (Boxes A, B) in a cross-sectional design.

Subjects and Data Collection
Subjects from the German-speaking part of Switzerland were referred by occupational physicians or case managers of the workers' compensation insurance for an interdisciplinary rehabilitation assessment at the rehabilitation clinic in Bellikon (Switzerland). Subjects were insured by the Swiss Accident Insurance Fund (SUVA), the largest accident insurance in Switzerland, which covers injuries from occupational and non-occupational accidents for employed and non-employed subjects. Injured subjects receive compensation of up to 80 % of the previous salary, medical and vocational assistance up to a maximum of 2 years, and disability pensions for disability caused by an injury.
The reason for being referred to this assessment was that subjects had not regained full working capacity within 6-12 weeks after the initial injury, had surpassed expected injury healing times, or had plateaued with medical and other rehabilitative interventions. Inclusion criteria were neck pain due to a whiplash-associated injury according to the Québec Task Force (QTF) Classification of WAD, grade I (pain, stiffness, or tenderness without physical signs) or grade II (pain, stiffness, or tenderness with reduced range of motion and point tenderness); sufficient language skills to communicate with the assessors in German and to fill out questionnaires in German, Serbo-Croatian, Albanian, Italian, or Spanish (representing the largest immigrant groups in Switzerland) [25]; age 18-65 years; and willingness to participate. Exclusion criteria were a main musculoskeletal problem not in the head and neck region and comorbidity that considerably limited function, such as neurological deficits, rheumatoid diseases, fractures, tumours, osteoporosis, severe psychiatric disorders, pregnancy, and severe cardiac hypertension. All patients were asked to participate prior to the interdisciplinary assessment. Participants were informed that they would be allowed to withdraw their participation at any time without disclosing reasons and without consequences for their medical care. The study was performed in accordance with the ethical standards of the Declaration of Helsinki and ethical approval for this study was granted by the Medical Ethics Committee of the Canton Aargau (EK AG 2010/055).
Participants' characteristics were recorded prior to the FCE, and included age, gender, body mass index, marital status, education, native language, duration since injury, litigation, work capacity, education status, and physical work demands. After the determination of eligibility for inclusion in the study, patients filled out self-reported measures, i.e., questionnaires (30 min), and carried out FCE tests (20 min).

Measurements
The WAD FCE analysed in this study consisted of tests involving activities of the upper extremities and the neck region: hand grip strength (left and right), lifting waist to overhead, overhead work, and repetitive reaching, left to right and right to left (Appendix 1). The reliability of all four FCE tests is good to excellent and the tests are safe in WAD [26]. Participants were briefly instructed on how to perform each test. The evaluator first gave a single demonstration of each test. The lifting test was commenced with a light weight. Participants were then asked to perform the test to their maximum ability. The weights lifted were incrementally increased according to a participant's performance, using weights of 2.5 and 5 kg. To determine the level of physical effort, testers used observational criteria indicating physical demand [7]. Testing could be terminated for four reasons: the participant stopped because of, for example, pain; the observer deemed testing to have become unsafe based on biomechanical criteria; heart rate exceeded 85 % of the age-related maximum (220 minus age of the participant); or a predefined time limit was reached. If a participant stopped the lifting waist to overhead test before the criteria for the maximum level of demand were observed, the highest weight in kilograms that the patient was willing to lift five times was recorded.
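The heart-rate termination criterion is simple arithmetic and can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and the 40-year-old participant in the usage note is an invented example, not study data.

```python
def heart_rate_limit(age_years: float, fraction: float = 0.85) -> float:
    """Return the heart-rate ceiling in beats per minute.

    Uses the age-related maximum named in the text (220 minus age) and the
    85 % threshold. Function name and signature are illustrative assumptions.
    """
    age_related_max = 220 - age_years
    return fraction * age_related_max


def should_stop_for_heart_rate(heart_rate_bpm: float, age_years: float) -> bool:
    """True when the measured heart rate exceeds 85 % of the age-related maximum."""
    return heart_rate_bpm > heart_rate_limit(age_years)
```

For a hypothetical 40-year-old participant the age-related maximum is 180 beats per minute, so testing would stop once the heart rate rose above 153 beats per minute.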
Pain intensity was measured with an 11-point Numeric Rating Scale (NRS) ranging from no pain (0) to worst pain (10). The patient was asked to rate his momentary pain (''pain now''), his worst and his mildest pain during the last 7 days (''maximum pain'' and ''minimum pain'', respectively). The NRS is a commonly used scale with proven reliability and validity in patients with neck pain [27].
The Spinal Function Sort (SFS) was used to measure self-reported functional ability to perform work-related tasks and activities of daily life that involve the spine [28]. The SFS contains 50 drawings with simple verbal descriptions of activities of material handling (e.g. lifting a 10 kg milk-crate from eye-level to the floor), postural tolerance (e.g. wash dishes at a sink) and ambulation (e.g. push and pull a shopping cart). Participants rated functional ability for each activity from ''unable'' (0) to ''able'' (4). The SFS yields a single rating ranging from 0 to 200, with higher scores indicating higher or better abilities. The scores can be categorised according to the work demands as defined by the Dictionary of Occupational Titles (DOT) [29], allowing a comparison of self-reported functional abilities with work demands (sedentary to lifting weights of over 50 kg). Most patients can fill out the SFS in 10-15 min. The SFS has good reliability and high predictive validity for non-return to work in patients with back pain [14,30].
The Hospital Anxiety and Depression Scale (HADS) was used to assess the symptom severity of anxiety disorders and depression in non-psychiatric populations. The HADS consists of two scales, one for anxiety and one for depression (A and D scales, respectively). Each scale contains seven items, with each item rated from 0 (best) to 3 (worst). The scale scores are calculated by summing the responses to the items up to a maximum score of 21 points (severe case) per scale. Scale scores of between 8 and 10 identify mild, 11-15 moderate, and 16 or above severe cases of anxiety/depression. Good reliability and validity, and excellent screening properties have been reported for the use of the HADS in the general population and various clinical populations [33].
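The HADS scoring rules described above can be illustrated with a short sketch. The function names are hypothetical, and the label used for scores below 8 is an assumption, because the text only names the mild, moderate, and severe bands.

```python
def hads_scale_score(item_scores):
    """Sum seven item ratings (each 0-3) into one HADS scale score (0-21)."""
    if len(item_scores) != 7:
        raise ValueError("each HADS scale has seven items")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("each item is rated from 0 to 3")
    return sum(item_scores)


def hads_severity(scale_score):
    """Map a scale score to the severity bands given in the text."""
    if not 0 <= scale_score <= 21:
        raise ValueError("scale score must lie between 0 and 21")
    if scale_score >= 16:
        return "severe"
    if scale_score >= 11:
        return "moderate"
    if scale_score >= 8:
        return "mild"
    # Assumption: the text does not label scores below 8.
    return "below cut-off"
```

The anxiety and depression scales are scored separately, so each participant receives two scale scores and two severity labels.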

Construct Validation: Hypothesis Testing
Twenty-five hypotheses on the strength of the association of FCE tests and the additional construct variables were formulated a priori. The theoretical basis for the hypotheses is explained in the introduction. Hypotheses were inferred from previous studies of patients with chronic low back pain: it was expected that WAD FCE is correlated to a higher extent with measures of perceived ability and disability than with measures of mental distress or pain [9,14,34]. The strength of the association is expressed as the absolute value of the correlation coefficient. Of the 25 hypotheses, 20 concerned the relationship between the four FCE tests and five other construct variables (displayed in Box B). The remaining five concerned the two groups with different cultural backgrounds: four hypotheses stated that FCE test results would differ significantly between the two groups, and one hypothesis stated that no major differences in correlations between FCE tests and construct variables exist between the two groups (effect size (ES) of the difference between correlation coefficients <0.2). An ES ≤0.20 for the difference between two correlations was defined as small [35]. The two groups with different cultural backgrounds were characterized based on the native language (i.e., the mother tongue) of the participants.

Data Analysis
Normal distribution was visually assessed using P-P plots. Floor and ceiling effects were considered to be present if more than 15 % of participants achieved the lowest or highest possible score of the overhead working test [37]. The overhead working test was expected to display ceiling effects because the test was limited to a maximum of 5 min.
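The 15 % rule for floor and ceiling effects can be sketched as below. This is a minimal illustration with hypothetical function names; the holding times in the usage note are invented, and the lowest possible score of 0 s is an assumption.

```python
def proportion_at_score(scores, extreme_score):
    """Share of participants whose score equals the given extreme score."""
    return sum(1 for s in scores if s == extreme_score) / len(scores)


def floor_and_ceiling_effects(scores, lowest, highest, threshold=0.15):
    """Return (floor_present, ceiling_present) using the >15 % rule from the text."""
    floor_present = proportion_at_score(scores, lowest) > threshold
    ceiling_present = proportion_at_score(scores, highest) > threshold
    return floor_present, ceiling_present


# Hypothetical overhead working times in seconds; the test is capped at 300 s.
holding_times = [300, 300, 300, 300, 120, 180, 210, 250, 90, 60]
floor_present, ceiling_present = floor_and_ceiling_effects(
    holding_times, lowest=0, highest=300
)
```

In this invented sample, 4 of 10 participants reach the 300 s cap, so a ceiling effect would be flagged and no floor effect.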
Associations were calculated using the Pearson correlation coefficient for bivariate normally distributed data, or else the Spearman rank correlation coefficient. For relationships between gender and overhead working, and repetitive reaching, respectively, equivalence testing was performed [38]. Equivalence is established if the 90 % confidence interval of the difference between genders falls within the 10 % margin of difference [38]. To analyse differences between genders and between the two groups with different cultural backgrounds, an independent sample t test, a Mann-Whitney U test, a χ² test, or linear regression was used as appropriate. The validity of the WAD FCE was considered confirmed when no ceiling or floor effects were observed in the FCE tests and the majority (80 %) of the 29 a priori hypotheses were confirmed [39]: four hypotheses concerning the relationship between FCE tests and gender, 20 hypotheses concerning the associations of the FCE tests and the other construct variables, and five hypotheses concerning the two groups with different cultural backgrounds. Validity across the two groups was considered confirmed when significant differences in FCE test results emerged between the two groups in all 4 comparisons, and the ES for differences in correlations between FCE tests and the five construct variables between both groups was ≤0.2 in 16 or more of the 20 comparisons. The ES for differences between correlations of the two groups was calculated by subtracting the Z score of the non-German native language group from the Z score of the German native language group. Z scores were calculated as follows: Z = 0.5 × ln[(1 + r)/(1 − r)], where r is the correlation coefficient between an FCE test and a reference measure [35]. p < 0.05 was used as a cut-off, indicating statistical significance. For readability, the terms confirmed/not confirmed were used instead of not rejected/rejected to indicate the interpretation of the results concerning the hypotheses. Methodologically, the terms not rejected/rejected are more correct. All analyses were performed using SPSS (Statistical Package for Social Sciences, Version 21, IBM Corp.).
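The effect size for the difference between two correlations can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the order of subtraction (German minus non-German group) is one reading of the description above, and the correlation values in the usage note are invented rather than taken from Table 5.

```python
import math


def fisher_z(r):
    """Z score of a correlation coefficient: 0.5 * ln((1 + r) / (1 - r))."""
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return 0.5 * math.log((1 + r) / (1 - r))


def correlation_effect_size(r_german, r_non_german):
    """ES as the difference between the two groups' Z scores.

    Assumption: German group minus non-German group.
    """
    return fisher_z(r_german) - fisher_z(r_non_german)


def is_small_effect(effect_size, cutoff=0.20):
    """True when the absolute ES does not exceed the 0.20 cut-off."""
    return abs(effect_size) <= cutoff


# Hypothetical correlations for one FCE test and one reference measure.
es = correlation_effect_size(0.50, 0.35)
```

With these invented values the ES is roughly 0.18, which would count as a small difference under the 0.20 cut-off. The absolute value is used because the sign only reflects which group has the larger correlation.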

Participants
From January 2011 to January 2012, 428 patients were referred for interdisciplinary assessment due to delayed recovery after musculoskeletal injury. Of the referred patients, 114 were not eligible: 79 (69 %) because the main problem was not in the neck and head region; 17 (15 %) had insufficient German language skills to communicate with the assessors or were not able to fill out the questionnaires in the language versions available; 5 (5 %) had acute comorbidity that limited testing, such as fracture or severe psychiatric disorder; 2 (2 %) were pregnant; 6 (5 %) were excluded due to other medical reasons; 3 (3 %) due to age under 18 or over 65 years; and 2 (2 %) were of grade III-IV by QTF criteria.
In total, 314 patients fulfilled the inclusion criteria and participated in this study. The participants' characteristics are presented in Table 1. Participants' characteristics were analysed in two groups with cultural differences, n = 152 (48 %) participants with German as their native language and n = 162 (52 %) with a non-German language as their native language. Significant differences between the groups were observed in 8 out of 10 main participant characteristics (Table 1). In five self-reported measures (Table 1), significant differences were found between the two groups.

Descriptive Analysis of FCE Test Results
Normal distribution was found in three out of four FCE tests, i.e., lifting waist to overhead, hand grip strength (right), and repetitive reaching (right). A ceiling effect was observed in the overhead working test with 38 % (n = 119) of the participants reaching the maximum time limit of 300 s. Between the two language groups and genders, the differences in FCE tests were significant in six out of eight comparisons (Table 2). There was no significant interaction between gender and language.

Construct Validation: Known Groups
As presented in Table 3, men had a significantly greater hand grip strength (+17.5 kg), and lifted significantly more weight overhead (+3.7 kg). Differences between genders were −7.4 s in the overhead working test and −8.2 s in the repetitive reaching test. The 10 % margin of differences between genders for overhead working was 18.5 s (90 % CI −26.2 to 11.4) and for repetitive reaching 8.8 s (90 % CI 3.2 to 13.2). The 90 % CI did not fall within the 10 % margin, thus non-equivalence could not be ruled out. Two out of four gender-related hypotheses were confirmed.
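The equivalence decision reported here can be checked with a short sketch. The function name is hypothetical, and treating the 10 % margin as a symmetric interval around zero is an assumption; the margins and confidence limits are the values given in the text.

```python
def ci_within_margin(ci_lower, ci_upper, margin):
    """True when the whole confidence interval lies inside [-margin, +margin].

    Assumption: the equivalence margin is symmetric around zero.
    """
    return -margin <= ci_lower and ci_upper <= margin


# Margins and 90 % CIs as reported in the text (seconds).
overhead_working_equivalent = ci_within_margin(-26.2, 11.4, margin=18.5)
repetitive_reaching_equivalent = ci_within_margin(3.2, 13.2, margin=8.8)
```

Both checks return False: the lower limit for overhead working and the upper limit for repetitive reaching lie outside the respective margins, so equivalence between genders is not established for either test.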

Construct Validation: Hypothesis Testing
Correlations between the FCE tests and pain, perceived functional ability, disability, anxiety, and depression are presented in Table 4. For each of the FCE tests, four out of five hypotheses were confirmed. Correlations for the two language groups between the four FCE tests and the reference measures are presented in Table 5. Eighteen out of 20 ES were ≤0.20 (ranging from 0.01 to 0.16). In two comparisons, the ES for the difference in correlations between groups with different cultural backgrounds was >0.20 in absolute value: −0.21 for lifting waist to overhead and the SFS, and 0.22 for lifting waist to overhead and HADS anxiety (ES data available from the author on request). The hypothesis on the validity of FCE tests in patients with cultural differences was confirmed because ES were ≤0.20 in 18 of 20 comparisons.
To summarize, from a total of 29 a priori hypotheses, 23 (79 %) were confirmed (for an overview see Appendix 2).
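The overall tally can be reproduced with a few lines. The counts per category are those reported in the text; the variable and function names are illustrative, and the 80 % threshold is the confirmation criterion stated in the Data Analysis section.

```python
# Confirmed and tested hypotheses per category, as reported in the text.
hypotheses = {
    "gender": (2, 4),
    "other constructs": (16, 20),
    "cultural groups": (5, 5),
}


def confirmation_rate(results):
    """Return (confirmed, tested, proportion confirmed) across all categories."""
    confirmed = sum(c for c, _ in results.values())
    tested = sum(t for _, t in results.values())
    return confirmed, tested, confirmed / tested


confirmed, tested, rate = confirmation_rate(hypotheses)
threshold_met = rate >= 0.80
print(f"{confirmed}/{tested} confirmed ({rate:.0%}); threshold met: {threshold_met}")
# prints "23/29 confirmed (79%); threshold met: False"
```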

Discussion
The aim of the study was to analyse construct validity of FCE tests for application in patients on workers' compensation due to WAD grade I and II across groups with cultural differences (defined as the native language of the participant). Twenty-three out of 29 (79 %) instead of the expected 80 % of the a priori defined hypotheses were confirmed. Confirmed were 2 out of 4 gender-related hypotheses, 5 out of 5 culture-related hypotheses, and 16 out of 20 construct-related hypotheses (overview in Appendix 2). Differences in correlations between the groups with cultural differences were small.
The results of the study support the bio-psycho-social construct of FCE in WAD: we observed differences between males and females (bio), between language groups (socio), and small but consistent relationships with psychological factors (psycho). The gender differences in FCE tests in this study are consistent with the results of others [11]. Differences in test results, but not in correlations, were observed between language groups. The non-German language group consisted of individuals from the largest immigrant groups in Switzerland [25]. In this study, 52 % of participants had a non-German native language, which is higher than the 18 % in the Swiss population [25]. The proportion of male participants in the non-German group in this study was similar (47.6 %) to that of the Swiss working population (51 %) [25], but higher than usually reported in WAD [1]. These differences may be explained by the fact that the study participants were insured by SUVA, which insures many companies from the industry and construction sector, where the rate of male, non-German speaking subjects is higher than in the other business sectors [40]. Many immigrants have been naturalised to Swiss citizenship, hence native language was chosen as an indicator for cultural differences. Native language has been reported as a valid indicator for cultural differences [41]. A study on the coping styles of patients with low back pain found large differences among groups with different native languages in Switzerland [42].
To test construct validity, associations were made with other constructs known to be associated with FCE outcomes. In two out of four instances, the associations between gender and FCE outcomes occurred as hypothesized. Although differences were small in the overhead working and repetitive reaching tests, equivalence between genders could not be established. We expected no difference between genders, because for these tests muscle force is not likely to be the primary factor determining the outcome. In the healthy population, conflicting evidence for the difference between genders in dexterity performance tests has been reported [12,43,44]. Results in fine manual dexterity tests may be influenced by finger size; smaller fingers were related to better outcomes [45]. This might be a plausible explanation of the results of this study.
In patients with CLBP, moderate correlations between FCE and SFS [14], and between FCE and other self-reported measures of disability were reported [9]. In this study, FCE correlated more strongly with SFS (moderate correlations) than with the Neck Disability Index (NDI; weak correlations). There could be several explanations for this. Firstly, the items of the SFS more closely resemble the items of the FCE than those of the NDI do. Secondly, inconsistent wording of the NDI items concerning the influence of pain on activity levels may partly explain the results. Thirdly, while our hypothesis was based on the majority of the studies in CLBP where the relationship between FCE and self-reported disability was moderate, this relationship may be slightly different in patients with WAD or when using the NDI. Additionally, there may have been unknown sample characteristics contributing to these differences. The strengths of the correlations between FCE and psychological variables in patients with WAD appear higher compared with CLBP patients [9]. This may be consistent with the relevance of psychological factors in WAD [3,46]. We compared our results with a recently published study with 40 patients with WAD from the Netherlands [47]. On average, the Dutch sample was younger (mean 33 years, SD 9.6), included more females (55 %), and had a longer duration since whiplash injury (median 12 months). While the results of the repetitive reaching test were similar between the two samples (mean difference 2 s), the difference in the weight lifted from waist to overhead between the Dutch and the Swiss patients with WAD was substantial (the Dutch lifted a mean of 12.2 kg more). The differences between the studies might be explained by sample variation, since the sample in the Dutch study was small, but these differences need further investigation. Nevertheless, they are consistent with a study that reported large differences in FCE outcomes between different countries in patients with low back pain [19]. The strengths of the correlations between NDI and lifting waist to overhead and overhead working were similar in the Dutch and the Swiss WAD samples, suggesting some robustness of the results between study samples from different countries. In short, these findings underline the importance of replication of validation studies among different (social security) contexts.
Some potential limitations have to be addressed. The study population consisted of injured workers who did not return to work within the first 6-12 weeks, for whom recovery had plateaued, and who were referred by the case manager or occupational physician. The validity of WAD FCE should also be established in other WAD patients outside the workers' compensation setting, in general practice or in more chronic WAD patients (in rehabilitation settings). Moreover, the a priori defined hypotheses were based on previous studies performed in populations other than WAD. Most studies reported conflicting evidence on many FCE-related factors [9], so cut-offs for the strength of the correlation were arbitrarily chosen. Additionally, if other measures for construct validation had been used, the results might have been different. In this study, self-reported measures were used, which are related to, but distinct from, physical capacity [48][49][50].
In the overhead working test, a ceiling effect was found in 38 % of the participants, as reported for healthy subjects and CLBP patients [51,52]. It was not expected that such a high proportion of patients with WAD would reach the time limit of 300 s, because one could suppose a reduced postural tolerance in the neck and upper limbs. For future research, we suggest modifying the overhead working test by having the subject wear two cuff weights of 1 kg each around their forearms to reduce ceiling effects, as described for healthy subjects [53].
The strengths of this validation study of FCE for WAD patients were the use of a priori defined hypotheses in the analyses, allowing transparency and explicitness. Several comparisons could therefore be made to a variety of constructs, enabling the reader to interpret the validity from different points of view. Additionally, the design and the sample size of the current study meet the proposed quality standards for FCE validation studies [22]. Moreover, patients with different cultural backgrounds participated in our study, unlike previous FCE studies in which languages or cultural differences were not reported [9]. To our knowledge, this has not been the subject of a study in a setting similar to ours (validation of FCE tests). Although replication is needed, the results of this study support the validity of the WAD FCE in patients with different native languages (i.e., cultural backgrounds).

Conclusion
The construct validity was confirmed for the majority of FCE tests for testing functional capacity in patients with WAD with cultural differences and in a workers' compensation setting. Additional validation studies in other settings are needed for verification.
Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.

Isometric Hand Grip Strength
Isometric hand grip strength was measured in a seated position. The subjects held their shoulder adducted without internal or external rotation, elbow flexed at approximately 90° and the forearm and wrist in neutral position. Grip strength of the right hand was measured in a three-trial procedure using a hand dynamometer in a single handgrip position adapted to the hand size of the subject (Jamar PC 5030, Preston Corporation, 1994). The average was recorded in kilogram-force.

Lifting Waist to Overhead Test
Lifting waist to overhead was measured during 5 lifts of the crate from table to crown height and vice versa within 90 s in standing position. The test was executed with a wooden crate (40 × 30 × 26 cm) of 2.5 kg. Weight increments of 2.5 or 5 kg each were used until the maximum amount of weight was reached. Maximum performance was recorded in kg.

Overhead Work Test
Overhead working was performed standing with hands at crown height for manipulation of nuts and bolts. The ceiling of the test was 5 min. The time that the position was held was recorded (s).

Repetitive Reaching Test
Repetitive reaching was determined by fast horizontal movements of the upper extremity in a sitting position. Marbles were removed from bowls at arm length distance at table height from left to right and vice versa, with the right arm. The time taken to remove 30 marbles was recorded (s).