The Work-ability Support Scale: Evaluation of Scoring Accuracy and Rater Reliability

Purpose: The Work-ability Support Scale (WSS) is a new tool designed to assess vocational ability and support needs following onset of acquired disability, to assist decision-making in vocational rehabilitation. In this article, we report an iterative process of development through evaluation of inter- and intra-rater reliability and scoring accuracy, using vignettes. The impact of different methodological approaches to analysis of reliability is highlighted.
Methods: Following preliminary evaluation using case-histories, six occupational therapists scored vignettes, first individually and then together in two teams. Scoring was repeated blind after 1 month. Scoring accuracy was tested against agreed ‘reference standard’ vignette scores using intraclass correlation coefficients (ICCs) for total scores and linear-weighted kappas (kw) for individual items. Item-by-item inter- and intra-rater reliability was evaluated for both individual and team scores, using two different statistical methods.
Results: ICCs for scoring accuracy ranged from 0.95 (95 % CI 0.78–0.98) to 0.96 (0.89–0.99) for Part A, and from 0.78 (95 % CI 0.67–0.85) to 0.84 (0.69–0.92) for Part B. Item-by-item analysis of scoring accuracy, inter- and intra-rater reliability all showed ‘substantial’ to ‘almost perfect’ agreement (kw ≥ 0.60) for all Part-A and 8/12 Part-B items, although multi-rater kappa (Fleiss) produced more conservative results (mK = 0.34–0.79). Team rating produced marginal improvements for Part-A but not Part-B. Four problematic contextual items were identified, leading to adjustment of the scoring manual.
Conclusion: This vignette-based study demonstrates generally acceptable levels of scoring accuracy and reliability for the WSS. Further testing in real-life situations is now warranted.


Introduction
Broadly defined, vocational rehabilitation is "anything that helps someone with a health problem to stay at, return to or remain in work" [1]. Following illness or injury, vocational rehabilitation has an important role in assisting return to work for those who are able, and withdrawal from work for those who are unable to continue in their previous employment. An important requisite for both these tasks is the accurate assessment of work-ability.
Work-ability is a concept that can be broadly defined as "the match between the physical, mental, social, environmental, and organisational demands of a person's work and his or her capacity to meet these demands" [2] (p 1173). Measurement of work-ability therefore requires consideration of a range of factors, including physical ability to perform tasks, ability to cope with the cognitive/communication demands of the job, and to function appropriately in the social and environmental context of the work.
Although a number of measurement tools have been developed for work-ability, our recent review of these highlighted several limitations that restrict their use in clinical practice [2]. Despite the multifactorial nature of work-ability, the majority of measures focus predominantly on 'physical' ability to do tasks. They rarely take into account the role of contextual factors or key stakeholders, and few tools have actually been developed with the intention of assisting rehabilitation planning.
The Work-ability Support Scale (WSS) is a new measure that has been developed as part of a long-standing international collaboration between the United Kingdom (UK) and New Zealand (NZ). We set out to develop a tool which would not only cover all the key factors that contribute to work-ability, but would also provide a practical resource for clinicians to use for planning vocational rehabilitation/support in the course of routine practice. An overview of the conceptualisation, design and development is being presented for publication elsewhere. In brief, it included a mixed methods design incorporating: (a) a conceptual review; (b) qualitative work to inform the provisional structure of the tool, item definition, scoring framework, and the manual for training and score derivation; and (c) quantitative evaluation of psychometric properties.
The conceptual review is published [2] and the qualitative work underpinning item generation and evaluation of utility and usability is described in more detail elsewhere. In this paper, we describe an evaluation of scoring accuracy and of intra- and inter-rater reliability as part of (c).

Setting and Design
Development of the WSS involved an iterative process of testing and refinement. To extend the eventual generalizability and utility of the tool, this was conducted in two different health cultures and service settings: a local community-based vocational rehabilitation setting in New Zealand and a tertiary, post-acute, primarily hospital-based rehabilitation service in the UK. Two rounds of evaluation were undertaken during that process.
• A preliminary round of inter-rater reliability testing undertaken in New Zealand (round 1) utilising case histories, led to the identification and adjustment of weaker items within the tool.
• Following revision based on the results of the preliminary round and the development of a set of test vignettes with reference standards, further evaluation was undertaken in the UK (round 2) as the penultimate stage in its production, to assess scoring accuracy as well as inter- and intra-rater reliability.

The Work-ability Support Scale (WSS)
The WSS is a tool designed to:
• Assess the individual's ability to work and support needs in the context of their normal work environment, following the onset of acquired disability.
• Support decision-making with regard to vocational rehabilitation.
It encompasses the complexity of physical, cognitive and behavioural challenges that are typically associated with neurological disability. However, it also has application in the more general context of work-related disability.
In its clinical application, the WSS is intended to be applied by a clinician on the basis of direct observation and interview with managers/co-workers in the course of a work-based vocational assessment. Alternatively, it may be applied as part of screening at an earlier stage in recovery, to determine whether return to work is likely to be possible. In this case, rating would be based on the anticipated performance in the workplace, deduced from off-site assessment of function in relation to a description of the individual's work-based activities and job role. This type of application has been used to good effect in a number of work-planning scenarios in the UK setting, including:
• where withdrawal from work was considered the only appropriate option, and a timely decision was required to avoid the individual losing out on pension payments (the WSS identifying that the likely level of work support required would be unsustainable).
• where the individual and/or their family had difficulty accepting that return to their current job role was not a realistic option. Scoring of the WSS supported dialogue between the patient and team as a step towards accepting exploration of alternative work and life roles.
• where return to work was considered feasible, but a strong case had to be put forward to support an application for funding for vocational rehabilitation.
The conceptual design of the WSS was based on a 7-level scoring framework similar to that of the Functional Independence Measure (FIM) [3]. The FIM is the most widely used outcome measure for rehabilitation across the world, and this framework was chosen because clinicians are broadly familiar with this concept.
The WSS is divided into two main parts. Part A is a 16-item scale divided into three domains of work-related function:
• Physical function (five items)
• Thinking and communicating (five items)
• Social/behavioural function within the work-place (six items).
Each item is rated on a standardised ordinal 7-level scoring system ranging from 7 (completely independent) to 1 (totally unable), with the levels between reflecting an increasing requirement for help/support and the consequent decrease in work productivity. For example:
• At level 7, the individual manages all of that aspect of his/her work without help. He/she performs independently without undue effort, and without the need for job modification. He/she requires no more equipment or strategies than would be considered normal for the role. Work productivity is fully maintained.
• At level 1, he/she is effectively unable, or requires so much supervision/support that work productivity would be minimal.
Part B is a 12-item scale of contextual factors, relating to personal and environmental/support factors which may influence return to work. These are also divided into three domains:
• Personal factors
• Environmental factors
• Barriers to return to work.
Originally scored on a five-point scale, the scaling was adjusted after the first round of inter-rater testing to a simpler 3-point scale indicating the overall effect of the contextual factor (positive effect = +1, neutral or unknown effect = 0, negative effect = -1).
The final tool structure and scoring levels are given in the "Appendix". It is recommended that the WSS is always applied using the scoring manual to ensure scoring accuracy, and this may be downloaded freely from our website http://www.csi.kcl.ac.uk/wss.html.

Case Histories and Vignettes
In line with previous studies of this kind [4,5] we used case histories and test vignettes to describe various levels of independence in work-related function as the test material for the purpose of evaluation. The production of generalisable data on inter-rater reliability presents a challenge for tools that measure functional ability, because of the practical difficulty of assembling a large number of raters around an individual patient to observe the same task in real-life settings. Proxy materials, such as case histories, videos or vignettes are therefore frequently used for preliminary testing [6].
• A case history is a short monologue describing a range of different attributes and characteristics relevant to the domains of interest, from which the rater extracts the relevant information to apply a rating for each item in the measurement tool.
• A set of vignettes provides more concrete and specific descriptions of function, focussing on one item at a time, and fixing the level of ability within that item [7].
When several raters use case histories to apply a measure independently, variations in scoring levels may reflect (a) the information they extracted to rate a given item and (b) their interpretation of that information to fit within the cut-points of the scale. When vignettes are used, the information is more standardised, so that variation is attributable to (b) alone. In theory therefore vignettes should produce less inter-rater variation. However, this depends to some extent on how they are written, and on how closely the vignette description mirrors the language of the measurement tool.

Round 1: Preliminary Evaluation of Inter-rater Reliability
The first round of evaluation was undertaken in NZ in 2009/10, in the context of community-based services. The design of this phase was based around multiple raters scoring case histories, which were written up from clinical assessments.
Case Histories
A set of 30 case histories was written up by authors JF and KM from detailed notes taken during clinical work site assessments. The assessments were conducted by trained occupational therapists working in a vocational rehabilitation setting, and the clients of these assessments gave informed consent for notes from their assessment to be anonymised and written up as case histories for the research. Each case history was about 800-1,500 words and contained information about the individual, their clinical condition, their work role and a range of information about their working ability, relationship with colleagues and other specific information that would enable the rater to score each item. The case histories spanned a range of neurological and musculoskeletal problems of varying severity leading to a range of cognitive and/or physical disabilities. They also included a wide range of jobs and work roles in different contexts, including indoor and outdoor occupations.
Raters
Five occupational therapists were recruited from community-based services providing vocational rehabilitation. The raters were all experienced work-place assessors who had not previously been involved with the cases described in the histories.
All raters were new to the WSS, so they underwent a 4-h training session, which used a similar format to standard FIM training in New Zealand. It included an overview of the tool and scoring structure, orientation to the scoring manual, and practice cases which were worked through using the scoring manual and discussed in groups.
After the training, each rater scored the 30 case histories over a four-week period, again using the scoring manual. The cases were randomly numbered to avoid any systematic bias, and each assessor was presented with the cases in a different order to avoid any effects based on the order of rating.
Raters gave a score for each item for each case history and made comments where they felt scoring decisions were difficult to make or item descriptions were ambiguous. These comments were subsequently analysed to identify remaining ambiguities in item description and scoring instructions. Based on this feedback, revisions were made to the affected items and further inter-rater testing with four raters was conducted on the modified items only.

Statistical Handling
Inter-rater agreement was tested using the multi-rater method described by Fleiss [8]. Kappa coefficients for multiple raters (mK) were calculated using the Statistical Analysis software (SAS®) macro MKappa [9].
Table 1 shows an item-by-item analysis of inter-rater agreement. The majority of Part A items showed moderate to substantial agreement. In response to feedback, 'communication' was divided into two items (written and verbal); significant changes were made to three other items, and there was also some re-grouping within the subscales. However, the contextual items showed only fair agreement. Discussion indicated that the items in the contextual factors domain were too broad and very difficult to score, so a substantive restructuring of that part of the scale was undertaken. Following these revisions, five modified work functioning items and seven new contextual factors items were re-tested for inter-rater agreement, with modest improvements demonstrated. However, agreement was still only moderate for the contextual items and further adjustments were made, expanding the number of items and simplifying the rating to just three scoring levels indicating positive, neutral or negative impact.

Round 2: Evaluation of Scoring Accuracy, Inter- and Intra-rater Agreement
This further round of evaluation was undertaken in the UK in January 2011, during the penultimate stages of development, once the structure of the tool had stabilised. Conducted with primarily hospital-based clinicians, it sought to address the following: (a) scoring accuracy of individuals and teams against the set of reference standard scores, (b) inter-rater reliability for both individual and team scores, and (c) intra-rater reliability for individual and team scores rated on two occasions 1 month apart.

Vignette Development
A possible weakness of case histories such as those used in round 1 is that they contain a large amount of information requiring considerable concentration and retention on the part of the rater. Scoring differences may arise from the raters using different information from within the history to judge the level for a particular item. For round 2 in the UK we therefore used a more targeted vignette-based approach, analogous to the 'case studies' that are used for training and accreditation of the FIM in the US, Australia and the UK [10]. A preliminary description is given of each hypothetical case, followed by a brief description in 50-100 words of their work-related function under each of the item headings in the WSS. This enables the vignette writers to ensure that each item is tested across the range of scores. A series of vignettes was drawn up by authors KM and LTS. In order to mimic the complexity of cases seen in clinical practice [4], they were designed to represent a range of difficulty for scorers: some led to a clearly evident score when referring to the manual, and others were less clear, requiring the rater to decide between two or more possible scoring levels. During development of the vignettes, the two authors first rated them independently, and then conferred to agree a 'correct' or reference standard rating to be assigned to each vignette.
When designing the study, consideration was given to rater burden among practising clinicians and the feasibility of rating large numbers of vignettes within an acceptable time allocation. The final set of 196 vignettes used for this evaluation related to 7 case studies: 7 × 16 = 112 vignettes for Part A items and 7 × 12 = 84 for contextual items. The item scores were purposively chosen to provide good coverage of the range of possible scores for each item.

Vignette Rating
Six clinicians took part in the study. No specific training was provided, but by this time the tool had been in routine clinical use within this unit for some years. To be included, participants were required: (a) to have clinical experience in neurological rehabilitation focussed on work-related function, (b) to have some experience with rating the WSS, and (c) to be available for the two rating occasions 1 month apart.
The six raters were all affiliated to a large regional specialist neurorehabilitation unit spanning hospital and community outreach services in London, UK. All raters were occupational therapists, but were selected to represent a range of experience, both clinically and in use of the WSS, i.e. we included both senior and junior staff. They were organised into two teams, again representing a range of experience, in an attempt to mimic the pattern of scoring ability normal in clinical practice.
On the first occasion (test time 1), each clinician rated the vignettes individually, without conferring, but with reference to the scoring manual. As in round 1, the vignettes were presented to each of the raters in a different order. The following week, they met to score the vignettes as a team. This process was repeated 1 month later (test time 2), leaving sufficient time to limit recall bias.
According to the manual, if there is disagreement between team members when rating as a team, the lower score is recorded (as is also the convention on rating the FIM).
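The team-scoring convention can be illustrated with a short Python sketch. This is illustrative only and not part of the WSS manual: the function name `team_score` is hypothetical, and it assumes that each team member's proposed score for an item is available as an integer.

```python
def team_score(member_scores):
    """Return the score recorded for one item when rated by a team.

    If all members agree, the agreed score is returned; if they disagree,
    the lower (lowest) proposed score is recorded, following the
    convention described in the text.
    """
    if not member_scores:
        raise ValueError("at least one member score is required")
    # min() covers both cases: a unanimous score, or the lowest on disagreement.
    return min(member_scores)


# Hypothetical example: three team members propose scores for one Part A item.
unanimous = team_score([6, 6, 6])   # all agree
disputed = team_score([5, 4, 5])    # disagreement, so the lower score is kept
```

Because the lower score is always recorded on disagreement, team scores can never exceed the highest individual proposal, which tends to pull team ratings downwards relative to individual ratings.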

Data Handling and Analysis
The literature contains many different approaches to the testing of agreement between and within raters, and as yet no universal approach has emerged.
• The percentage of agreement between raters provides a simple descriptive analysis, but can be misleading as it does not take into account the extent of agreement that is simply due to chance.
• Cohen's kappa was introduced to adjust for chance agreement [11], but un-weighted kappas do not account for the 'degree of disagreement', where disagreement of one category may be acceptable but wider disagreements may not.
• Weighted kappa coefficients were introduced in the late 1960s to provide partial credit for scaled disagreement [12] and are recommended by the Medical Outcomes Trust to evaluate agreement between raters for ordinal scales [13]. This is particularly relevant for long ordinal scales with more than five or six scoring levels per item.
• Cohen's kappa coefficients, however, test agreement between a single pair of raters. Fleiss (1971) proposed a method for generalisation of the kappa statistic to the measurement of agreement among multiple raters [14], although this un-weighted method does not take any weighting of disagreements into account.
• An alternative approach used by some authors is to treat each combination of raters as a separate data pair [4,5]. This means, however, that each kappa coefficient represents multiple pairwise comparisons, thus effectively representing an average across the group. This is a potential statistical limitation, especially if the data are unbalanced.
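To make the difference between un-weighted and linear-weighted kappa concrete, the following is a minimal Python sketch of Cohen's kappa for a single pair of raters. It is not the Stata routine used in the study; the function name `cohen_kappa` and the example ratings are hypothetical. Linear weights are assumed to penalise a disagreement in proportion to the number of categories separating the two ratings.

```python
from collections import Counter


def cohen_kappa(r1, r2, categories, weighted=False):
    """Cohen's kappa for two raters; linear disagreement weights if weighted=True."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("ratings must be non-empty and of equal length")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")
    idx = {c: i for i, c in enumerate(categories)}
    n = len(r1)
    observed = Counter((idx[a], idx[b]) for a, b in zip(r1, r2))
    marg1 = Counter(idx[a] for a in r1)
    marg2 = Counter(idx[b] for b in r2)

    def weight(i, j):
        # Disagreement weight: 0 on the diagonal, growing with distance if weighted.
        if weighted:
            return abs(i - j) / (k - 1)
        return 0.0 if i == j else 1.0

    # Observed and chance-expected weighted disagreement.
    d_obs = sum(weight(i, j) * c / n for (i, j), c in observed.items())
    d_exp = sum(weight(i, j) * marg1[i] * marg2[j] / n ** 2
                for i in range(k) for j in range(k))
    if d_exp == 0:
        raise ValueError("kappa is undefined when no disagreement is expected")
    return 1 - d_obs / d_exp


# Hypothetical ratings on a 3-level scale; the raters differ by one level once.
rater_a = [1, 2, 3, 1]
rater_b = [1, 2, 3, 2]
k_unweighted = cohen_kappa(rater_a, rater_b, [1, 2, 3])
k_linear = cohen_kappa(rater_a, rater_b, [1, 2, 3], weighted=True)
```

In this example the single disagreement is between adjacent levels, so the linear-weighted coefficient is higher than the un-weighted one, which is the 'partial credit' effect described above.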
In this round, we evaluated scoring accuracy, intra-rater reliability, and inter-rater reliability.
The dataset comprised a total of 42 individual ratings (6 raters × 7 cases) and 14 team ratings (2 teams × 7 cases) at each of the test times 1 and 2.
• Scoring accuracy was evaluated through rater agreement with the reference standard scores. Data were pooled from test times 1 and 2 to generate n = 84 individual paired ratings for each item, and n = 28 team ratings per item.
• Inter-rater reliability was evaluated at test time 1 only, testing agreement between all possible combinations of rater pairs for individuals and teams. For individual raters, 15 possible pairings generated 105 pairs (15 × 7). As there were only two teams, team ratings generated just n = 7 pairs.
• Intra-rater reliability was evaluated for both individual and team scores between paired ratings at test times 1 and 2, giving n = 42 individual and 14 team ratings per item.
WSS total scores were compared using intraclass correlation coefficients.
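The sample sizes quoted for each analysis follow directly from the study design. As a check, the arithmetic can be reproduced in a few lines of Python; the variable names are ours and purely illustrative.

```python
from itertools import combinations

N_RATERS, N_TEAMS, N_CASES, N_TIMES = 6, 2, 7, 2
PART_A_ITEMS, PART_B_ITEMS = 16, 12

# Vignettes: one per item per case study.
vignettes_part_a = N_CASES * PART_A_ITEMS            # 112
vignettes_part_b = N_CASES * PART_B_ITEMS            # 84
vignettes_total = vignettes_part_a + vignettes_part_b  # 196

# Ratings per item at each test time.
individual_per_time = N_RATERS * N_CASES             # 42
team_per_time = N_TEAMS * N_CASES                    # 14

# Scoring accuracy: data pooled across both test times.
accuracy_individual = individual_per_time * N_TIMES  # 84
accuracy_team = team_per_time * N_TIMES              # 28

# Inter-rater reliability: every possible pair of raters (or teams) per case.
rater_pairs = len(list(combinations(range(N_RATERS), 2)))  # 15
inter_individual = rater_pairs * N_CASES             # 105
team_pairs = len(list(combinations(range(N_TEAMS), 2)))    # 1
inter_team = team_pairs * N_CASES                    # 7

# Intra-rater reliability: each time-1 rating paired with its time-2 repeat.
intra_individual = individual_per_time               # 42
intra_team = team_per_time                           # 14
```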
For item-by-item analysis we used a number of the approaches described above.
• For all comparisons, we report descriptive statistics in terms of percentage of absolute agreement. This provides the opportunity to compare individual and team ratings, even though, at n = 7 and n = 14 respectively, the numbers of team ratings were too small for statistical analysis of inter- and intra-rater reliability.
• We also report agreement ±1 level for Part A (which includes 7 possible scoring levels), but not for Part B (which includes only 3 levels).
• For scoring accuracy and intra-rater reliability we report weighted kappas.
• For inter-rater reliability we report both weighted kappas between all pair combinations and also Fleiss's kappa for multiple raters.
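The two descriptive statistics (absolute agreement and agreement ±1 level) can be sketched as a single Python function with a tolerance parameter. The function name `percent_agreement` and the example ratings are hypothetical and are not taken from the study data.

```python
def percent_agreement(r1, r2, tolerance=0):
    """Percentage of paired ratings that differ by no more than `tolerance` levels.

    tolerance=0 gives absolute agreement; tolerance=1 gives agreement +/- 1 level.
    """
    if len(r1) != len(r2) or not r1:
        raise ValueError("ratings must be non-empty and of equal length")
    hits = sum(1 for a, b in zip(r1, r2) if abs(a - b) <= tolerance)
    return 100.0 * hits / len(r1)


# Hypothetical Part A ratings (7-level scale) from two raters on four vignettes.
rater_a = [7, 6, 5, 4]
rater_b = [7, 5, 5, 2]
absolute = percent_agreement(rater_a, rater_b)                 # exact matches only
within_one = percent_agreement(rater_a, rater_b, tolerance=1)  # allow +/- 1 level
```

Neither figure corrects for chance agreement, which is why they are reported alongside, rather than instead of, the kappa statistics.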
Linear-weighted Cohen's kappa statistics were computed using Stata version 12.1 (Stata Corp., 2012), and interpreted according to Landis and Koch's classifications [15]. The 95 % CIs for these weighted kappa statistics were calculated using bootstrapping, employing the method and macro given by Reichenheim [16].
As with round 1, inter-rater agreement was tested using the multi-rater method described by Fleiss [8]. Kappa coefficients for multiple raters (mK) were calculated using the Statistical Analysis software (SAS®) macro MKappa [9].
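For reference, the Landis and Koch bands are conventionally quoted as: below 0 'poor', 0.00-0.20 'slight', 0.21-0.40 'fair', 0.41-0.60 'moderate', 0.61-0.80 'substantial' and 0.81-1.00 'almost perfect'. A minimal Python sketch of this lookup follows; the function name `landis_koch` is hypothetical, and the treatment of values falling between the published band edges (for example 0.605) is our assumption.

```python
def landis_koch(kappa):
    """Map a kappa coefficient to its conventional Landis and Koch label."""
    if kappa < 0:
        return "poor"
    # Upper bound of each band; values above 0.80 are 'almost perfect'.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical coefficients, not values from the study tables.
labels = [landis_koch(k) for k in (-0.05, 0.15, 0.30, 0.55, 0.75, 0.90)]
```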
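For readers unfamiliar with the multi-rater statistic, the following is a minimal Python sketch of the un-weighted Fleiss kappa. It is not the SAS MKappa macro used in the study; the function name `fleiss_kappa` and the example data are hypothetical, and the sketch assumes every subject is rated by the same number of raters.

```python
def fleiss_kappa(ratings, categories):
    """Un-weighted Fleiss kappa.

    `ratings` is a list of subjects; each subject is a list holding the
    category assigned by each rater (same number of raters per subject).
    """
    n_subjects = len(ratings)
    if n_subjects == 0:
        raise ValueError("at least one subject is required")
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("each subject needs the same number (>= 2) of ratings")

    # counts[i][j]: number of raters assigning subject i to category j.
    counts = [[row.count(c) for c in categories] for row in ratings]

    # Observed agreement: mean proportion of agreeing rater pairs per subject.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_subjects

    # Chance agreement from the overall proportion of ratings in each category.
    p_j = [sum(row[j] for row in counts) / (n_subjects * n_raters)
           for j in range(len(categories))]
    p_e = sum(p * p for p in p_j)
    if p_e == 1:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical example: three raters scoring three vignettes on a 3-level scale.
example = [[1, 1, 2], [2, 2, 2], [3, 3, 1]]
mk = fleiss_kappa(example, [1, 2, 3])
```

Because every disagreement counts equally regardless of its size, this statistic gives no partial credit for near-misses on an ordinal scale.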

Scoring Accuracy
Overall scoring accuracy was evaluated by ICCs comparing total subscale scores with reference standard ratings. ICCs for individual ratings for Part A and B were 0.95 (95 % CI 0.78-0.98) and 0.78 (0.67-0.85) respectively. The ICCs for team ratings were 0.96 (0.89-0.99) and 0.84 (0.69-0.92).
An item-by-item analysis of scoring accuracy in relation to reference standard scores is shown in Table 2 using linear-weighted kappa (kw). Agreement between the test ratings and reference standard scores was either in the 'substantial' (kw 0.71-0.78) or 'almost perfect' (0.81-0.94) range for all Part A items, confirming a high level of scoring accuracy for this part of the scale. Three of the 12 contextual items ('Employer contact', 'Employer flexibility', and 'Vocational support') achieved only moderate scoring accuracy (kw 0.53-0.60), and 'Legal issues' showed only slight agreement (kw 0.34 95 % CI 0.18-0.51) with the reference scores.
When vignettes were rated by a team, the scoring accuracy was marginally higher, achieving a mean 73 % (SD 15) absolute accuracy, compared with 66 % (SD 13) for individual ratings in Part A. For Part B, the mean percentage accuracy of team and individual ratings were similar (78 % (SD 11) and 79 % (SD 11) respectively).
The mean differences between the rater scores and the reference standard scores for the 16 Part A items were compared using paired t tests, taking p ≤ 0.003 as the threshold for significance to account for multiple tests. This showed that individual raters scored significantly higher than the reference standard on 9/16 items (other item differences being non-significant). When rating in groups, however, scores were significantly higher for only one item (all other item differences again non-significant). This suggests that (a) discussion assisted more accurate scoring (in relation to the reference scores) and (b) raters were following the manual instruction to record the lower score where group members disagreed.
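The significance threshold used here is consistent with a Bonferroni-style adjustment of a conventional 0.05 level across the 16 Part A items. The text does not name the correction, so this reading is an assumption; the short Python sketch below simply shows the arithmetic, with a hypothetical function name.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni adjustment."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Assumed family-wise alpha of 0.05 spread across the 16 Part A items.
threshold = bonferroni_threshold(0.05, 16)
threshold_rounded = round(threshold, 3)
```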

Inter-rater Reliability
Inter-rater reliability is summarised in Table 3.
Using linear-weighted kappas, agreement between individual raters ranged from kw 0.63-0.90 for all Part A items confirming a high level of inter-rater reliability. In Part B, 'Employer contact' and 'Vocational support' again achieved only moderate levels of agreement (0.41 and 0.22 respectively) and 'Legal issues' showed very poor agreement (kw 0.11 (95 % CI -0.05 to 0.27)).
Inter-rater reliability was marginally higher for team ratings, with mean 73 % (SD 15) absolute agreement compared with 63 % (SD 9) for individual ratings in Part A. For Part B, the percentage agreement of team and individual ratings was similar [76 % (SD 21) and 75 % (SD 14)]. Kappa coefficients were not calculated because of the small number of paired team ratings.
In Table 3 we have also included an analysis of Fleiss multi-rater kappas. These (unweighted) kappa coefficients are substantially lower than the linear-weighted Cohen's kappa statistics for the same dataset: mK for Part A ranged from 0.07 to 0.79, and for Part B from 0.07 to 0.86. They are included to highlight this difference (see "Discussion" section). Although not strictly comparable with the round 1 analysis (because of the different number of cases in the two evaluations) they give a broad indication that agreement is similar to that seen in round 1 in the physical and cognitive domains, but somewhat lower in the social/behavioural domain.
Despite the lower values compared with linear-weighted kappas, the mK coefficients generally reflect a similar pattern, identifying the same three poorly performing contextual items: 'Employer contact', 'Vocational support' and 'Legal issues'.

Intra-rater Reliability
Intra-rater reliability is summarised in Table 4.
Item-by-item analysis again showed 'substantial' to 'almost perfect' intra-rater agreement across all Part A items (kw 0.71-0.95), and agreement was generally also satisfactory for Part B, with the exception of two items, 'Vocational support' and 'Legal issues', which showed moderate agreement (kw 0.50 and 0.54 respectively).
Intra-rater reliability improved for Part A items when clinicians rated in teams: mean 82 % (SD 8) agreement for team scoring compared with 69 % (SD 11) for individual rating. Once again, however, team and individual ratings were similar for Part B items: mean 83 % (SD 10) and 82 % (SD 13) respectively.

Discussion
In this article we have described an iterative process of evaluation and adjustment of the WSS, across two continents and in service settings spanning hospital and community. This approach was deliberately adopted to ensure that the final tool would be applicable across a range of health cultures and levels of experience.
The initial evaluation, based on case histories in the context of community-based rehabilitation in New Zealand, led to a significant restructuring and re-design to make the tool more useable by clinicians. The subsequent vignette-based study, centred on a primarily hospital-based service in the UK, demonstrated acceptable levels of scoring accuracy and reliability for the WSS Part A, both between raters and over time. Team ratings are expected to be more reliable, which may reflect both a learning effect and the instruction in the manual to record the lower score in the event of disagreement. In this study the tendency for teams to record lower item scores than individual raters suggests that they were following this instruction, which tends to reduce variation. Nevertheless, although team ratings were marginally more reliable than individual ratings, the latter still achieved very acceptable overall levels of accuracy and reliability and may therefore be considered adequate for clinical practice.

Table 2 Agreement between reference standard scores and ratings by individuals and teams, using pooled data from times 1 and 2 (individual raters: n = 84 paired ratings)
Table 3 Inter-rater agreement between individual and team ratings (time 1 only; individual raters: n = 105 paired ratings)

Part B (contextual items) proved more problematic to rate, despite the adjustments made after the preliminary round of evaluation. Rating of 'Personal' factors, such as having the desire or confidence to work, realistic expectations and personal support from family/friends, achieved acceptable scoring accuracy and reliability in all cases. However, 'Environmental factors', in particular 'Employer contact and flexibility' or 'Vocational rehabilitation/support', appeared to be more open to interpretation. 'Barriers to return to work' caused confusion because of the negative scoring system. However, even after this was corrected, the item concerning 'Legal issues' continued to show poor reliability. These scoring difficulties could either be due to the item descriptions or to the fact that the vignettes for these items were harder to rate.
Vignettes were designed to mimic the complexity of cases seen in clinical practice with some being harder than others to rate, so the authors reviewed all the reference standard scores and discussed the ratings with the teams. These reflections identified some particular problems with the 'zero' and 'positive' scores for contextual items. For example, an on-going legal compensation claim was generally accepted as a negative influence on return to work, but the absence of such a case was variably interpreted as either 'neutral' or 'positive'. Further adjustments have since been made to the scoring manual to instruct the rater to default to 'zero' scores for the contextual items, and only rate scores on either side of this if a given factor presents a clear positive or negative influence. Nevertheless, the findings presented here across several rounds of evaluation suggest that the contextual items are (and probably always will be) vulnerable to variable interpretation. Whilst clinicians agree that these are important factors to take account of in individual care planning, for the purpose of measurement, we suggest they should be used as a clinical checklist alongside the WSS, rather than as an integral part of the measurement tool.
In this study, we also explored a variety of approaches to measuring agreement. Because the WSS items comprise seven scoring levels, weighted kappas were considered to be most relevant and we also reported percentage of agreement ±1 scoring level. As may be expected, the weighted kappa statistics provided an estimate of agreement somewhere between the percentage of 'absolute agreement' and 'agreement ±1'. For inter-rater reliability, there was some concern that multiple pairwise comparisons using linear-weighted kappas may give spuriously high results, and for this reason we also applied the (unweighted) Fleiss multi-rater method, which gave substantially more conservative estimates of agreement, again as would be expected. The future design of a weighted method for calculating multi-rater kappa statistics would be a welcome statistical development, but to our knowledge no such technique currently exists. In the meantime, these differences highlight the importance of reporting the statistical methods used, as different methods may otherwise lead to markedly different conclusions about the reliability of tool performance.
Our findings must be interpreted in the light of some clear limitations to the study.
1. Vignettes were chosen in this evaluation to provide a stable, fictional presentation of a patient's functional ability. This ensures that different raters are basing their scores on the same information. They do, however, provide a limited insight into the patient's holistic ability, and cannot entirely replace field testing.
2. The vignette sample size was strictly suboptimal. The computation of Cohen's kappa values is often said to require a sample size of 2K² [4], where K is the number of scoring categories, which in this context (K = 7) would be 98. There is always a balance to be found between the use of hard-pressed clinicians' time, and obtaining optimal numbers for statistical analysis. Introducing more vignettes would have reduced the number of volunteers, so a compromise was accepted. However, as increasing the sample size tends to increase the estimates of agreement, which were high even with a small sample, it is unlikely an expanded dataset would have altered our conclusions significantly.
3. The weighted kappa coefficients for inter- and intra-rater agreement were calculated on pooled samples for all six raters and thus effectively represent the average across the group. This is a potential statistical limitation if the data are unbalanced. At a clinical level, however, the pooling of data supports generalisability as the full range of inter-rater variability is represented.
4. The unit where the round 2 (UK) evaluation was undertaken was one of the locations in which the WSS was developed. Although the raters were purposively chosen to represent a range of experience, they would undoubtedly have had more experience with the WSS than the average clinical centre using the scale, at least when it is first introduced.
Notwithstanding these limitations, the findings provide preliminary evidence of reliability, which supports use of the WSS as a reproducible tool for assessing work-ability. Further testing in a wider sample and in the context of clinical application is now recommended.