Introduction

Inter-rater Reliability In Social Medicine and Work Ability Assessment

The assessment of work ability in social medicine requires observation and exploration data [1]. Work ability cannot simply be judged by asking the patient. However, observation data in particular vary with the individual decisions made within an interviewer’s assessment and with the assessment setting (e.g. an outpatient assessment of one hour versus an inpatient assessment with observation data collected over 5 weeks by different professions). To minimize variance in the interpretation of observation-based data and to obtain more objective standards for work ability measurement, inter-rater reliability needs to be examined. Inter-rater reliability compares the assessments of two or more raters for the same clinical cases [2].

Inter-rater reliability has been part of the evaluation of several participation-oriented instruments in recent years. For example, in the measurement of Activities of Daily Living (ADL; [3]) and Global Assessment (CGAS; [4]) for children and in elderly care, fair to high inter-rater agreement of r = 0.27 to r = 0.94 was reported [5,6,7]. Lenze et al. [8] found high intra-class correlations for the Pittsburgh Rehabilitation Participation Scale (PRPS).

Inter-rater reliabilities are routinely assessed for symptom observation. For example, instruments assessing obsessive–compulsive symptoms (AMPD Rating Scale; [9]) and disorders (Y-BOCS; [10]) show inter-rater reliabilities of r = 0.47 to r = 0.93 [9, 11]. For the Children’s Depression Inventory [12] and the German structured diagnostic interview for mental disorders in children and adolescents (Kinder-DIPS; [13]), high kappa statistics between r = 0.86 and r = 0.90 were reported [14].

In assessment of social performance and work ability there are also some measures with reported inter-rater reliabilities. Schaub, Brüne, Jaspen, Pajonk, Bierhoff and Juckel [15] investigated the inter-rater agreement in the Personal and Social Performance Scale (PSP; [16]) in a sample of chronic schizophrenia patients and found ICC between r = 0.54 and r = 0.82 [15]. In a validation study of MELBA [17], a measurement tool for vocational rehabilitation and inclusion, Achterberg, Wind, Prinzie and Frings-Dresen [18] reported poor to moderate and some excellent ICC. For Assessment of Life Habits (LIFE-H; [19]) for older adults Noreau et al. [20] found poor and good ICC between r = 0.30 and r = 0.97 [20].

Reviewers have criticized the low inter-rater reliabilities in several studies within social medicine practice [21]. Low inter-rater agreement could arise from different or undefined reference contexts serving as a rule [1]. The question of how naturalistic the conditions for the assessment of work ability are, e.g. in a clinical setting, has to be kept in mind. Nevertheless, the examination of objectivity constitutes an important aspect of test accuracy and compliance with the quality standards of psychometric tests [22].

The Mini-ICF-APP – an Instrument for Objective Work Ability Assessment

The Mini-ICF-APP is one of the leading measurement tools for work ability assessment and permits a systematic description at the level of disabilities and impairments [23]. It was adapted from the structure of the ICF and combines functions, capacities and participation in an interactive multidimensional construct [24]. The Mini-ICF-APP has been validated in German [25], Italian [26], English [27] and Polish [28]. Psychological capacities and participation (impairment) can be described by using a semi-structured interview on 13 capacity dimensions: adherence to regulations, planning and structuring of tasks, flexibility, applying expertise, capacity to judge and decide, endurance, assertiveness, contacts with others, teamwork capacity, self-care, mobility, proactivity and familiar and intimate relationships [25, 27]. Usually the Mini-ICF-APP is used for measuring work ability in the context of either an existing workplace or potential workplaces on the general labour market [1, 25, 27]. The Mini-ICF-APP and its capacity dimensions are now core content of the AWMF guidelines for the social medicine assessment of mental disorders [29]. Additionally, a self-rating version and a version for the assessment of capacity demands of workplaces have been developed [30, 31].

Inter-rater Reliability of the Mini-ICF-APP

In the Mini-ICF-APP validation studies, inter-rater reliabilities were reported as well. Some critical voices described the Mini-ICF-APP as an “unsuitable instrument” [32] with low inter-rater reliabilities [33], such as r = 0.43 in the study of Kunz et al. [34]. However, to address the problem of divergent ratings due to different reference contexts, Linden et al. [25] had conceptualized a training for Mini-ICF-APP assessment, which increased the inter-rater reliability from r = 0.70 to r = 0.92 [25]. The manual and exploration guidelines provide the basis for Mini-ICF-APP ratings [25, 31]. The key prerequisite for reaching good inter-rater reliability is that raters refer to the same, clearly defined context. The rating can only be as good as the definition of the reference context and the raters’ adherence to this definition when making their ratings. The reference context must be defined before the exploration and the rating. Contexts may be, for example, a specific workplace with specific capacity demands, the general labor market, which requires a basic level in all capacities, or living on one’s own with the demands of daily housework duties and general life activities [31].

Several studies have reported good to excellent inter-rater reliabilities of the Mini-ICF-APP. Muschalla [30] found high kappa between r = 0.71 (endurance) and r = 0.94 (mobility) for the German version. Balestrieri et al. [26] found ICC between r = 0.79 and r = 0.98 in the Italian validation study. The weakest ICC was found for mobility [26]. Molodynski et al. [27] reported a mean score of r = 0.89 for the English version of the Mini-ICF-APP. The authors of the Polish Mini-ICF-APP study found ICC between r = 0.59 (resistance and endurance) and r = 0.80 (competence to judge and decide, proactivity and spontaneity) [28]. Inter-rater reliability can increase the longer the raters are trained [25].

Training Effects in Professionals

The principle of “learning by doing” and gaining experience over time is applied in many occupational fields [35]. Psychotherapists and physicians carry out practical training [36, 37]. Meta-analyses by Dush, Hirt and Schroeder [38] and Lyons and Woods [39] found significant effects between the practical work experience of psychotherapists and the effectiveness of the treatment of child behavior disorders [38] and rational-emotive therapy [39]. Dauwerse, Stolper, Molewijk and Widdershoven [40] emphasize the importance of practice over time in the training of health care professionals.

Training effects over time were also found in the assessment of psychological issues. Warshaw, Dyck, Allsworth, Stout and Keller [41] conducted a long-term study to test the inter-rater reliability of the Longitudinal Interval Follow-up Evaluation (LIFE) with measurement points after 1, 6 and 12 months. The authors compared new raters with experienced raters and found a tendency towards higher intra-class correlation coefficients for the experienced raters at all measurement points [41]. With increasing rater experience, the focus on situation-specific features and on the contextual reference increases, as does the interpretation of observed behavior in contrast to behavior descriptions [42, 43].

Vocational Training

Until now, several research studies have shown the usefulness of vocational trainings: Lysaker, Davis, Bryson and Bell [44] examined the effect of a vocational rehabilitation program (Indianapolis Vocational Intervention Program (IVIP)) for patients with schizophrenia spectrum disorders compared with the usual service in job placement. Participants in the rehabilitation program found a job significantly faster and generally performed better in the workplace than participants of the control group. An RCT study was also conducted for the application of vocational rehabilitation programs for affective disorders. Bejerholm, Larsson and Johanson [45] reported a higher effectiveness of a disorder-specific vocational rehabilitation program (in this case Individual Enabling and Support (IES)) compared to traditional vocational rehabilitation. In addition to the higher rate of employment and higher number of working hours per week, the depression scores were significantly reduced. A controlled study by Watzke, Galvao and Brieger [46] showed a significantly higher employment rate, a reduction in symptoms and a subjectively higher level of well-being and functionality in the 9-month follow-up after a vocational rehabilitation program in patients with various mental disorders compared to the control group. In two controlled, randomized studies with patients of different mental disorder groups, Wallace and Tauber [47] emphasized the effectiveness of workplace-related skills training. With the vocational skills trainings, patients were more likely to find a workplace, worked a higher number of hours and earned more than patients without workplace skills training [47]. An RCT study by Berglund et al. [48] compared the employment rate between the two vocational rehabilitation programs Multidisciplinary assessments and individual rehabilitation management and Acceptance and commitment therapy (ACT) and a control group. 
Participants in the training programs were able to return to their jobs significantly more often and reported increased employability compared to the control group [48]. A systematic review by Michon et al. [49] reports the person-related predictors of employment outcomes after participation in vocational rehabilitation programs. In 8 of 16 studies, better work performance, higher self-efficacy and increased social functionality were identified as strengthened factors after a vocational rehabilitation program [49]. In the field of early psychosis, Bond et al. [50] conducted a systematic review of the effectiveness of early intervention programs for employment. Across the eleven studies, 29% of patients were employed with the usual support, whereas the employment rate among patients receiving vocational services was 49% [50]. In the rehabilitation of patients after acquired brain injury, a review of 12 studies by Donker-Cools, Daams, Wind and Frings-Dresen [51] showed that workplace training, skills training and education are effective vocational rehabilitation programs.

In Switzerland, persons with chronic mental health problems and problems with work ability can participate in longitudinal vocational reintegration programs. The program investigated here is individualized according to the concrete health and participation problems of the person. The participants are integrated in companies and work in concrete workplaces that fit their capacity level. They are monitored and supported by social work professionals and physicians. Participants’ capacity levels (and impairments) are monitored over the course of the longitudinal program (in the present investigation: from 2018 to 2019).

Question of Research

This study was the first evaluation of the Mini-ICF-APP in Switzerland and included three points of measurement within vocational trainings. The Mini-ICF-APP has been a fixed component of reliable insurance-medicine assessment in questions of invalidity and work ability in Switzerland since 2014. In the context of Switzerland’s social security law [52], it is indispensable to examine whether quality standards are fulfilled and whether the assessments of different reviewers correspond in the Mini-ICF-APP. The present study took on this task and investigated quantitative and qualitative changes in Mini-ICF-APP ratings across three points of measurement. Instead of using the mean score, the inter-rater reliability was calculated separately for each of the 13 capacity dimensions. The capacity assessment was done with participants of a vocational training program lasting several months.

The questions of research were therefore:

  1. What is the inter-rater reliability of the 13 capacity dimensions of the Mini-ICF-APP in a naturalistic Swiss vocational training setting?

  2. (How) Do the inter-rater reliabilities of the capacity ratings change across three measurement points?

Methods

A sample of training reports on 61 vocational training participants was investigated. The reports were collected from three different disability insurance institutions in Switzerland (CIS, Ritec, Sonora) between January 2018 and August 2019. An assessment of the capacity dimensions of the Mini-ICF-APP was carried out at the beginning of the intervention (t0), after three months (t1) and after six to nine months (t2). The vocational training was done with patients with chronic illnesses of diverse types, e.g. musculoskeletal diseases, malignant tumors, diabetes or organ damage, with resulting mental impairments and mental disorders. The 13 capacity dimensions of the Mini-ICF-APP were rated by two different raters in each training report. The raters had two different professional backgrounds: the Swiss Disability Insurance employs consultants and job attendants. Consultants are integration specialists who inform and advise insured persons about disability insurance benefits and support them in their vocational integration process in the cantonal office. The job attendants are Master Social Professionals who supervise the vocational integration process in the training centre. Within the vocational training, both professions focus on capacity training. On behalf of the disability insurance, integration programs for vocational rehabilitation are carried out in the training centers. The raters conduct these training programs and are available to advise the patients during the intervention. The training contents simulate the specific field of work and aim for the (re)development of job-relevant capacities. Therapeutic methods, e.g. learning relaxation techniques, are also used.

Each rater completed a rater training for the Mini-ICF-APP inspired by the manual [25]. The training covered the importance of an accurate description of the reference context, the assessment of capacities and rules of observation. The raters assessed the capacity impairments with respect to the reference context “work on the general labour market” in a semi-structured interview for each of the 13 capacity dimensions [25]. In the sense of a naturalistic study, the raters based their assessment of the capacity impairments on general employability. Each of the 13 capacity dimensions was rated on a scale from 0 to 4: 0 = no impairment; 1 = mild impairment, i.e., there are some difficulties for the person to fulfill the demands, but there are no negative consequences; 2 = relevant impairment, i.e., there are visible problems in fulfilling the demands; 3 = severe impairment, i.e., help from others is needed regularly in order to fulfill the demands and activities; 4 = full impairment, i.e., no respective activity is possible and complete dispensation is necessary. Sources of information were the observation during the interview and the information on capacity impairment explored from the patient. The exploration was behavior-oriented.

Spearman correlations for the inter-rater reliabilities and an analysis of variance with repeated measurement for the analysis of significant changes in the assessment of capacities over the vocational training program were calculated. As there was no confirmatory research question for this naturalistic sample, no sample size was calculated in advance. According to McHugh [2], correlative measures for determining inter-rater reliabilities such as kappa or Spearman correlations can already be used with a small sample size of 5 participants or more. Sim and Wright [53] found that between two raters in a dichotomous variable, a significant kappa value with a power of 80% can be achieved in a one-tailed test when there are between 8 and 39 participants, depending on the level of the kappa value.
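As an illustration of the agreement measure named above, the following minimal sketch computes a Spearman correlation between two raters for one capacity dimension. It is not the authors' analysis script: the function names and the example ratings are hypothetical, and the only assumptions are the 0 to 4 rating scale described in this section and the standard definition of Spearman's coefficient as the Pearson correlation of average ranks (which handles the many ties that a five-point scale produces).

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(rater_a, rater_b):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(rater_a), average_ranks(rater_b))


# Hypothetical ratings (0-4) of five participants on one capacity dimension.
consultant = [0, 1, 1, 2, 3]
job_attendant = [1, 1, 2, 2, 4]
print(round(spearman(consultant, job_attendant), 2))
```

Note that a rank correlation captures whether two raters order the participants similarly; it does not penalize one rater being systematically stricter than the other, which is one reason why kappa or ICC values from other studies are not directly comparable.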

Since not all participants could be assessed at all three points of measurement, missing values were replaced by mean values when calculating the capacity ratings over the course of the vocational training program. The mean value imputation was performed by using the mean value from the ratings of the same participant at a different point in time (mean value of the time series) for calculation. To calculate the capacity ratings development over the course of a vocational training program, the mean value of both ratings was used for each point of measurement (Figs. 1, 2, 3 and 4).
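The imputation and averaging steps described above can be sketched as follows. This is a hedged illustration only: the function names and example values are hypothetical, and two details are assumptions not stated in the text, namely that the raters' ratings are combined first and imputed afterwards, and that a single available rating is used as is when the second rater is missing.

```python
def combine_raters(rater_a, rater_b):
    """Mean of the available ratings per time point; None if nobody rated.

    Assumption: if only one rater is available, that rating is used as is.
    """
    combined = []
    for a, b in zip(rater_a, rater_b):
        available = [r for r in (a, b) if r is not None]
        combined.append(sum(available) / len(available) if available else None)
    return combined


def impute_time_series_mean(series):
    """Replace missing time points with the mean of the same participant's
    observed time points (mean of the time series)."""
    observed = [value for value in series if value is not None]
    if not observed:
        return list(series)  # nothing to impute from
    fill = sum(observed) / len(observed)
    return [fill if value is None else value for value in series]


# Hypothetical participant: ratings at t0, t1, t2 for one capacity dimension.
rater_a = [1, None, 3]
rater_b = [3, None, None]
per_time_point = combine_raters(rater_a, rater_b)
print(impute_time_series_mean(per_time_point))
```

Because the replacement value is the participant's own average, an imputed time point always lies between the observed ones, which flattens the apparent change over time.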

Fig. 1 Development of capacity impairment in the dimensions adherence to regulations, planning and structuring of tasks and endurance over the course of the vocational training (N = 61)

Fig. 2 Development of capacity impairment in the dimensions applying expertise and capacity to judge and decide over the course of the vocational training (N = 61)

Fig. 3 Development of capacity impairment in the dimensions assertiveness, contacts with others, teamwork capacity and familiar and intimate relationships over the course of the vocational training (N = 61)

Fig. 4 Development of capacity impairment in the dimensions flexibility, mobility, self-care and proactivity over the course of the vocational training (N = 61)

Results

At the first measurement point (t0) two raters were available in 19 cases. At t1 there were 17 and at t2 14 cases which included ratings from two raters. The capacity impairments were rated by consultants and job attendants (Table 1).

Table 1 Professional background of raters

Inter-rater reliabilities were low at the first point of measurement and increased over the course from t0 to t2 (Table 2). All 13 dimensions showed high and significant correlations at t2 (r = 0.55*–0.97**). The highest concordance was found for adherence to regulations, mobility and proactivity (r > 0.90**).

Table 2 Spearman correlations as measures of agreement between the assessments by master social professionals and consultants for the 13 dimensions of the Mini-ICF-APP at the three different points of measurement

None of the 13 capacity dimensions showed a statistically significant increase or decrease in capacity impairment over the course from t0 to t2. In ten of the 13 capacities (adherence to regulations, endurance, applying expertise, capacity to judge and decide, assertiveness, contacts with others, teamwork capacity, familiar and intimate relationships, self-care, proactivity) there was a tendency towards increasing impairment. Only three capacities (planning and structuring of tasks, flexibility, mobility) tended to decrease in impairment (Figs. 1, 2, 3 and 4). The high number of capacities with increasing (rather than declining) impairment contradicts the idea that the occupational training should improve the patients’ capacity levels.
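For readers who want to retrace the type of test behind this result, the following is a minimal sketch of a one-way repeated-measures analysis of variance for a single capacity dimension across t0, t1 and t2. The function name and the example scores are hypothetical and are not the study data; the sketch assumes complete data (after imputation) and returns only the F value and its degrees of freedom, which would then be compared against an F distribution to obtain a p value.

```python
def repeated_measures_anova(scores):
    """One-way repeated-measures ANOVA.

    scores[i][j] is the impairment rating of participant i at time point j.
    Returns (F value, df for time, df for error).
    """
    n = len(scores)      # number of participants
    k = len(scores[0])   # number of time points
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    time_means = [sum(row[j] for row in scores) / n for j in range(k)]
    subject_means = [sum(row) / k for row in scores]

    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_time = n * sum((m - grand_mean) ** 2 for m in time_means)
    ss_subjects = k * sum((m - grand_mean) ** 2 for m in subject_means)
    # Residual variation after removing time and participant effects.
    ss_error = ss_total - ss_time - ss_subjects

    df_time = k - 1
    df_error = (n - 1) * (k - 1)
    f_value = (ss_time / df_time) / (ss_error / df_error)
    return f_value, df_time, df_error


# Hypothetical ratings of three participants at t0, t1 and t2.
example = [
    [0, 1, 2],
    [1, 1, 4],
    [2, 4, 3],
]
f_value, df_time, df_error = repeated_measures_anova(example)
print(f_value, df_time, df_error)
```

The sketch divides by the error sum of squares, so it is undefined when every participant changes by exactly the same amount across time points; a statistics package would also handle sphericity corrections, which are omitted here.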

Discussion

First, the results of the study show that the inter-rater reliability increased over the course of continuous rater training from t0 to t2 in all 13 capacity dimensions. Learning effects of the raters can be assumed. It can also be assumed that, along with the rater training, rating rules have been internalized over time. In Switzerland, very specific quality criteria and detailed descriptions of the rating rules are used in psychological assessments [54], which might also have contributed to higher inter-rater reliabilities over the course until t2. In conclusion, our study shows that collegial exchange and rater trainings for the measurement of capacity impairments are useful to increase inter-rater agreement.

Second, there were no significant changes in the participants’ capacity (impairment) levels over the course of a long-term vocational training program. In most capacities there was even a tendency towards increased impairment. Possible explanations are that the findings show a real tendency for capacities to worsen, that raters become more critical in their assessment over time, or that the work required of participants in the vocational training program becomes more and more complicated over time. Possibly the raters gain more detailed knowledge of the participants’ capacities and impairments through several observation appointments. Another study [55] reported similar findings of increasing capacity impairments during a five-week occupational therapy treatment. Depending on the changes over the course of the vocational training (e.g. changing working settings, further developments in qualification stages or in the participant), capacity impairments may be assessed with different reference contexts at different times. Furthermore, observation data (e.g. concrete behaviors) can only be accumulated over the course and are not all known at the beginning (t0). For example, a person in training as a salesperson may appear unimpaired in teamwork capacity when asked before the training (t0), but impairments may become visible during the training course (e.g. if the person is on a work placement in a company and can be observed to produce conflicts in the team). When impairments become visible over the course, a capacity impairment may be rated at t1 or t2 (in contrast to t0, when no impairment was observable).

Linden and Noack [55] interpreted the effects as a changed rater assessment. Similarly, in our present study it can be assumed that the raters developed a more sophisticated assessment of the training participants over several months and thus evaluated in more detail and more critically.

Most capacity impairments were low to moderate (rating 1–2 in a range of 0–4). Due to the setting, it can be assumed that the sample of vocational trainees was homogeneous in that there were moderate impairments (which form the basis for treatments like vocational trainings), but not total impairments (in which case there would be no basis for a positive prognosis of a vocational training).

Limitations

One limitation for the interpretation of the results was the small number of 14 (t2) to 19 (t0) training reports available for the calculation of the inter-rater reliabilities. In addition, because the ratings came from mixed professions, no statement can be made about how different professional groups handle the Mini-ICF-APP. Another limitation may be the rather low power of the analysis of variance with repeated measures. Even though analysis of variance is a very robust procedure and overcomes such limitations [56], future studies should target larger samples. However, the small sample reflects the normal naturalistic environment in which the investigation was conducted. As this is a naturalistic sample, no sample size was calculated a priori. Moreover, some data were missing and were replaced by mean values. The use of the mean as a replacement value could have levelled the overall development in some of the 13 capacity dimensions.

Further Discussion and Outlook: Meaning of Vocational Trainings

The present study focused on the assessment of capacity impairments over the course of vocational trainings in a naturalistic setting. It can therefore not provide data on the efficacy of vocational trainings. The impact of vocational trainings on capacity levels might be evaluated in randomized controlled intervention studies using stable and standardized reference contexts. Until now, several research studies have shown the usefulness of vocational trainings [44, 45, 46]. With concrete capacity assessments, evaluations can be even more differentiated and focused on behavior, activities and capacities – i.e., on what is relevant in concrete work settings (more than the type of illness or symptoms).