Integrated Mental Healthcare and Vocational Rehabilitation for People on Sick Leave with Anxiety or Depression: 24-Month Follow-up of the Randomized IBBIS Trial

Integration of vocational rehabilitation and mental healthcare has shown some effect on work participation at 1-year follow-up after sick leave with depression and anxiety. We aimed to study the effect on work and health outcomes at 2-year follow-up. We conducted a randomized trial to study the effectiveness of an integrated intervention (INT) compared with service as usual (SAU) and best-practice mental healthcare (MHC). We included 631 participants, and at 24-month follow-up we detected no differences in effect between INT and SAU. Compared to MHC, INT showed faster return-to-work (RTW) rates (p = 0.044) and a higher number of weeks in work (p = 0.024). No symptom differences were observed between the groups at 24 months. In conclusion, compared to SAU, INT was associated with a slightly higher work rate, reaching borderline statistical significance, at 12-month follow-up and with lower stress levels at 6-month follow-up. The disappearance of the relative effect between 12 and 24 months may be explained by the fact that the intervention lasted less than 12 months or by delayed spontaneous remission in the SAU group after 12 months. Despite the lack of effect at long-term follow-up, INT still performed slightly better than SAU overall. Moderate implementation difficulties may partly explain the absence of the hypothesized effect. Integrated intervention, as implemented in this trial, showed some positive effects on mid-term vocational status and short-term stress symptom levels. However, these effects were not sustained beyond the duration of the intervention. Supplementary Information: The online version contains supplementary material available at 10.1007/s10926-023-10094-7.

• In the statistical analysis plan, we planned to check the proportional hazards assumption of the Cox regression and, if it failed, to adjust for different kinds of interactions between time and group assignment. In the primary outcome analysis of the difference between the INT and SAU groups we found non-proportional hazards but nevertheless did not adjust for time, since the original intention was to conservatively test a rejection of the null hypothesis, not vice versa.
• We discarded the outcome "time from the first day of return to work until recurrent sick leave", since we realized that, due to a high risk of bias, this time would not consistently reflect a positive outcome.
• Neither the statistical analysis plan nor the study design article mentioned presenting a proportion-over-time curve. Post hoc, we decided that this would be beneficial.
• In the statistical analysis plan, we had planned adjusted analyses with any variables that differed at baseline. Although age was unevenly distributed at baseline, we did not adjust for age in the main or sensitivity analyses.

Social and work-related function measured by WSAS (4)
Burn-out symptoms measured by Karolinska Exhaustion Scale (KES) (5)
Health-related quality of life measured by EQ-5D-5L (6)
General Quality of life scale measured by Flanagan's QOLS (7)
Self-efficacy concerning symptoms measured by IPQ subscale on personal control (8)
Return to work self-efficacy measured by RTW-SE (9)
General self-efficacy measured by General Self-efficacy scale (GSS) (10)
Presenteeism measured by Stanford Presenteeism Scale (SPS) (11)

Table modified from Statistical analysis plan (12)
Definition of the beneficial outcome direction for all outcomes (6-, 12-, and 24-month follow-up)

Outcome | Is "better outcome" defined by lower or higher numbers?
Health-related quality of life measured by EQ-5D-5L (6) | Higher (14)
General Quality of life scale measured by Flanagan's QOLS (7) | Higher
Self-efficacy concerning symptoms measured by IPQ subscale on personal control (8) | Higher (15)
Return to work self-efficacy measured by RTW-SE (9) | Higher
General self-efficacy measured by General Self-efficacy scale (GSS) (10) | Higher
Client satisfaction with treatment measured by CSQ-8 (16) | Higher
Presenteeism measured by Stanford Presenteeism Scale (SPS) (11) | Higher (17)

TITLE AND TRIAL REGISTRATION
This SAP is the detailed statistical analysis plan, expanding the scientific IBBIS protocol of the two IBBIS randomized clinical trials (ClinicalTrials.gov Identifiers: NCT02872051 (RCT1 1 ) and NCT02885519 (RCT2 2 )). Due to extensive methodological similarities between these studies, this SAP applies to both unless differences are mentioned explicitly.

SAP VERSION
This is the second version of the SAP.
Differences between versions 1 and 2 are documented in a .docx version of this newest version, in which all changes are tracked using the Track Changes function in Microsoft Word. This file is available by e-mail from the corresponding author on request.
In brief, the main changes revolve around the 24-month explorative outcomes: after analyses of the 6- and 12-month outcomes we realized that results across work outcomes were more heterogeneous than expected. For example, while the SAU group tended to have the fastest RTW at 6-month follow-up, explorative proportion-over-time curves suggested that this group might also have a higher degree of recurrent sick leave. Despite this, the SAU and INT groups still showed approximately the same number of weeks in work (when stability is disregarded) at 12-month follow-up. Consequently, we speculate that the SAU group experiences faster RTW but more recurrent sick leave. We therefore suggest that the number of weeks in stable work is a better outcome, since this number is only high if RTW happens early and is stable, i.e. not disrupted by sick leave recurrence. However, we do not know which stability threshold to apply. To explore this, we defined, prior to the 24-month analyses, three different outcomes with three different thresholds, see section 6.4. We plan these as sensitivity analyses.

PROTOCOL VERSION
Prior to publication of this SAP, plans have been described in the protocol (published on ClinicalTrials.gov via the links provided) as well as in two study design articles (SDA), corresponding to the two RCTs 2,3 .

INTRODUCTION: BACKGROUND, RATIONALE AND OBJECTIVES
Described thoroughly in the SDAs 2,3 . Furthermore, the protocol was published 3 on the official webpage of the organization (Mental Health Services, Capital Region of Denmark).

RANDOMIZATION
From SDA 2 : "The allocation ratio between the three arms is 1:1:1. A centralized randomization will take place according to a web-based computer-generated allocation sequence with varying block sizes kept unknown to the assessors. Odense Patient data Explorative Network (OPEN) is responsible for the randomization, administrative personnel in the IBBIS team perform the online randomization and the IBBIS team leader assign the participant to interventions and professionals.
We expect that service delivery can vary from municipality to municipality and the process of gaining a new job from unemployment will take longer time than returning to an existing job. Previous research has shown that diagnosis is a possible predictor of return to work 4 . Thus, the randomization is stratified according to 1) municipality 2) employment status (on sick leave from work vs. on sick leave from unemployment) 3) diagnosis […]" In RCT 1 diagnosis stratification is depression versus anxiety as primary diagnosis, and in RCT 2 diagnosis stratification is burnout vs. distress vs. adjustment disorder as primary diagnosis.

SAMPLE SIZE
The following is replicated from the protocol: The sample size is based on a sample size calculation using the 'Power and Sample Size' calculation programme 4 .

Type I error (α) risk
In each of the two RCTs we wish to conduct multiple comparisons (between 3 groups), and hence significance level must be as follows, due to Bonferroni correction:
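The corrected level itself is not reproduced in this version of the text. As a sketch only, assuming a conventional overall two-sided significance level of 0.05 divided equally across the three pairwise group comparisons (the overall level of 0.05 is an assumption, not a value stated here), the Bonferroni correction would read:

```latex
\alpha_{\text{corrected}} = \frac{\alpha_{\text{overall}}}{k} = \frac{0.05}{3} \approx 0.0167
```

Under that assumption the matching confidence level is 1 − 0.0167 ≈ 98.3%, which agrees with the confidence intervals specified for the Cox regression later in this SAP.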

Type II error (β) risk
The organizational constellation of the interventions has not yet been trialled, and thus the desired power is set to 1 − β = 0.9. If it turns out that we cannot include enough participants, the power could instead be set to 1 − β = 0.8.

Hazard ratio (R)
The difference in time to return to work will be expressed as a hazard ratio. We consider a hazard ratio of R = 1.5 sufficient, since even a 50% faster return to work in the intervention groups would convey a relevant economic benefit through a correspondingly smaller loss of productivity.

Mean time to return to work (M1)
The number of days from baseline to return to work is conservatively estimated at 210 days, based on an observed range of 104 to 210 days in the control groups of three Dutch RCTs [5][6][7], which were comparable to the control groups in the IBBIS RCTs. Hence, M1 = 210.

Inclusion time period (A)
We will include participants through 24 months,

Follow-up time (F)
We will follow participants for 365 days, during which they will contribute risk time to the survival analysis; hence F = 365.
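To give a feel for how the hazard ratio, power and significance level interact, the following minimal sketch computes an approximate number of required events for one pairwise comparison. It uses Schoenfeld's approximation for the log-rank/Cox setting, which is a swapped-in technique and not the algorithm of the 'Power and Sample Size' programme cited above. The significance level of 0.05/3 is an assumption; the function name `required_events` is hypothetical.

```python
import math
from statistics import NormalDist


def required_events(hazard_ratio: float, alpha: float, power: float,
                    allocation: float = 0.5) -> float:
    """Approximate number of events for a two-group comparison.

    Schoenfeld's approximation; `allocation` is the share of participants
    in one of the two groups (0.5 for a 1:1 comparison).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    log_hr = math.log(hazard_ratio)
    return (z_alpha + z_beta) ** 2 / (allocation * (1 - allocation) * log_hr ** 2)


# R = 1.5 and power = 0.9 are taken from the text above;
# alpha = 0.05 / 3 is an ASSUMPTION (Bonferroni over three comparisons).
events = required_events(hazard_ratio=1.5, alpha=0.05 / 3, power=0.9)
print(f"Approximate events needed per pairwise comparison: {math.ceil(events)}")
```

Translating events into participants additionally requires the probability of observing an event, which depends on M1, the inclusion period and the follow-up time; that step is deliberately left out of this sketch.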

Result
In each group, due to the above-mentioned variables, we need

STATISTICAL INTERIM ANALYSES AND STOPPING GUIDANCE
No interim analysis will be performed. We planned no stopping guidance.

TIMING OF FINAL ANALYSIS
The researchers who will perform the 6- and 12-month outcome analyses (AH and JF) will be blinded to intervention group allocation until the primary outcome analysis and all main analyses of 12-month follow-up outcomes are completed. The true randomization group allocation is concealed, with the values X, Y and Z reflecting group allocation in the blinded dataset. Until unblinding, the conversion formula for the randomization allocation variable is known only to, and kept hidden by, an administrative co-worker who will not perform or assist in any analysis.
At the time of publication of SAP version 1, baseline distributional analyses and unadjusted estimated marginal means analyses of self-reported numerical secondary outcomes at 6-month follow-up (and only these) had been calculated blinded. These will not be published, since they did not comply with the SDAs or with any SAP version.
All 24-month follow-up analyses will be conducted unblinded.

ANALYSIS POPULATIONS
All analyses are performed as intention-to-treat, unless otherwise stated.

WITHDRAWAL AND FOLLOW-UP
Due to legislative circumstances, participants can withdraw consent, and subsequently all personally sensitive data on these subjects will be deleted; the participant ID number (not the CPR number, but one generated for this research project) and the randomization result will, however, be stored. In sensitivity analyses these ID numbers will be included, as described in "handling of missing data".

BASELINE PATIENT CHARACTERISTICS
The following will be reported per RCT, per randomization allocation group. For all mean values of numeric variables, standard deviations will be reported. Distributional balance of these covariates (except educational level, which was only added in SAP v. 2, after the primary baseline analyses) will be assessed using one-way ANOVA for numerical data and the χ² test for categorical data; analyses with p ≤ 0.05 will define imbalanced baseline covariates.

ANALYSIS
The first subsections of this section (6) describe general strategies that apply to all analyses unless otherwise specifically stated. Subsection 6.8 contains the separate analysis strategies per outcome in 6.8.x.

COVARIATE ADJUSTMENT IN GENERAL
Analyses will be adjusted for the three stratification variables and no others, in compliance with the RCT analysis guidelines from the European Medicines Agency 5 .

SENSITIVITY ANALYSES IN GENERAL
As sensitivity analyses, all outcome analyses will be performed adjusted for any unbalanced baseline covariates, as defined in 5.2, Baseline patient characteristics.
Results of sensitivity analyses are only interpreted as supplements to the main analysis and will not substitute main results.

SENSITIVITY ANALYSES FOR QUESTIONNAIRE BASED, SELF-REPORTED DATA OUTCOME
As sensitivity analyses, self-reported (questionnaire-based) outcomes will be calculated with all missing outcome data replaced by a value equal to the mean of the outcome variable ± 2 standard deviations. Participants who withdraw from the study will be included in these analyses with all their data handled as missing.
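The replacement rule can be sketched as follows. This is an illustration under stated assumptions: missing scores are represented as `None`, the mean and standard deviation are taken over the observed scores only, and the function name `impute_extreme` and the example scores are hypothetical. The SAP does not state whether replaced values are clamped to the range of the scale, so no clamping is done here.

```python
from statistics import mean, stdev
from typing import Optional, Sequence


def impute_extreme(scores: Sequence[Optional[float]], direction: int) -> list[float]:
    """Replace missing scores with mean + direction * 2 * SD of the observed scores.

    direction = +1 gives the 'mean + 2 SD' scenario, -1 the 'mean - 2 SD' scenario.
    """
    observed = [s for s in scores if s is not None]
    fill = mean(observed) + direction * 2 * stdev(observed)
    return [fill if s is None else s for s in scores]


# Made-up example scores; None marks a missing questionnaire response.
scores = [10.0, 14.0, None, 18.0, None]
high_scenario = impute_extreme(scores, +1)
low_scenario = impute_extreme(scores, -1)
print(high_scenario)
print(low_scenario)
```

Running the outcome analysis once on each scenario brackets the main result between two deliberately extreme assumptions about the missing values.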

SENSITIVITY ANALYSES FOR REGISTER DATA BASED OUTCOMES
For register data-based outcomes, sensitivity analyses will be performed that include the participants who withdraw from the study, with all their outcomes handled as either the worst possible (never returning to work) or the best possible (returning to work as soon as possible).
Furthermore, all outcomes of number of weeks in stable return to work (outcomes number 9, 10, 11 and 12) are sensitivity analyses exploring the robustness of number of weeks in work (stability disregarded), which is outcome number 13 and was pre-planned before study commencement.

SUBGROUP ANALYSES IN GENERAL
All outcomes will be analysed with respect to the following subgroups: a) per primary diagnosis (in RCT1 anxiety vs. depression; in RCT2 distress vs. adjustment disorder vs. burnout); b) per employment status group at baseline (vacant vs. employed); c) per IBBIS team (two teams, Team North and Team Byen). Furthermore, d) outcomes will be analysed in two groups divided by relative time of randomization: the first and last temporal half of randomized participants.
Finally, e) we will test for interaction between diagnostic group and treatment allocation group/arm.
No outcomes have other subgroup analyses planned.

OUTCOME DEFINITIONS
The outcomes are reported as in the study design articles (except for selected outcomes, see alterations to SAP version 1 in the appendix). The numbers 1 through 64 denote the outcome numbers for reference purposes in this SAP section.

HYPOTHESES AND NULL-HYPOTHESES
Stated below are the generic versions of all three hypotheses (H1) and all three null-hypotheses (H0) that apply to each outcome.
What constitutes a "better outcome" is listed in section 6.6, defined for each outcome measure.

HYPOTHESES
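This superiority trial hypothesizes that, for all outcomes, H1A: Group 3, "Integrated IBBIS mental health care treatment and vocational rehabilitation", conveys better outcomes than Group 2, "IBBIS mental health care (and standard VR)"; H1B: Group 2 conveys better outcomes than Group 1, "Control group, treatment as usual (standard MHC and standard VR)"; and accordingly H1C: Group 3 conveys better outcomes than Group 1. The groups are thoroughly described in the IBBIS Protocol and the SDAs.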

NULL-HYPOTHESES
The corresponding null-hypotheses are H0A: Group 3, "Integrated IBBIS mental health care treatment and vocational rehabilitation", does not convey better outcomes than Group 2, "IBBIS mental health care (and standard VR)"; H0B: Group 2, "IBBIS mental health care (and standard VR)", does not convey better outcomes than Group 1, "Control group, treatment as usual (standard MHC and standard VR)"; and accordingly H0C: Group 3 does not convey better outcomes than Group 1.

OUTCOME BENEFIT DIRECTION
Referring to the hypothesis section, this table describes whether a "better outcome" is a higher or lower score on the numeric outcome variables.

Outcome | Is "better outcome" defined by lower or higher numbers?
Return to work self-efficacy measured by RTW-SE 17 | Higher
General self-efficacy measured by General Self-efficacy scale (GSS) 18 | Higher
Client satisfaction with treatment measured by CSQ-8 | Higher
Presenteeism measured by Stanford Presenteeism Scale (SPS) 20 | Higher 23

MISSING DATA IN GENERAL
In general, proportion of missing data will be reported per intervention group for all outcomes.

HANDLING OF MISSING DATA IN REGISTERS
For RTW outcomes (outcomes based on the DREAM register) we expected, prior to study inception, no missing data, due to the nature of the DREAM register. Missing data should only occur if a participant moves out of Denmark. We considered these events to be so rare in our data that we would handle such missing data as missing completely at random. Thus, no imputation or other correction was considered necessary. We will report the proportion of data missing.
We will report the number of censored participants per treatment group.
At the time of this updated version 2a of the SAP, we have realized that, against expectation, some data were missing due to DREAM database errors. We included the cases with missing data in sensitivity analyses to explore the potential impact of the missingness.

HANDLING OF MISSING DATA IN QUESTIONNAIRE BASED, SELF-REPORTED DATA OUTCOME
For questionnaire-based outcomes, missing data will be handled as missing at random. To handle this, 100 multiple imputations will be performed using the following variables: the stratification variables (diagnosis, municipality, employment status); age; gender; time to stable RTW; and psychometric variables at baseline and all follow-ups at outcome time (BDI, BAI, WSAS and PSS).

ANALYSIS METHODS PER OUTCOME GROUP
This section describes the details of the statistical analyses. Since several outcomes require exactly the same analysis methods, outcomes are grouped in the following description.

TIME TO RETURN TO WORK-OUTCOMES (OUTCOMES #1, #3 AND #8)
This section describes the primary outcome, time from baseline to RTW at 12-month follow-up (outcome 1), and the secondary outcomes, time from baseline to RTW at 6-month (outcome 3) and 24-month follow-up (outcome 8).
The 24-month follow-up outcome will be calculated no earlier than June 2020. The other two will be calculated immediately after the publication of this SAP, but before unblinding of the analysts.

CALCULATION OF THE OUTCOME: SPECIFIC MEASUREMENT AND UNITS (AND TRANSFORMATION, WHERE APPLICABLE)
Time from baseline to RTW is defined as the number of weeks from the randomization date to stable return to work. Stable return to work is defined as 4 consecutive weeks in work, i.e. with no sick leave benefit registered in the DREAM register during those 4 weeks and a so-called "branch code" in at least part of this 4-week period. A branch code means that the individual received salary from an employer in the period. Benefit codes are week-based and branch codes are month-based, so a period of 4 weeks may fall within one month or span two months; in the latter case, return to work is attained if at least one of the two monthly registrations contains a branch code. The time of the event is the first day of the four weeks.
At randomization, all participants are, according to the inclusion criteria, on sick leave from employment or vacancy. Some participants may be on sick leave from employment in a flexjob 7 and hence have received flexjob benefit during employment. On sick leave, this benefit is changed to flexjob sick-leave benefit, which is similar to the regular sick leave benefit received by participants not granted flexjob benefit prior to randomization. For participants granted flexjob benefit prior to randomization, RTW is defined as not receiving flexjob sick-leave benefit for four consecutive weeks, along with a registered branch code as mentioned above (or, alternatively, receiving no flexjob benefit but an ordinary salary, indicated by a branch code, during those four weeks).
For participants who at baseline are on sick leave from vacancy (but not receiving flexjob benefit), RTW can be defined either as mentioned above (four consecutive weeks without sick leave benefits and a branch code during those four weeks) or as receiving flexjob benefit for four consecutive weeks with a branch code during those four weeks.
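The basic rule for stable return to work (four consecutive weeks without sick leave benefit and a branch code somewhere in that window) can be sketched as below. The data layout is an assumption made for illustration: two weekly boolean series per participant, where `branch_code[w]` means that the month covering week `w` has a branch code. The function name `first_stable_rtw` is hypothetical, and the flexjob variants described above are not handled.

```python
from typing import Optional, Sequence

STABLE_WEEKS = 4  # stability threshold from the definition above


def first_stable_rtw(sick_benefit: Sequence[bool],
                     branch_code: Sequence[bool],
                     window: int = STABLE_WEEKS) -> Optional[int]:
    """Return the 0-based index of the first week of stable RTW, or None.

    A window qualifies if none of its weeks carries a sick leave benefit
    and at least one of its weeks falls in a month with a branch code.
    """
    n = min(len(sick_benefit), len(branch_code))
    for start in range(n - window + 1):
        weeks = range(start, start + window)
        no_benefit = not any(sick_benefit[w] for w in weeks)
        has_salary = any(branch_code[w] for w in weeks)
        if no_benefit and has_salary:
            return start
    return None


# Made-up participant: a short benefit-free spell is interrupted in week 4,
# and a branch code is registered only from week 6 onwards.
sick = [True, True, False, False, True, False, False, False, False, False]
branch = [False] * 6 + [True] * 4
event_week = first_stable_rtw(sick, branch)
print(event_week)
```

Returning the first week of the window mirrors the rule that the time of the event is the first day of the four weeks; participants for whom the function returns `None` would be censored at the end of follow-up.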

SPECIFIC ANALYSIS METHOD AND RESULT PRESENTATION
Comparisons of RTW time will be calculated as hazard rate ratios between groups (with corresponding 98.3% CIs), using a Cox regression model.
Kaplan-Meier curves will be presented to illustrate the cumulative incidence of the first stable return to work event in each trial arm.
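As an illustration of what such a curve contains, the sketch below computes the Kaplan-Meier estimate and reports it as cumulative incidence, 1 − S(t), for one trial arm. It is a plain-Python sketch, not the software used in the trial; the function name `km_cumulative_incidence` and the example data are hypothetical. Participants censored in the same week as an event are treated as still at risk for that event, which is the usual convention.

```python
from collections import Counter
from typing import Sequence


def km_cumulative_incidence(times: Sequence[int],
                            events: Sequence[bool]) -> list[tuple[int, float]]:
    """Return (week, 1 - S(week)) at each distinct event time.

    times:  week of stable RTW, or week of censoring
    events: True if stable RTW was observed, False if censored
    """
    event_counts = Counter(t for t, e in zip(times, events) if e)
    exit_counts = Counter(times)  # events and censorings leaving the risk set
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exit_counts):
        d = event_counts.get(t, 0)
        if d:
            survival *= 1 - d / at_risk
            curve.append((t, 1 - survival))
        at_risk -= exit_counts[t]
    return curve


# Made-up arm of six participants (weeks since randomization).
times = [5, 8, 8, 12, 20, 52]
events = [True, True, False, True, False, False]
curve = km_cumulative_incidence(times, events)
for week, incidence in curve:
    print(week, round(incidence, 3))
```

Plotting one such step curve per trial arm gives the presentation described above; the hazard ratios themselves come from the Cox model, not from these curves.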

COVARIATE ADJUSTMENT
Only for stratification variables, see 6.1 "Covariate adjustment in general".

STATISTICAL METHOD ASSUMPTION CONTROL
The proportional hazards (Cox) regression model assumes proportional hazards; this will be checked by performing a Schoenfeld (SF) residuals test and by visual inspection.

ALTERNATIVE ANALYSIS METHOD IN CASE OF ASSUMPTION FAIL
If the SF test is positive (p < 0.05), the analysis will be performed adjusted for the interaction between time and treatment group allocation. If the SF test is still positive thereafter, the analysis will instead be adjusted for the interaction between quadratic time (time²) and treatment group allocation. If the SF test is still positive thereafter, the analysis will instead be adjusted for the interaction between log(time) and treatment group allocation. If the SF test is still positive thereafter, the analysis with the highest p-value will be reported.

SENSITIVITY ANALYSES
See "6.2.2 Sensitivity analyses for register data based outcomes".

REPORTING AND STATISTICAL METHODS TO HANDLE MISSING DATA
For RTW outcomes we expect no missing data, due to the nature of the DREAM register. Missing data will only occur if a participant dies or moves out of Denmark. We consider these events to be so rare in our data that we will handle such missing data as missing completely at random. Thus, no imputation or other correction is necessary. We will report the proportion of data missing.
We will report number of censored participants per treatment group.

CALCULATION OF THE OUTCOME: SPECIFIC MEASUREMENT AND UNITS (AND TRANSFORMATION, WHERE APPLICABLE)
This outcome is calculated as the share of the treatment allocation group that was in stable RTW (≥ 4 weeks) at the time of follow-up. Stable RTW is defined exactly as in the primary outcome, see 6.8.1.1.

SPECIFIC ANALYSIS METHOD AND RESULT PRESENTATION
Pairwise odds ratios will be calculated using logistic regression.
In addition to the presentation of odds ratios for tests at 12-month follow-up and 24-month follow-up, graphs are presented with the proportions in stable work at each week (week 1-52 for 12-month follow-up and week 1-104 for 24-month follow-up) for each of the three trial-arms. No statistical test will be performed for differences at week 1-51 or week 53-103. These curves are explorative, descriptive analyses.
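For a single pairwise comparison without covariates, the odds ratio from a logistic regression equals the cross-product ratio of the 2×2 table, which the sketch below computes together with a Wald interval on the log scale. This is a simplification: the planned analyses also adjust for the stratification variables, which this sketch does not. The 98.3% level is carried over from the Cox regression section as an assumption, and the function name `odds_ratio_ci` and the counts are made up.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(events_a: int, n_a: int, events_b: int, n_b: int,
                  level: float = 0.983) -> tuple[float, float, float]:
    """Unadjusted odds ratio (A vs. B) with a Wald interval on the log scale.

    events_*: participants in stable RTW at follow-up; n_*: group sizes.
    All four cells of the 2x2 table must be non-zero.
    """
    a, b = events_a, n_a - events_a
    c, d = events_b, n_b - events_b
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)


# Made-up counts: 60 of 100 vs. 50 of 100 in stable work at follow-up.
odds_ratio, lower, upper = odds_ratio_ci(60, 100, 50, 100)
print(f"OR = {odds_ratio:.2f}, interval {lower:.2f} to {upper:.2f}")
```

The weekly proportion curves described above are purely descriptive and would simply plot `events / n` for each arm and week, without any test.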

COVARIATE ADJUSTMENT
Only for stratification variables, see 6.1 "Covariate adjustment in general".

STATISTICAL METHOD ASSUMPTION CONTROL
The assumptions of the model are assumed to be acceptable, due to large sample, binary outcome, categorical independent variable.

ALTERNATIVE ANALYSIS METHOD IN CASE OF ASSUMPTION FAIL
No alternative methods are planned, since assumptions are assumed to hold.

SENSITIVITY ANALYSES
See "6.2.2 Sensitivity analyses for register data based outcomes".

CALCULATION OF THE OUTCOME: SPECIFIC MEASUREMENT AND UNITS (AND TRANSFORMATION, WHERE APPLICABLE)
All outcomes are calculated as the sum of scores on the respective measurement scales.
All 6-month follow-up outcome analyses are calculated using baseline and 6-month follow-up observations. All 12-month follow-up outcome analyses are calculated using baseline and 6- and 12-month follow-up observations.

SPECIFIC ANALYSIS METHOD AND RESULT PRESENTATION
A linear mixed-effects model with unstructured covariance will be used. Results will be presented as pairwise between-group differences in outcomes, based on the estimated marginal means from the model, together with the confidence intervals of these differences.

COVARIATE ADJUSTMENT
Only for stratification variables, see 6.1 "Covariate adjustment in general".

STATISTICAL METHOD ASSUMPTION CONTROL
Assumption: normal distribution of scores. Control: Visual inspection by plotting the score residuals.
Assumption: normal distribution of individuals' score differences between baseline and follow-up. Control: Visual inspection by plotting the score difference residuals.
Assumption: Equality and homogeneity of variance. Control: The Breusch-Pagan test and Bartlett's test are used to identify violations of these assumptions.

ALTERNATIVE ANALYSIS METHOD IN CASE OF ASSUMPTION FAIL
If the tests or visual inspections indicate violations, a robust variance estimator is used to correct standard errors.

SENSITIVITY ANALYSES
See "6.2.1 Sensitivity analyses for questionnaire based, self-reported data outcome".

REPORTING AND STATISTICAL METHODS TO HANDLE MISSING DATA
Proportion and amount of missing data per outcome variable per follow-up event per treatment group will be reported.
To handle missing data, 100 multiple imputations will be performed using the following variables: the stratification variables (diagnosis, municipality, employment status); age; gender; time to stable RTW; and psychometric variables at baseline and all follow-ups at outcome time (BDI, BAI, WSAS and PSS).

CALCULATION OF THE OUTCOME: SPECIFIC MEASUREMENT AND UNITS (AND TRANSFORMATION, WHERE APPLICABLE)
From baseline to follow-up, the number of weeks in work per participant is calculated. A week is counted as being in work if no sick leave benefit has been received and a branch code is registered in the month of that week (branch codes are registered on a monthly basis if an individual has received salary from an ordinary job during that month).
For participants receiving flexible job benefit prior to randomization, and for participants on sick leave from vacancy, the same principles apply as described in 6.8.1.1, in the section "Time to return to work-outcomes (outcomes #1, #3 and #8)".
At 24-month follow-up, this analysis is conducted with three variations, each applying a different definition of return-to-work stability, as sensitivity analyses. Whereas the first analysis uses the definition of stability from the primary outcome (minimum four weeks, see section 6.8.1.1), these sensitivity analyses take a more conservative approach in which stable return to work is defined as a minimum of 4, 8 and 12 weeks in work, respectively.
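One possible reading of "weeks in stable work" is to count only those work weeks that belong to an uninterrupted run of at least the threshold length. The sketch below implements that reading for the three thresholds; it is an interpretation, not a confirmed operationalization of the outcome, and the function name `weeks_in_stable_work` and the example series are hypothetical.

```python
from itertools import groupby
from typing import Sequence


def weeks_in_stable_work(in_work: Sequence[bool], min_run: int) -> int:
    """Count work weeks that belong to runs of at least `min_run` consecutive weeks."""
    total = 0
    for working, run in groupby(in_work):
        length = sum(1 for _ in run)
        if working and length >= min_run:
            total += length
    return total


# Made-up weekly series with work spells of 2, 5 and 9 weeks.
in_work = [False, True, True, False] + [True] * 5 + [False] + [True] * 9
weeks_any = sum(in_work)  # stability disregarded
weeks_by_threshold = {k: weeks_in_stable_work(in_work, k) for k in (4, 8, 12)}
print(weeks_any, weeks_by_threshold)
```

In this made-up series the count shrinks as the threshold rises, which is the behaviour the sensitivity analyses are meant to probe: early but unstable return to work contributes less under stricter thresholds.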

SPECIFIC ANALYSIS METHOD AND RESULT PRESENTATION
Severely skewed data are expected for this outcome, so a robust Poisson regression model will be used to test the differences between groups.

COVARIATE ADJUSTMENT
Only for stratification variables, see 6.1 "Covariate adjustment in general".

ALTERNATIVE ANALYSIS METHOD IN CASE OF ASSUMPTION FAIL
If the χ² goodness-of-fit test is significant, a negative binomial regression model will be used instead. If the χ² goodness-of-fit test is also significant for this distribution, zero-inflated Poisson regression will be used.

SENSITIVITY ANALYSES
See "6.2.2 Sensitivity analyses for register data based outcomes".

CALCULATION OF THE OUTCOME: SPECIFIC MEASUREMENT AND UNITS (AND TRANSFORMATION, WHERE APPLICABLE)
For each group, the number of persons who have experienced the event 'stable return to work' and subsequently experienced the event 'recurring sick leave' is calculated. Recurring sick leave is defined as the first sick leave period, starting with the first week of receiving sickness benefit, after a period of stable return to work as defined in paragraph 6.8.1.1.

SPECIFIC ANALYSIS METHOD AND RESULT PRESENTATION
Only descriptive statistics will be reported for this outcome, and no differences between groups will be tested. For each group, the number of persons who have experienced stable return to work and the number of persons who have experienced recurrent sick leave are presented.

6.8.6 HARM MEASURES AT 12- AND 24-MONTH FOLLOW-UP (OUTCOME #49-64)

CALCULATION OF THE OUTCOME: SPECIFIC MEASUREMENT AND UNITS (AND TRANSFORMATION, WHERE APPLICABLE)
For each group, the number of persons who have experienced the harmful event is calculated.

SPECIFIC ANALYSIS METHOD AND RESULT PRESENTATION
Only descriptive statistics will be reported for this outcome, and no differences between groups will be tested.