Background

Risk of bias (RoB) assessment is a crucial methodological aspect of systematic reviews and an obligatory part of Cochrane reviews [1]. The 2008 Cochrane RoB tool [2] had six domains, one of which assessed “blinding of participants, personnel and outcome assessors” [3]. In the 2011 Cochrane RoB tool [4], this joint domain was split into two domains, one for blinding of participants and personnel (performance bias) and one for blinding of outcome assessors (detection bias) [1]. A new version of the Cochrane Handbook was published in 2019 [5], including the RoB 2 tool, in which the assessment of blinding of the three key groups of individuals is split into three separate assessments [6]. As research methods evolve, it is important to compare revised versions of such tools with their predecessors, to ensure that the revisions are indeed a step forward.

We have shown previously that Cochrane review authors frequently made inadequate RoB judgments using the 2011 RoB tool [7,8,9,10]. More specifically, in the performance bias domain, the overall proportion of RoB judgments following the recommendations from the Cochrane Handbook (adequate judgments) was 73.6%, and the main error in reported RoB judgments was the presumption that healthcare providers were adequately blinded [11]. In the detection bias domain, the frequency of adequate judgments was 77.9%, and the main error was the improper categorization of outcomes (subjective vs. objective) [12]. Furthermore, we noticed that Cochrane authors still frequently use the joint domain for blinding of key individuals by making modifications to the 2011 Cochrane RoB tool, even though the tool contains two distinct blinding domains.

The aims of this study were twofold: first, to analyze the frequency of usage of the joint blinding domain, and second, to assess the proportion of adequate assessments made in the joint versus single RoB domains for blinding in Cochrane reviews by comparing whether authors’ RoB judgments were supported by explanatory comments in line with the Cochrane Handbook recommendations.

Methods

Study design

This was a primary methodological study in which we analyzed the methodology of Cochrane reviews published in the Cochrane Database of Systematic Reviews (CDSR). The study protocol was prepared a priori, but the protocol was not published. Raw data generated in this study are available on the Open Science Framework project page on the link https://osf.io/fmjxz/.

Inclusion and exclusion criteria

CDSR was searched for all reviews of randomized controlled trials (RCTs) of interventions (or of both RCTs and non-randomized studies, in which case we analyzed RoB assessments only for RCTs) published from July 2015 to June 2016. This was a large, one-year convenience sample based on our previous studies [8, 11, 12], taken four years after the introduction of the 2011 RoB tool, by which time review authors could be expected to have adopted the new methodology (tool). An advanced search option was used to limit results by content type and publication date. We excluded diagnostic Cochrane reviews, overviews of systematic reviews, empty or withdrawn reviews, and other Cochrane reviews containing no RCTs about interventions.

Screening for study eligibility

Titles and abstracts of Cochrane reviews were screened for eligibility by the first author (OB) and verified by another author (SD), who checked that no reviews were erroneously included or excluded. A list of the analyzed Cochrane systematic reviews and included studies is presented in Supplementary file 1. The final unit of assessment was the risk of bias judgments for performance and detection bias of all the trials included in the eligible reviews.

Data extraction

The first author (OB) wrote a series of macro-instructions in Visual Basic for Applications (VBA, Microsoft, Redmond, WA, USA) to automate data scraping of all the Cochrane systematic reviews (CSRs) included in the study from The Cochrane Library webpage to a Microsoft Excel 2010 (Microsoft, Redmond, WA, USA) workbook. The automatic extraction of RoB tables for every eligible Cochrane review was then done with a new set of coded instructions, as in our previous studies (https://osf.io/fmjxz/) [8]. Errors during data extraction were logged and checked manually by the lead author.

During error checkup and manual search for missing data, in two separate analyses of the domain for blinding of participants and personnel [11] and for blinding of outcome assessment [12], it was noticed that there is a subgroup of Cochrane reviews which used a joint domain for blinding of participants, personnel and outcome assessors. This particular subgroup of Cochrane reviews has been marked, selected, extracted, and used for this study, and it was not a part of any past analysis. The results of the previous analyses of the domain for blinding of participants and personnel and of the blinding of outcome assessment served in this work as comparators [11, 12]. The dataset used in this work was not a part of the previous analyses.

In our previous study [8], the first author (OB) developed a specific user interface (MS Excel VBA User Form) to facilitate parsing. This interface, for filling the MS Excel table, simply helped the authors with the transformation of natural language text (comments, citations) to ordinal or nominal variables for further analysis. The interface did not, in any way, change, calculate, or suggest the decision of the authors, i.e. the decisions were made by the authors and not automated.

Pilot tests (adjustments of the tool) were done in the studies mentioned above by most experienced authors (OB, SD and MB) on samples of 500 RCTs each. These authors used the same tool in this study, with no changes in appearance or coding.

Assessment of adequacy for joint blinding domain

The Cochrane Handbook explicitly instructs authors: ‘The support for judgement provides a succinct summary from which judgements of risk of bias can be made and aims to ensure transparency in how these judgements are reached.’. These supporting comments should be sufficiently informative for making a judgment. Thus, we assessed whether Cochrane authors’ RoB judgments were supported by the comments provided by authors in RoB tables.

In the first step of assessing judgments’ adequacy we made a new assessment of RoB based on supporting comments from Cochrane reviews, based on instructions from the Cochrane Handbook. In the second step, we compared these de novo judgments with judgments published in Cochrane reviews.

The new assessment of RoB for the joint domain was made for RCTs in which Cochrane authors provided both a judgment (risk of bias is low, high, or unclear) and an accompanying comment. The only sources for these assessments were the accompanying comment from the RoB table and the description of the intervention provided by the Cochrane authors; no full texts of the primary studies were analyzed. The user interface described above was used only to display these data more clearly and to ease filling in the MS Excel table. We followed instructions for rating detection bias from the Cochrane Handbook (Sects. 8.11.2 and 8.12.2) [13] and defined four main questions that need to be correctly answered to assess the blinding bias. Question #1: who was blinded? This identifies the subjects (participants, personnel, and outcome assessors). Question #2: was blinding achieved and complete for relevant subjects? This matters because subjects have overlapping roles (e.g., participants can be self-assessors). Question #3: what was the outcome category? This identifies outcomes susceptible to bias. Question #4: can this outcome be influenced by lack of blinding? This matters because not all outcomes are equally prone to performance and detection bias. All of the authors were experienced in RoB assessment and were also clinicians (OB, senior surgeon; MB, experienced surgeon; SD, anesthesiologist), which provided the clinical expertise needed for outcome categorization.
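The four questions above can be read as a small decision rule. The following Python sketch is a simplified illustration only. It assumes the general Cochrane Handbook logic for blinding domains: low risk when blinding was ensured or when the outcome is unlikely to be influenced by lack of blinding, high risk when blinding was absent or incomplete and the outcome is likely to be influenced, and unclear when the comment gives insufficient information. The function name, the argument names, and the use of `None` for missing information are hypothetical; the study's reassessments were made by the authors themselves and were not automated.

```python
from typing import Optional


def reassess_blinding(blinding_achieved: Optional[bool],
                      outcome_susceptible: Optional[bool]) -> str:
    """Return 'low', 'high', or 'unclear' for one supporting comment.

    blinding_achieved   -- summary of Q#1 and Q#2 (None = not reported)
    outcome_susceptible -- summary of Q#3 and Q#4 (None = cannot be determined)
    """
    # Blinding of the relevant individuals was achieved and complete.
    if blinding_achieved is True:
        return "low"
    # The outcome is unlikely to be influenced, whatever the blinding status.
    if outcome_susceptible is False:
        return "low"
    # No (or incomplete) blinding and an outcome that can be influenced.
    if blinding_achieved is False and outcome_susceptible is True:
        return "high"
    # Anything else lacks the information needed for a firm judgment.
    return "unclear"


# Hypothetical example: an unblinded trial with a patient-reported outcome.
print(reassess_blinding(blinding_achieved=False, outcome_susceptible=True))
```

A vague “double-blind” comment that does not say who was blinded would correspond to `blinding_achieved=None` in this sketch and, for a susceptible outcome, to an unclear judgment.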

Two authors (MB, SD) reassessed the RoB, each for their respective half of the sample. Because of the redundancy of the questions (Q#1 vs Q#3 and Q#2 vs Q#4), the lead author checked for discrepancies and eventually corrected the assessment in about 20% of the cases. Lastly, we compared our new RoB assessments with the assessments made by the Cochrane authors. We termed the proportion of RoB assessments by Cochrane authors that matched the reassessment adhering to the Cochrane Handbook “adequacy”. The opposite term, inadequacy, does not necessarily mean that the original judgment is incorrect, but simply that it is not justified by the supporting comment [14, 15].
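As an illustration of this definition, adequacy can be computed as the share of original judgments that coincide with the reassessed ones. The sketch below is in Python with made-up judgments for five trials; the function name and the data are hypothetical and are not taken from the study dataset.

```python
def adequacy(original, reassessed):
    """Proportion of original judgments that match the reassessment."""
    if len(original) != len(reassessed):
        raise ValueError("both lists must hold one judgment per trial")
    if not original:
        raise ValueError("at least one judgment is required")
    matches = sum(o == r for o, r in zip(original, reassessed))
    return matches / len(original)


# Hypothetical judgments for five trials (not study data).
cochrane = ["low", "low", "high", "unclear", "low"]
de_novo = ["low", "unclear", "high", "unclear", "high"]

print(f"adequacy = {adequacy(cochrane, de_novo):.0%}")  # adequacy = 60%
```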

Primary outcomes

RoB judgments, for the joint blinding domain, assigned by Cochrane authors were analyzed by number and adequacy (the proportion of judgments adhering to the Cochrane Handbook in all reassessed judgments). The definition standard in our assessment was the Cochrane Handbook, as specified in Table 8.4.d [16]. We considered that Cochrane authors' judgment was inadequate if it did not completely adhere to the Cochrane Handbook guidance (based on answers if blinding was achieved and whether the outcome was susceptible to bias). We compared adequacy in this joint domain to adequacy for the two individual domains – i.e., blinding of participants and personnel domain, and blinding of outcome assessor domain, based on results from our past works [11, 12].

Secondary outcomes

We analyzed the distribution of different types of outcomes (i.e., proportions of types of outcomes in all assessed judgments) in the performance bias domain, detection bias domain, and the joint domain. Initially, our user interface offered a variety of pre-specified outcomes: all outcomes, not specified, objective (e.g., lab results, mortality, overall survival), outcomes rated/related/reported (RRR) by the clinician (e.g., complications such as occurrence of wound infection, adverse events such as pulmonary embolism, assessor/clinician related such as eye background description), patient-rated/related/reported or patient RRR (e.g., private phenomena such as the presence of fear, behavioral) and subjective in general. Due to the overlap of characteristics of some types of outcomes and relatively small numbers, we used and analyzed a reduced list of outcomes: (i) all outcomes or not specified, (ii) objective outcomes (or subject independent), and (iii) subjective outcomes (including both clinician RRR and patient RRR). “Not specified” outcomes were the ones with a cell left blank in the RoB table (by default meaning all outcomes when entered through the RevMan interface) and were thus grouped with “all outcomes”. We also compared the distribution of severity of reassessed judgments (low, unclear, or high) for all three domains (performance bias domain, detection bias domain, and joint blinding domain).

Apart from the analysis of the whole joint blinding domain and comparison between two separate standard blinding domains, we performed analyses of subsamples when the joint domain for performance and detection bias was split into multiple subdomains according to the various outcomes. Here, we compared distribution (judgments of high, unclear, or low risk of bias) and adequacy of judgments to the whole sample.

Statistics

We presented descriptive data as frequencies and percentages. We used type I error α = 0.05 and type II error β = 0.2 for all statistical tests. Statistical analyses were performed using MedCalc for Windows, version 12.5.0.0 (MedCalc Software, Ostend, Belgium). The Kolmogorov–Smirnov test was used to assess normality for all the datasets. For comparison of independent samples of non-parametric data, the Mann–Whitney test was used, and the Wilcoxon test was used for paired samples. A chi-squared test was used to assess the difference in proportions. Tukey fences were used to identify suspected outliers. Hypotheses, outcome measures, statistical tests used, and results are logged in Supplementary file 2.
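For readers unfamiliar with Tukey fences, the following Python sketch shows the usual rule: values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are flagged as suspected outliers. It is an illustration under stated assumptions rather than the MedCalc procedure used in the study; the multiplier of 1.5, the quartile method ("inclusive" in Python's `statistics.quantiles`), and the example values are all assumptions.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def suspected_outliers(values, k=1.5):
    """Return the values lying outside the Tukey fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical counts of judgments per review (not study data).
counts = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(tukey_fences(counts))
print(suspected_outliers(counts))
```

Different quartile conventions shift the fences slightly, so borderline values may be flagged by one statistical package and not by another.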

Results

The analysis was conducted on 729 Cochrane reviews, with 10,527 included trials. There were a total of 6918 assessments for performance bias, 8656 for detection bias, and 3169 for the joint domain (Fig. 1, Table 1). Only 28 studies appeared in multiple reviews for the joint domain with a total of 57 judgments.

Fig. 1 Flow diagram of the progress through the phases of the study and our previous studies

Table 1 Risk of bias (RoB) judgment adequacy in Cochrane reviews using two separate blinding domains compared to the joint domain

Primary outcome

The overall frequency of adequate assessments (the Cochrane authors' assessment matching to that of the assessors in the present evaluation, thus adhering to the Cochrane Handbook) was the lowest (59%; 1860/3169) in the joint domain (Table 1). This was significantly lower compared to 74% (5089/6918) for the performance bias domain (p < 0.0001) and versus 78% (6747/8656) for detection bias domain (p < 0.0001, Table 1, Supplementary file 2).
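The comparison of proportions above can be reproduced approximately from the reported counts. The Python sketch below applies a plain Pearson chi-squared test (1 degree of freedom, no continuity correction) to a 2 × 2 table. This is an assumption: the exact variant implemented in MedCalc is not stated in the text, so the statistic may differ somewhat from the one logged in Supplementary file 2. The function name is hypothetical; the counts are those reported in Table 1.

```python
import math


def chi2_two_proportions(x1, n1, x2, n2):
    """Pearson chi-squared test for two proportions x1/n1 and x2/n2."""
    table = [[x1, n1 - x1], [x2, n2 - x2]]
    row_totals = [n1, n2]
    col_totals = [x1 + x2, (n1 - x1) + (n2 - x2)]
    total = n1 + n2
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Adequate judgments / all judgments, as reported in the text.
joint = (1860, 3169)
separate = {"performance": (5089, 6918), "detection": (6747, 8656)}

for name, (x, n) in separate.items():
    chi2, p = chi2_two_proportions(*joint, x, n)
    print(f"joint {joint[0] / joint[1]:.0%} vs {name} {x / n:.0%}: "
          f"chi2 = {chi2:.1f}, p < 0.0001: {p < 0.0001}")
```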

Secondary outcomes

Similar distribution of types of outcomes (subjective / objective / all) that authors specified was found for detection bias domain (13% / 5% / 82%) and joint blinding domain (14% / 3% / 83%, p = 0.358; Table 1). The distribution of reassessed judgments (high / low / unclear) differed through all three domains: joint blinding domain (29% / 15% / 54%) vs performance bias domain (41% / 16% / 43%) vs detection bias domain (20% / 26% / 54%); (p < 0.05; Table 1, overall row). In all of the three domains, the lowest frequency of adequate assessments was found when Cochrane authors made the judgment of low risk – 47% in performance bias, 62% in detection bias and 31% in the joint domain (Table 1, adequacy column).

Similar to our analyses in previous works, this analysis yielded a ‘worse’ RoB judgment in 1046 (32.4%) of those trials (i.e., the judgment changed from originally low to unclear, or unclear to high), and a ‘better’ RoB judgment in 273 (8.5%) trials (i.e., the judgment changed from originally unclear to low, or high to unclear), as shown in Table 2. We found that 198 (21.2%) of the high-risk judgments made by Cochrane authors were reassessed as unclear or low, while 238 (23.6%) of the assigned unclear-risk judgments were reassessed as either high or low risk. Two-thirds of the judgments that assigned low RoB for the joint domain (883; 68.8%) were reassessed as unclear or high risk.
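The ‘worse’ and ‘better’ labels above follow from ordering the three judgments by severity. The Python sketch below shows one way to tally such changes. It assumes the ordering low < unclear < high and, as a further assumption not spelled out in the text, treats any move toward higher risk (including low to high) as ‘worse’. The names and the example pairs are hypothetical.

```python
from collections import Counter

# Assumed severity ordering of the three RoB judgments.
SEVERITY = {"low": 0, "unclear": 1, "high": 2}


def direction_of_change(original, reassessed):
    """Classify a reassessment as 'worse', 'better', or 'unchanged'."""
    diff = SEVERITY[reassessed] - SEVERITY[original]
    if diff > 0:
        return "worse"
    if diff < 0:
        return "better"
    return "unchanged"


# Hypothetical (original, reassessed) pairs, not study data.
pairs = [("low", "unclear"), ("low", "low"), ("unclear", "high"),
         ("high", "unclear"), ("unclear", "unclear")]

tally = Counter(direction_of_change(o, r) for o, r in pairs)
print(dict(tally))
```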

Table 2 Difference in judgment provided by the Cochrane review authors and judgment in line with the Cochrane Handbook across different domains

Distribution and adequacy of judgments in the joint domain for subjective outcomes

Assessment of subjective outcomes demonstrated significantly lower adequacy in the joint blinding domain (57.3%) than in the two separate domains (performance bias domain 84.7%, p < 0.05; detection bias domain 86.9%, p < 0.05); see Tables 1 and 3. In-depth analysis of the assessments showed that the highest number of inadequate judgments occurred in the subgroup of clinician RRR outcomes, which accounted for 56% (N = 111) of the 187 inadequate judgments in the subjective outcomes group (Table 3). Furthermore, inadequate assessments were most common with judgments of low risk of bias (56/187, 30%) (Table 3, inadequate judgments column).

Table 3 Distribution and adequacy of judgments in the joint blinding domain for outcomes classified as subjective

Distribution and adequacy of judgments when the joint blinding domain is split according to various outcomes

Distribution of categories of outcomes in the whole joint blinding domain (3169 judgments: 83% all outcomes, 3% objective, 14% subjective) and its subsample of trials with domain split according to the type of outcome (N = 251 trials, N = 620 judgments, all outcomes 40%, objective 12%, subjective 48%) was significantly different (p < 0.05; Table 4). In this subsample, the percentage of adequate judgments for all or not specified outcomes was 40% compared to 83% in the whole sample (p < 0.0001; Table 4). Out of these 251 trials, 168 (accounting for 416 judgments) had the risk of detection bias judgment identical within all of their split outcomes (meaning in a single trial, all of the RoB judgments were of the same level: all high, all low, or all unclear). This subsample (Table 4) showed lower adequacy of judgments (44% vs. 59%, p < 0.05) than the whole sample. On the other hand, judgments in the rest of the trials which judged the risk of detection bias differently were as (in)adequate as in the whole sample (58% vs 59%, p = 0.978).

Table 4 Distribution and adequacy of judgments when the joint domain is split according to various outcomes

Discussion

The main finding of this study is that adequacy of RoB judgments about blinding in Cochrane reviews was better when Cochrane authors judged blinding of the key participants in two separate domains (i.e., one domain for participants and personnel, and another domain for outcome assessors), compared to one joint domain for all those three groups of individuals.

Separate domains force the review authors to provide a separate judgment for different groups of individuals; thus, assessments become more precise with split domains. This separation is relevant because there are specific difficulties in blinding different groups [8]. Blinding of personnel may be problematic, usually because of the type of intervention. Also, participants may not be only passive recipients of interventions; they are often self-assessors of outcomes when patient-reported outcomes are used. Thus, only when the outcome is objective does a lack of blinding of specific individuals involved with a trial not lead to a high risk of bias.

Sometimes Cochrane reviewers specified the type of outcome for which the domain was judged, i.e., whether they considered an outcome objective or subjective. The distribution of assessments according to different types of outcomes demonstrated certain similarities between the joint domain and the detection bias domain. Both domains had a very low rate of specified outcomes (17% joint blinding, 18% detection bias domain), but in contrast to the performance bias domain (with less than 5%), this might be seen as a success. We might conclude that much more effort has to be introduced to identify outcomes susceptible to bias related to blinding in trials, as well as taking care of proper blinding of the subjects.

In our previous study [8], assessments of outcomes defined by Cochrane authors as subjective were significantly more often accurate than outcomes in general. This was due to the relatively high proportion of judgments for “high risk” that were highly accurate. Increased adequacy (objective 81% vs. subjective 57%) came from better precision in the definition of objective outcomes. However, less adequate assessments of subjective judgments did not originate from the distribution of risk judgments, which did not differ from the detection bias domain, as stated before.

Clinician-related outcomes that were judged with low risk by Cochrane authors contributed the most to inadequate judgments. Among these, the majority defined such an outcome as objective, even though it was not (e.g., completeness of treatment or established clinical test rating), a problem linked to the detection bias domain. Some RoB tables did not have enough detail in supporting comment, e.g., they used a vague “double-blind” comment without specifying who exactly was blinded, which is a frequent explanation that reviewers use when describing their rationale for assessing the performance bias domain. This likely stems from the primary studies, where the usage of the term “double-blind” without any further details about blinding of key individuals is widespread; however, it has been shown repeatedly that the term is ambiguous and that it means different things to different researchers [17,18,19]. Thus, it is recommended that trialists should not use the term “double-blind”, but instead report transparently who exactly was blinded in a trial.

Additionally, we found that Cochrane authors sometimes split the joint domain into multiple subdomains for different outcomes. While this approach may be considered more transparent regarding different types of outcomes (showing, for example, separate judgments for subjective vs objective outcomes), such reviews had much worse results in terms of RoB adequacy. Therefore, we have demonstrated that splitting a joint blinding domain only according to the outcomes is not a preferable solution. Splitting (i.e., providing more granular information) should be used based on the different groups of individuals (three separate domains for judging whether participants, personnel and outcome assessors were blinded) and the susceptibility of an outcome to be influenced by knowledge of intervention received, such as in RoB 2.

Cochrane methods are continuously evolving. Our findings indicate that the decision to split the domain about blinding into two separate domains was justified, as the adequacy of judgment was better in separate domains. This is easily understood, as the joint domain refers to multiple groups of participants, and therefore it may be unclear how Cochrane authors are judging RoB related to blinding in domains covering more than one group of participants. For this reason, we hypothesize that the decision to split further the assessment of RoB related to blinding to three assessments in the RoB tool 2 will prove to be even more advantageous for accurate assessments [6]. However, this hypothesis will need to be tested in the future, as the RoB tool 2 is still in its implementation phase, and Cochrane authors are still not obliged to use it.

Strengths and limitations

Our study's strength is that we have analyzed a large number of Cochrane reviews with more than ten thousand trials included. We have focused on Cochrane reviews because the use of Cochrane methods, i.e., the Cochrane RoB tool, is mandatory in Cochrane reviews, but our results are also relevant for non-Cochrane systematic reviews. Although the majority of non-Cochrane reviews do not report on RoB [20, 21], when they do, their reporting is sub-optimal [1, 22], and their authors also use the Cochrane RoB tool inadequately [22].

There are also some potential limitations to our work. Firstly, even though we prepared a study protocol before commencing this study, we did not publish the study protocol, as there is still no requirement in the international community for publishing protocols of studies other than clinical trials. However, we are aware that publishing the study protocol prospectively could be important for readers for appraising the risk of selective reporting and any other biases that may have occurred due to changes to the protocol during the study.

Additionally, the assessments made in the original Cochrane reviews and our reassessments may differ in the data available to the assessors, as the Cochrane authors appraised reports of the included RCTs and might have contacted trial authors for clarifications. For this reassessment, we relied on comments provided by Cochrane authors in RoB tables. Cochrane authors should provide informative comments to explain the rationale for their judgments as instructed by the Cochrane Handbook. If the authors did not report all the key information transparently in the supporting comment, the judgment might not be sufficiently justified. The concept of adequacy, used in this study, might still be subjective because it was ultimately determined by the authors of this manuscript (although we did our best to follow guidance from the Cochrane Handbook strictly) [14, 15].

The categorization of outcomes as objective or subjective was made by our team. It needs to be emphasized that outcomes are often not fully objective or fully subjective but instead fall somewhere on the continuum between objective and subjective. It is possible that clinician input during the execution of the Cochrane reviews could have influenced the risk of bias judgments, at least partially explaining why the assessments in the reviews would be different from those undertaken in this study.

Furthermore, some may consider that blinding is not well defined in the Cochrane Handbook and that neither Cochrane authors nor our team could categorically determine whether the Cochrane Handbook criteria have been met. For this reason, we have transparently reported our judgments and rationale behind our assessments: raw data generated in this and related manuscripts can be located on the Open Science Framework project page on the link https://osf.io/fmjxz/.

It could also be argued that blinding is a poorly defined construct. For example, blinding could be a property of the trial methods (in which case assessment of blinding would involve assessing the presence/adequacy of the placebo or sham), but also it can be manifested in the knowledge or beliefs of key individuals about the allocation of interventions; in the latter case evaluation of blinding would involve assessing knowledge or beliefs of the key individuals about the allocation [23].

In this study, we analyzed Cochrane reviews published within a limited date range from July 2015 to June 2016. However, we have no reason to believe that the results would be different if we had used a more extended period after June 2016. We did not choose a period earlier than July 2015 because the analyzed Cochrane RoB tool was published in 2011, and we considered it essential to leave out the first few years after its publication to allow Cochrane reviewers to adopt the new methodology. Regarding the inclusion of a higher number of more recently published Cochrane reviews, we have evidence from our recent methodological study that this is not needed [24]. In that study, we initially analyzed 768 Cochrane reviews that were published in 2015 and 2016. At the editors' request, we expanded our eligibility criteria by two more years, up to the year 2018. However, our subsequent analysis indicated no difference in our results at all, despite doubling the number of included Cochrane reviews and expanding our eligibility period from one to three years [24]. Additionally, there are no uniform guidelines regarding search periods in methodological studies, and it has been suggested that extended periods should be considered when some significant changes can be expected [25]. Thus, we argue that our data are relevant, considering the eligibility criteria we used.

Conclusion

Our results indicate that splitting the joint RoB domain about blinding key individuals into two separate domains was justified. Cochrane authors more frequently made adequate judgments in separate domains for blinding. We anticipate that this should result in an even higher adequacy of judgments in the Cochrane RoB 2 tool, but this will need to be confirmed after its full implementation in Cochrane reviews.