Background

The peer review process is a cornerstone of efforts to improve the quality of scientific publications [1–3]. It is used by most scientific journals to inform editors’ decisions and to improve the quality of published reports [4]. Worldwide, peer review costs an estimated £1.9 billion annually and accounts for about one-quarter of the overall costs of scholarly publishing and distribution [5, 6]. Despite this huge investment, the primary functions of peer reviewers are poorly defined [7], and the impact and benefit of peer review are increasingly questioned [8–13]. In particular, studies have shown that peer reviewers were not able to appropriately detect errors [14, 15], improve the completeness of reporting [16], or decrease the distortion of study results [17–19]. Some interventions have been developed and implemented by editors to improve the quality of peer review [4, 20–22]. In 2007, Jefferson et al. [23] published a systematic review evaluating the effect of processes in editorial peer review, based on a search performed in 2005. The authors included prospective and retrospective comparative studies and concluded that little empirical evidence was available to support the use of editorial peer review as a mechanism to improve quality in biomedical publications. To our knowledge, no more recent systematic review, covering studies published over the last 10 years, has been published.

We aimed to perform a systematic review and a meta-analysis of all randomized controlled trials (RCTs) evaluating the impact of interventions to improve the quality of peer review in biomedical journals.

Methods

Search strategy

We searched the Cochrane Central Register of Controlled Trials (CENTRAL; June 2015, Issue 6), MEDLINE (via PubMed), Embase, and the WHO International Clinical Trials Registry Platform for all reports of RCTs evaluating the impact of interventions aiming to improve the quality of peer review in biomedical journals (last search: June 15, 2015). Biomedical research is defined as the area of science devoted to the study of the processes of life, the prevention and treatment of disease, and the genetic and environmental factors related to disease and health. We also searched the Cochrane Database of Systematic Reviews to identify systematic reviews on the peer review process. We applied no restriction on language or date of publication. Our search strategy relied on the Cochrane Highly Sensitive Search Strategies [24] and the search term “peer review”. We hand-searched the reference lists of reports and reviews dedicated to the peer review process identified during screening.

Study selection

Two researchers (RB, AC) independently screened all retrieved citations by using Resyweb software. We obtained and independently examined the full-text article for each citation selected for possible inclusion. Any disagreement was discussed with a third researcher until consensus was reached.

We included RCTs, whatever the unit of randomization (manuscript or peer reviewers), evaluating any interventions aimed at improving the quality of peer review for biomedical publications regardless of publication language. We excluded RCTs evaluating the presence and effect of peer reviewer bias on the outcome, such as positive outcome bias. Non-randomized studies were excluded. Duplicate publications of the same study were collated for each unique trial.

Interventions

We pre-specified the categorization of the interventions evaluated as follows [4]:

  1) Training, which included training or mentoring programs for peer reviewers to provide instructional support for appropriately evaluating important components of manuscript submissions. These interventions directly target the ability of peer reviewers to appropriately evaluate the quality of manuscripts.

  2) Addition of peer reviewers for specific tasks or with specific expertise, such as adding a statistical peer reviewer, whose main task is to detect the misuse of methods or misreporting of statistical analyses.

  3) Peer reviewers’ use of a checklist, such as reporting guideline checklists, to evaluate the quality of the manuscript.

  4) “Open” peer review process, whereby peer reviewers are informed that their name would be revealed to the authors, other peer reviewers, and/or the public. The purpose of these approaches is to increase transparency and thus increase the accountability of peer reviewers to produce good-quality peer reviews.

  5) “Blinded”/masked peer review, whereby peer reviewers are blinded to author names and affiliations. Author names and/or potentially identifying credentials are removed from manuscripts sent for peer review so as to remove or minimize peer reviewer biases that arise from knowledge of and assumptions about author identities.

  6) Other interventions. Any other types of interventions identified were secondarily classified.

Outcomes

The outcomes were as follows:

  1) Final manuscript quality (i.e., the revised manuscript after all peer review assessments and rounds). Several scales were used to assess manuscript quality. The Manuscript Quality Assessment Instrument, a 34-item scale with each item scored from 1 to 5 [25], aims to evaluate the quality of the research report (i.e., whether the authors have described their research in enough detail and with sufficient clarity that a reader could make an independent judgment about the strengths and weaknesses of their data and conclusions), not the quality of the research itself. This scale is organized along the same dimensions as a journal article (Title and Abstract, Introduction, Methods, Results, Discussion, Conclusions, and General evaluation). Other scales measured the completeness of reporting [26] based on reporting guideline items (e.g., CONSORT items). A high score indicates a high-quality manuscript.

  2) Quality of the peer review report, measured by scales such as the Review Quality Instrument [27] or editors’ routine quality rating scales [28]. A high score indicates a high-quality peer review report. Various versions of the Review Quality Instrument exist [27, 29]. The latest version is based on eight items: importance, originality, methodology, presentation, constructiveness of comments, substantiation of comments, and interpretation of results, plus a global item. Each item is scored on a 5-point scale. This version had a high level of agreement (K = 0.83) and an inter-rater reliability of K = 0.31 for the mean total score. No evidence of floor or ceiling effects was found.

  3) Rejection rate (i.e., recommendation by peer reviewers about whether the reviewed manuscript should be rejected for publication in the journal).

  4) Time spent on the peer review, as reported by peer reviewers. This time is important for evaluating the burden and possible cost of the peer review process.

  5) Overall duration of the peer review process.

All outcomes were pre-specified except the time spent on the peer review, which was added after inspection of the studies revealed RCTs evaluating interventions to improve the speed of the peer review process.

Data extraction

Data extraction was performed independently by two researchers who used a standardized data extraction form (Additional file 1). When assessment differed, the item was discussed and consensus was reached. When needed, a third reviewer assessed the report to achieve consensus.

General characteristics of the study recorded included the journal publication, study design, unit of randomization (peer reviewer or manuscripts or both), eligibility criteria for peer reviewers and for manuscripts, number of peer reviewers and/or manuscripts randomized and analyzed, and interventions compared.

The risk of bias within each RCT was assessed for the following domains of the Cochrane Collaboration risk of bias tool [30]: selection bias (methods for random sequence generation and allocation concealment), detection bias (blinding of outcome assessors), and attrition bias (incomplete outcome data). We did not assess performance bias (blinding of participants and providers) because blinding was never possible in this context. Each domain was rated as low, high, or unclear risk of bias, according to the guidelines [30]. As recommended by the Cochrane Collaboration, the blinding of outcome assessors and incomplete outcome data domains were assessed at the outcome level. When the risk of bias was unclear, we contacted the authors by email to request more information. When the peer reviewers participating in an RCT were not aware of the study hypothesis or of the outcome assessed, we considered the risk of detection bias to be low.

We independently recorded data for each outcome. To avoid selective inclusion of results in systematic reviews, when several results were reported for the selected outcomes (e.g., several scales used to assess manuscript quality or several time points used to assess the quality of the peer review), we retrieved the results for the outcome reported as the primary outcome in the manuscript or at the final or latest time of assessment.

For continuous outcomes (i.e., the quality score of the final manuscript, the quality of the peer review report, and the time peer reviewers spent on the peer review), we recorded the mean and standard deviation per arm. If data were not provided as a mean and standard deviation, we derived these statistics from the reported data following the Cochrane Handbook recommendations [31].
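For illustration, here is a minimal Python sketch of two standard conversions of the kind described in the Cochrane Handbook: recovering a standard deviation from a reported standard error, or from a 95 % confidence interval around a mean. All numbers are hypothetical, not data from the included trials.

```python
import math

def sd_from_se(se: float, n: int) -> float:
    """Recover the standard deviation of one arm from its standard error."""
    return se * math.sqrt(n)

def sd_from_ci(lower: float, upper: float, n: int, z: float = 1.96) -> float:
    """Recover the SD from a 95 % confidence interval around a mean:
    for large samples, the CI spans 2*z standard errors."""
    se = (upper - lower) / (2 * z)
    return sd_from_se(se, n)

# Hypothetical arm: 60 peer reviewers, mean quality score 3.1, 95 % CI 2.8 to 3.4.
print(round(sd_from_ci(2.8, 3.4, n=60), 2))  # -> 1.19
```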

For dichotomous outcomes (i.e., rejection rate), we recorded the number of manuscripts evaluated and the number of manuscripts for which peer reviewers recommended rejection.

For studies with multiple intervention groups (e.g., blinded peer review and open peer review), we followed the Cochrane recommendations [31] and, when appropriate, we combined all relevant intervention groups for the study into a single group, combined all relevant control intervention groups into a single control group, or selected one pair of interventions and excluded the others.
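As a concrete illustration of the group-combination step, the following Python sketch implements the Cochrane Handbook formula (Table 7.7.a) for merging two arms into a single group; the summary statistics in the example are hypothetical.

```python
import math

def combine_groups(n1, m1, sd1, n2, m2, sd2):
    """Combine two arms into one group (Cochrane Handbook, Table 7.7.a)."""
    n = n1 + n2
    mean = (n1 * m1 + n2 * m2) / n
    var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2
           + (n1 * n2 / n) * (m1 - m2)**2) / (n - 1)
    return n, mean, math.sqrt(var)

# Example: merging two training arms (e.g., self-taught and face-to-face)
# into a single intervention group; numbers are hypothetical.
print(combine_groups(40, 3.2, 0.9, 38, 3.0, 1.1))  # -> (78, 3.10..., 1.00...)
```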

When required, we extracted outcome data from published figures by using DigitizeIt [32]. We contacted authors when key data were missing.

Publication bias and small-study effects were assessed by visual inspection of funnel plots when appropriate.
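A funnel plot of this kind can be drawn from the per-study effect estimates and standard errors. The following Python sketch shows the conventional layout, with larger studies (smaller standard errors) at the top; the effect sizes are hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical per-study effect sizes (SMDs) and standard errors.
smd = [0.14, 0.25, 0.05, 0.40, -0.10]
se = [0.05, 0.12, 0.08, 0.20, 0.15]

fig, ax = plt.subplots()
ax.scatter(smd, se)
ax.axvline(0.14, linestyle="--")  # pooled estimate, for visual reference
ax.invert_yaxis()                 # larger (more precise) studies at top
ax.set_xlabel("Standardized mean difference")
ax.set_ylabel("Standard error")
ax.set_title("Funnel plot: asymmetry suggests small-study effects")
plt.show()
```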

Data synthesis

Treatment effect measures were odds ratios (ORs) for binary outcomes (rejection rate), standardized mean differences (SMDs) for continuous outcomes (quality score of the final manuscript and quality of the peer review report), and mean differences (MDs) for the time peer reviewers spent on the peer review. Because RCTs measured the quality of the manuscript and of the peer review in various ways (using different scoring systems), we measured these intervention effects by using SMDs.
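To make these effect measures concrete, here is a minimal Python sketch computing a per-study SMD (Hedges’ g, one common SMD variant) and an OR from arm-level summaries; all input numbers are hypothetical.

```python
import math

def hedges_g(n1, m1, sd1, n2, m2, sd2):
    """Standardized mean difference with Hedges' small-sample correction."""
    df = n1 + n2 - 2
    s_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)
    d = (m1 - m2) / s_pooled
    j = 1 - 3 / (4 * df - 1)  # small-sample correction factor
    g = j * d
    # Approximate variance of g (large-sample formula)
    var = j**2 * ((n1 + n2) / (n1 * n2) + d**2 / (2 * df))
    return g, var

def odds_ratio(events1, n1, events2, n2):
    """Odds ratio (e.g., for rejection recommendations), with log-OR variance."""
    a, b = events1, n1 - events1
    c, d = events2, n2 - events2
    or_ = (a * d) / (b * c)
    var_log_or = 1 / a + 1 / b + 1 / c + 1 / d
    return or_, var_log_or

# Hypothetical arm-level data: intervention vs. usual process.
print(hedges_g(60, 3.4, 1.0, 58, 3.1, 1.1))  # (g, variance of g)
print(odds_ratio(12, 60, 20, 58))            # (OR ~ 0.48, var of log OR)
```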

Because of the diversity of the interventions and settings, we used random-effects meta-analysis (DerSimonian and Laird model). We assessed heterogeneity by visual inspection of the forest plots, Cochran’s Q homogeneity test, and the I2 statistic [33]. In the meta-analysis of RCTs assessing open peer review, we performed one subgroup analysis to evaluate the impact of different intensities of open peer review (open to other peer reviewers/open to authors/publicly available). Analyses involved use of Review Manager 5.3 (Cochrane Collaboration).
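For readers who want the mechanics, the following Python sketch implements the DerSimonian and Laird procedure and the associated Q and I2 statistics; it is one way to reproduce the kind of pooled estimates reported below, not the authors’ actual analysis (which used Review Manager). The effect sizes and variances in the example are hypothetical.

```python
import math

def dersimonian_laird(effects, variances):
    """Random-effects pooling (DerSimonian-Laird), returning the pooled
    estimate, its 95 % CI, the between-study variance tau^2, and I^2."""
    w = [1 / v for v in variances]
    y_fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - y_fixed)**2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)          # method-of-moments estimator
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    ci = (pooled - 1.96 * se, pooled + 1.96 * se)
    return pooled, ci, tau2, i2

# Hypothetical per-study SMDs and their variances.
print(dersimonian_laird([0.14, 0.25, 0.05], [0.01, 0.04, 0.02]))
```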

RCTs assessing training interventions reported a mixture of change from baseline and final value scores regarding the quality of the peer review report. As recommended, we did not combine the final value and change scores together as SMDs [34].

Results

Study selection and characteristics of included studies

Figure 1 provides the flow diagram for study selection. From the 4592 citations retrieved, we selected 21 published articles reporting 22 RCTs and 25 comparisons, evaluating training interventions (n = 5), the addition of a statistical peer reviewer (n = 2), the use of a checklist (n = 2), open peer review (n = 7), blinded peer review (n = 6), and other interventions to increase the speed of the peer review process (n = 3). Only seven RCTs have been published since 2004. Searching the WHO trial registry platform did not identify any RCT.

Fig. 1 Study selection flow diagram

The characteristics of the selected studies are presented in Table 1. The description of the interventions assessed in each RCT is provided in Additional file 2.

Table 1 General characteristics of the trials included

The 22 RCTs were published in 12 different journals. The study design varied according to the interventions evaluated and consisted of randomizing peer reviewers (n = 15), manuscripts (n = 6), or both (n = 4). The manuscripts used in the RCTs were submitted by authors through the usual process (n = 20) or were fictitious manuscripts specifically prepared for the study with errors added (n = 2).

Data synthesis

Training

Four RCTs, involving five comparisons and 616 peer reviewers, compared the impact of training interventions versus the standard peer review procedure of the given journal on the quality of the peer review report [35–38]. Four comparisons at the Annals of Emergency Medicine [36–38] evaluated feedback (n = 2), training (n = 1), and mentoring (n = 1), and one comparison at the British Medical Journal evaluated training [35]. Different kinds of training were evaluated: structured workshops with journal editors or self-taught training with a package created especially for peer reviewers. Mentoring consisted of new peer reviewers discussing their review with a senior peer reviewer before sending the review. Feedback consisted of peer reviewers receiving a copy of the editor’s rating of their review. One study evaluated two kinds of training, self-taught or face-to-face [35]; for this study, we pooled the two training arms and compared them with the usual process. The risk of attrition bias was high for four RCTs (Additional file 3). Training had no impact on the quality of the peer review report, but the heterogeneity across studies was high (I2 = 66.2 %) and the 95 % confidence interval (95 % CI) relatively large (Fig. 2). Only one study assessed the time to perform the peer review and the rejection rate with training [35]: training did not affect the time to review manuscripts (mean (SD), 116.9 (69.4) min for training vs. 108.5 (70.5) min for the usual process; P = 0.38) but did increase recommendations to reject manuscripts (OR, 0.47; 95 % CI, 0.42 to 0.78).

Fig. 2 Impact of training versus usual process: standardized mean difference (SMD) of the quality of the peer review report

Addition of a statistical peer reviewer

Two RCTs examined the impact of adding a statistical peer reviewer on the final manuscript quality [39, 40]. Both RCTs involved the Medicina Clínica journal (a Spanish journal of internal medicine) and were conducted by the same research team. The RCT by Cobo et al. [40] used a 2 × 2 factorial design comparing the addition of a statistical peer reviewer, the use of a checklist, both interventions, and the usual process. For this analysis, we used the arms dedicated to the usual process and to the addition of a statistical peer reviewer. The risk of bias was low for these two studies (Additional file 3). Overall, 105 manuscripts were included in the meta-analysis. Adding a statistical peer reviewer increased the quality of the final manuscript as compared with the usual process (combined SMD, 0.58; 95 % CI, 0.19 to 0.98; I2 = 0 %; Fig. 3).

Fig. 3 Impact of adding a statistical peer reviewer versus usual process: standardized mean difference (SMD) of the final manuscript quality

Use of a reporting guideline checklist

Two studies evaluated the impact of peer reviewers’ use of a reporting guideline checklist on the quality of the final manuscript [40, 41]. These studies were performed by the same research team and involved the Medicina Clínica journal. For the RCT by Cobo et al. [40], we used the arms dedicated to the usual process and to the addition of a checklist. The risk of bias was low (Additional file 3). Overall, 152 manuscripts were included in the meta-analysis. Use of a checklist had no effect on the quality of the final manuscript as compared with the usual process (combined SMD, 0.19; 95 % CI, –0.22 to 0.61; I2 = 38 %; Fig. 4).

Fig. 4 Impact of using a checklist versus usual process: standardized mean difference (SMD) of the final manuscript quality

“Open” peer review

Seven RCTs assessed the impact of open review interventions [42–48], consisting of informing peer reviewers that their identity would be revealed to other peer reviewers only (n = 2), to the authors of the manuscript being reviewed (n = 3), or to the general public via web posting (n = 1). One study evaluated both open peer review and peer reviewers blinded or not to the authors’ identity [47]. One study was not included in the meta-analysis because its comparator was not the usual process used in the other RCTs [44]: it compared signed peer reviews posted on the web with signed peer reviews available to authors and other peer reviewers. A total of 1702 peer reviewers, 1252 manuscripts, and 1182 manuscripts were included in the meta-analyses of the quality of peer review reports, the rejection rate, and the time peer reviewers spent on peer review, respectively. With open peer review, the quality of the peer review report increased (combined SMD, 0.14; 95 % CI, 0.05 to 0.24; I2 = 0 %; Fig. 5a) and recommendations to reject manuscripts for publication decreased, although the heterogeneity across studies was high (OR, 0.56; 95 % CI, 0.33 to 0.94; I2 = 67 %; Fig. 5b). Subgroup analysis by type of open peer review (i.e., open to other peer reviewers or open to authors) did not show evidence of a difference. Three of the included studies evaluated the time peer reviewers spent on the peer review for each manuscript (Fig. 5c). Overall, this time did not differ between the open and standard peer review groups (combined MD, 0.18; 95 % CI, –0.06 to 0.43; I2 = 50 %). Subgroup analysis by type of open peer review likewise did not show evidence of a difference.

Fig. 5 a Impact of the “open” review interventions versus anonymous process (anonymous to reviewers, authors, or public): standardized mean difference (SMD) of the quality of the peer review report. b Impact of the “open” review interventions versus anonymous process (anonymous to reviewers, authors, or public): odds ratio (OR) of peer reviewers’ recommendation for rejection. c Impact of the “open” review interventions versus anonymous process (anonymous to reviewers, authors, or public): mean difference (MD) of the time peer reviewers spent on the peer review

Blinded peer review

Among the six RCTs evaluating blinded peer review [47–52], three evaluated the impact on the quality of the peer review report and three the rejection rate; 1024 and 564 manuscripts, respectively, were included in the meta-analyses. Blinded peer review did not affect the quality of the peer review report (SMD, 0.12; 95 % CI, –0.12 to 0.36; I2 = 68 %) or the rate of rejection (OR, 0.77; 95 % CI, 0.39 to 1.50; I2 = 69 %), but the confidence intervals were wide and the heterogeneity among studies substantial (Fig. 6a and b).

Fig. 6 a Impact of blinded peer review interventions versus usual process: standardized mean difference (SMD) of the quality of the peer review report. b Impact of blinded peer review interventions versus usual process: odds ratio (OR) of peer reviewers’ recommendation for rejection

Interventions to accelerate the peer review process

Three RCTs evaluated interventions to shorten the duration of the peer review process [53–55]. The interventions consisted of first asking peer reviewers whether they agreed to review the manuscript versus directly sending the manuscript for review [53], or telephoning peer reviewers versus sending the manuscript [55]. In the third RCT, the mean duration of the peer review process was decreased with an intervention based on early screening of manuscripts by editors versus the usual process [54]. We could not perform a meta-analysis because outcomes were incompletely reported.

Discussion

Despite the essential role of peer review in the biomedical research enterprise, our systematic review identified only 22 RCTs evaluating interventions to improve the quality of peer review; results of only seven RCTs were published over the past 10 years. To our knowledge, no ongoing trial is registered in trial registries. Our results provide little evidence of an effect of training peer reviewers or the use of a checklist on the quality of the peer review report and the final manuscript. However, the evidence is scarce and its methodological quality raises some concerns. Only two RCTs evaluated the impact of adding a peer reviewer with statistical expertise, and both showed favorable results on the quality of the final manuscript; however, these studies were performed in a single journal and need to be reproduced in other contexts. Open peer review, routinely implemented by several journals such as the British Medical Journal or BioMed Central journals, had a small favorable impact on the quality of the peer review report, did not affect the time peer reviewers spent on the review, and decreased recommendations to reject manuscripts. Blinded peer review (i.e., peer reviewers blinded to author names and affiliations) did not affect the rate of rejection. The goal of blinding is not to change the quality of the review but to favor objective and fair review. However, evaluating whether a review is fair and objective is a difficult, even impossible, task, which explains why researchers focused on the quality of the report and the rejection rate. Nevertheless, the lack of difference observed for these outcomes should not be interpreted as a lack of usefulness of blinding. Finally, interventions to improve the speed of the process were heterogeneous.

The first International Congress on Peer Review in Biomedical Publication was held in Chicago in 1989 [56], and aimed to stimulate research on peer review. In 2007, Jefferson et al. [23] published a Cochrane systematic review on editorial peer review and concluded that a large, well-funded program of research on the effects of editorial peer review was urgently needed. Several years later, despite some initiatives such as the European COST action PEERE [57], research involving experimental design is scarce in this field, which is alarming considering the cost and central role of this process in biomedical science.

The methodological problems in studying peer review are many and complex. First, the primary functions of peer review are poorly defined. Depending on the editor, the role of the peer reviewer varies from assessing the novelty of the study results to assessing the scientific rigor or the clinical relevance [7]. The choice of design is also difficult. In our review, the study designs involved randomizing peer reviewers to assess a sample of manuscripts submitted to a journal or fabricated manuscripts developed for the RCT, or randomizing manuscripts to be assessed by a sample of peer reviewers. The selection of journals, peer reviewers, and manuscripts will affect the applicability of the study results. Most RCTs were performed in a single journal, which limits the generalizability of results to very specific contexts. Journal culture, standard review procedures, sub-specialties, and the experience of peer reviewers all play a role in the observed changes in review quality. For example, statistical peer review was studied in only one journal (Medicina Clínica), in two RCTs conducted by the same research team; although the results are reliable, the external validity is limited.

Bias related to the lack of blinding is also problematic when designing these RCTs. Concealing study aims and procedures from study subjects is difficult, even impossible, for some of these interventions. For instance, in RCTs in which participants were randomized to review manuscripts blinded to author identity, unintentional discovery of, or speculation about, the authors’ identity was reported. Similarly, because opinions about open peer review are controversial, some studies of open peer-review interventions were subject to bias because a large number of potential participants declined to participate or did not complete the allocated peer review.

The choice of relevant outcomes is also difficult in this domain. The outcomes assessed were mainly the editor’s subjective assessment or validated scales assessing the quality of the peer review report [35–38] or the final manuscript [39–41]. However, the criteria used to judge the appropriateness of a peer review report are debatable and can vary according to the editor in charge. For example, one study showed a discrepancy between editors and peer reviewers regarding the items considered most important when assessing the report of an RCT [58]. Further, the validity of some of these outcomes (e.g., editors’ routine scales) and the minimal editorially relevant difference could be questioned. Likewise, the association between recommendations to reject and the quality of the peer review is not well understood, and the proposed goal of these interventions in terms of rejection rate is not well defined. A high rejection rate could be related to the quality of the manuscript but also to the interest of the topic or to peer reviewer subjectivity. In our review, no studies evaluated the association between a recommendation of rejection and the quality of the published study. Finally, the quality of the final manuscript is not solely the result of the peer reviewers’ assessment. Some relevant peer reviewer suggestions might not have been acted upon, either because the authors refused or because the editor did not enforce them. Of note, articles included in our systematic review evaluated the final manuscript quality after a single round of peer review. Further, for some journals, copy editing will also affect article quality.

Our systematic review has some limitations. We selected only RCTs performed in the field of biomedical research, and we cannot exclude that some interventions were developed and evaluated in other fields. We did not search some more specialized bibliographic databases such as the Cumulative Index to Nursing and Allied Health Literature (CINAHL). However, our search was extensive and performed according to Cochrane standards. The limited number of RCTs identified, their small sample sizes, their methodological quality, and their applicability limit the interpretation of our results. Finally, reporting bias can be a concern in systematic reviews, but because of the limited number of RCTs, we could not explore it. However, in this particular area, both negative and positive results are of interest to the peer-reviewed academic journal community, so it is unlikely that additional RCTs meeting our inclusion criteria were unreported or that results from the included RCTs were selectively omitted. Still, the limitations of this body of evidence preclude definitive general recommendations regarding the implementation of these interventions in peer review editorial processes for biomedical publications.

Conclusions

Given this body of evidence and its limitations, we currently cannot provide conclusive recommendations on the use of interventions to improve the quality of peer review. Even with the inclusion of RCTs published since a previous systematic review on the topic [23], the state of the evidence falls short of providing empirical support. Editorial boards and decision-makers in the publication process should remain aware of the uncertainty surrounding these interventions and weigh their shortcomings and practical challenges before implementing them. These results highlight the urgent need to clarify the goal of the peer review process, the definition of a good-quality peer review report, and the outcomes that should be used. In the longer term, they reiterate the need for additional experimentation on this topic and for exploration of the drivers of publication quality. In fact, the evidence base for recommendations needs to be strengthened: further research is needed to identify interventions that could improve the process and to assess those interventions in well-performed trials.

Abbreviations

MD, mean difference; OR, odds ratio; RCT, randomized controlled trial; SMD, standardized mean difference.