Surgical Endoscopy, Volume 26, Issue 8, pp 2275–2280

Is expert peer review obsolete? A model suggests that post-publication reader review may exceed the accuracy of traditional peer review

DOI: 10.1007/s00464-012-2171-1

Cite this article as:
Herron, D.M. Surg Endosc (2012) 26: 2275. doi:10.1007/s00464-012-2171-1

Abstract

Background

The peer review process is the gold standard by which academic manuscripts are vetted for publication. However, some investigators have raised concerns regarding its unopposed supremacy, including lack of expediency, susceptibility to editorial bias and statistical limitation due to the small number of reviewers used. Post-publication review—in which the article is assessed by the general readership of the journal instead of a small group of appointed reviewers—could potentially supplement or replace the peer-review process. In this study, we created a computer model to compare the traditional peer-review process to that of post-publication reader review.

Methods

We created a mathematical model of the manuscript review process. A hypothetical manuscript was randomly assigned a “true value” representing its intrinsic quality. We modeled a group of three expert peer reviewers and compared it to modeled groups of 10, 20, 50, or 100 reader-reviewers. Reader-reviewers were assumed to be less skillful at reviewing and were thus modeled to be only ¼ as accurate as expert reviewers. Percentage of correct assessments was calculated for each group.

Results

400,000 hypothetical manuscripts were modeled. The accuracy of the reader-reviewer group was inferior to the expert reviewer group in the 10-reviewer trial (93.24% correct vs. 97.67%, p < 0.0001) and the 20-reviewer trial (95.50% correct, p < 0.0001). However, the reader-reviewer group surpassed the expert reviewer group in accuracy when 50 or 100 reader-reviewers were used (97.92 and 99.20% respectively, p < 0.0001).

Conclusions

In a mathematical model of the peer review process, the accuracy of public reader-reviewers can surpass that of a small group of expert reviewers if the group of public reviewers is of sufficient size. Further study will be required to determine whether the mathematical assumptions of this model are valid in actual use.

Keywords

Peer review · Expert review · Post-publication review · Computer model

Academic journals universally accept the peer review process as the gold standard by which manuscripts are evaluated for publication. The perception of the process’s reliability is so pervasive that the term “peer-reviewed” is considered to be synonymous with quality. However, in recent years some investigators have raised concerns regarding its unopposed supremacy. Foremost is the concern expressed in a Cochrane review of 2002 that “editorial peer review, although widely used, is largely untested and its effects are uncertain” [1].

Many of the concerns regarding peer review stem from the small number of reviewers participating in the evaluation of any given manuscript [2]. A 1981 study of 150 proposals submitted to the National Science Foundation found so much disagreement among reviewers that grant selection “depend[ed] in a large proportion of cases upon which reviewers happened to be selected for it” [3]. A more recent study of abstract and manuscript reviewers in clinical neuroscience found that agreement amongst two or more reviewers regarding acceptability was no greater than that expected by chance [4]. A similar lack of inter-referee agreement has been identified in basic science literature [5, 6]. Perhaps such lack of agreement is to be expected, since a typical journal may use only 2–3 peer reviewers per manuscript [7].

To address these concerns, some investigators have suggested that post-publication open review—in which the quality of an article is judged by the general readership of the journal instead of a small group of reviewers—be used to supplement or replace the peer-review process [8, 9, 10]. Such an approach to the “scoring” of materials is well-accepted by the lay public, where consumers rate products, books, and media on the Internet at sites like Amazon and YouTube. Prepublication online posting is currently used in fields such as mathematics, physics and computer science through ArXiv.org, an online pre-print archive hosted by the Cornell University Library [11]. Use of a similar system in which registered subscribers to an academic medical journal serve as volunteer reviewers could potentially improve the quality of reviews by significantly increasing the number of reviewers and thus improving the statistical validity of a given manuscript’s score. On the other hand, opening the review process to the general readership, who may have little or no formal experience in manuscript evaluation, could potentially lead to low-quality or biased reviews and thus substantially degrade the evaluation process.

We hypothesized that post-publication open review will equal or exceed traditional peer-review in accuracy if the number of reviewers is high enough. To evaluate this hypothesis we created a mathematical model of the review process and compared the accuracy of a small group of simulated expert peer reviewers to that of a much larger group of non-expert “reader-reviewers.”

Methods

The model was created in Excel (Microsoft Office Excel 2007, Microsoft, Redmond, WA) (Fig. 1). All “random” numbers were generated using the RAND() pseudo-random number generator. While not providing true random numbers, the RAND() function is guaranteed not to generate repeating patterns in fewer than 10^13 cycles, and it passes the standard tests of randomness, including DIEHARD and additional tests developed by the National Institute of Standards and Technology [12].
Fig. 1

Summary of the modeling process. Each simulated manuscript was scored by three modeled expert reviewers and 10, 20, 50, or 100 modeled reader-reviewers. Scores were combined and correctness of the final acceptance decision was determined. This process was repeated 100,000 times each for the 10, 20, 50, and 100 reader-reviewer groups

A hypothetical manuscript was randomly assigned a “true score” of 1–10 points, representing 10 deciles of quality, with 1 being worst and 10 being best. Scores were linearly distributed, creating roughly equal numbers of manuscripts in each score category. The model simulated a moderately selective journal that accepts 30% of submitted manuscripts. Hence, a score of 8, 9 or 10 was required for a manuscript to be considered acceptable.
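For illustration, the true-score assignment and acceptance cutoff just described can be sketched in a few lines of Python (the published model was built in Excel; all identifiers below are illustrative rather than taken from the original spreadsheet):

import random

ACCEPT_CUTOFF = 8  # scores of 8, 9 or 10 fall within the accepted top ~30%

def draw_true_score() -> int:
    """Assign a simulated manuscript a 'true score' of 1-10, with each
    decile of quality equally likely (the linear distribution described above)."""
    return random.randint(1, 10)

def is_truly_acceptable(true_score: int) -> bool:
    """A manuscript merits acceptance when its true score meets the cutoff."""
    return true_score >= ACCEPT_CUTOFF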

Reviewer scores were modeled by taking the true score and adding two error factors: an “imprecision error” and an “other error” (Fig. 2). Imprecision error represented the inability of a reviewer to objectively assess the true quality of the manuscript, while other error accounted for skewing factors such as reviewer preconceptions and bias. Both types of error were modeled with a normal distribution; to achieve this, we used the NORMSINV() function to create a normal distribution with a predetermined standard deviation (see Footnote 1). A precise group of reviewers would be expected to have a narrow distribution with a low standard deviation, while less precise reviewers would have a wide distribution with a higher standard deviation. While it would have been mathematically identical to use a single error factor with a wider normal distribution, we chose to explicitly acknowledge that both unintentional error and intentional bias contribute to scoring error.
Fig. 2

Modeling of scores for expert reviewers and reader-reviewers. Modeled scores were generated by adding two error factors to the simulated manuscript’s true score. Imprecision error represents the failure of a reviewer to accurately assess the value of the manuscript, while other error includes unrelated sources of error such as reviewer preconceptions or bias. Both types of error were normally distributed. Expert reviewers were modeled to be four times more accurate than reader-reviewers with regard to both types of error. This was achieved by setting the modeled standard deviation of imprecision error to 0.5 point for the expert reviewer group and 2.0 points for the reader-reviewer group. Standard deviation for other error was set to 0.25 point for the expert reviewers and 1.0 point for the reader-reviewers
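Continuing the illustrative Python sketch, this scoring step might be written as follows; random.gauss stands in here for the spreadsheet’s NORMSINV(RAND()) * n construction (Footnote 1), and the clamping reflects the constraint, noted below, that all final scores fall within 1–10:

import random

def model_reviewer_score(true_score: float,
                         imprecision_sd: float,
                         other_sd: float) -> float:
    """Perturb the true score with two independent, normally distributed
    error terms: imprecision error (inability to judge the manuscript's
    true quality) and other error (preconceptions, bias and similar
    skewing factors). The result is clamped to the 1-10 scoring scale."""
    imprecision_error = random.gauss(0.0, imprecision_sd)
    other_error = random.gauss(0.0, other_sd)
    return min(10.0, max(1.0, true_score + imprecision_error + other_error))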

We modeled the standard deviation of imprecision error for the expert reviewer group at 0.5 points, implying that approximately 2/3 of expert reviewer scores would fall within half a point (one standard deviation), or 5%, of the true score. We modeled the standard deviation for other error at half that value, 0.25 points, so that approximately 2/3 of expert reviewers would have other error of 2.5% or less. We felt that these assumptions represented an extremely high level of accuracy for the expert reviewers, significantly higher than what might be expected in reality. This assumption is supported by Schroter et al.’s 2008 study [13], in which 607 reviewers at the British Medical Journal were sent 3 test papers that contained 9 major errors. On average, reviewers identified fewer than 1/3 of the errors.

In contrast, we assumed that reader-reviewers would be only one-quarter as accurate as expert reviewers with regard to both imprecision error and other error. To model this, we set the standard deviations for both error types four times higher, at 2.0 points and 1.0 point respectively. All final scores were constrained to the range of 1–10. Modeling of the reviewer scores is summarized in Fig. 2.
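Under these assumptions, the two reviewer populations differ only in the standard deviations fed to the scoring sketch above; the numerical values below are taken directly from the text, while the function names remain illustrative:

# Error standard deviations, in points on the 1-10 scale
EXPERT_IMPRECISION_SD, EXPERT_OTHER_SD = 0.5, 0.25  # expert reviewers
READER_IMPRECISION_SD, READER_OTHER_SD = 2.0, 1.0   # reader-reviewers (1/4 as accurate)

def expert_score(true_score: float) -> float:
    """One simulated expert reviewer's score for a manuscript."""
    return model_reviewer_score(true_score, EXPERT_IMPRECISION_SD, EXPERT_OTHER_SD)

def reader_score(true_score: float) -> float:
    """One simulated reader-reviewer's score for a manuscript."""
    return model_reviewer_score(true_score, READER_IMPRECISION_SD, READER_OTHER_SD)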

Determination of acceptance

Expert reviewer group

For each simulated manuscript, the expert reviewer modeling process was repeated three times to simulate three expert reviewers. The three scores were combined to reach a final acceptance decision using both the “averaging method” and the “voting method” (Fig. 3). In the averaging method, the mean of the three reviewer scores was calculated, and the manuscript was accepted if the mean score was equal to or above the cutoff score of 8. In the voting method, the manuscript was accepted if a minimum number of simulated reviewers gave it an acceptable score (i.e. ≥8 points). For the voting method, we empirically determined the optimal minimum number of votes that would maximize the accuracy of the final acceptance decision. A final decision to accept was considered “correct” if the manuscript’s true score was equal to or above the cutoff value (8 points) and “incorrect” otherwise. Conversely, a decision to reject was considered correct if the true score was beneath the cutoff score and incorrect otherwise.
Fig. 3

The two methods used to combine modeled reviewer scores to make a final decision to accept or reject. In the averaging method, the mean of all the reviewers’ scores is calculated and used to make the decision. In the voting method, each reviewer contributes an “accept” or “reject” vote based on their individual scoring; a minimum number of “accept” votes is required for a final decision to accept (in this example, 1). In the example depicted above, the averaging method leads to a correct decision to reject this manuscript since the true score is only 7, beneath the cutoff of 8. The voting method incorrectly leads to a decision to accept
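The two decision rules lend themselves to a compact expression; as before, this is an illustrative Python sketch rather than the original spreadsheet logic:

def decide_by_averaging(scores: list[float], cutoff: float = 8.0) -> bool:
    """Averaging method: accept if the mean reviewer score meets the cutoff."""
    return sum(scores) / len(scores) >= cutoff

def decide_by_voting(scores: list[float], min_votes: int, cutoff: float = 8.0) -> bool:
    """Voting method: accept if at least min_votes reviewers scored the
    manuscript at or above the cutoff."""
    return sum(1 for s in scores if s >= cutoff) >= min_votes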

Reader-reviewer group

For each simulated manuscript, the reviewer modeling process was repeated over 10, 20, 50 or 100 iterations to simulate a corresponding number of reader-reviewers. As with the expert reviewer group, both the averaging and the voting methods were used to combine the reviewer scores into a single decision to accept or reject. For the averaging method, the mean of the 10, 20, 50 or 100 reviewer scores was calculated and the manuscript accepted if the mean score was equal to or above the cutoff score of 8. For the voting method, the optimal minimum number of votes to accept was empirically determined for each group size, as with the expert reviewer group; a manuscript was accepted if it garnered at least this number of acceptable scores.
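The empirical search for the optimal vote threshold amounts to scanning every possible threshold and retaining the one that yields the highest decision accuracy. Continuing the illustrative sketch (the trial count and helper names are ours, not the original model’s):

from typing import Callable

def best_vote_threshold(n_reviewers: int,
                        score_fn: Callable[[float], float],
                        n_trials: int = 10_000) -> int:
    """Empirically find the minimum number of 'accept' votes that maximizes
    the fraction of correct final decisions for a panel of n_reviewers
    reviewers whose individual scores are drawn from score_fn."""
    best_threshold, best_accuracy = 1, 0.0
    for threshold in range(1, n_reviewers + 1):
        correct = 0
        for _ in range(n_trials):
            true_score = draw_true_score()
            scores = [score_fn(true_score) for _ in range(n_reviewers)]
            correct += decide_by_voting(scores, threshold) == is_truly_acceptable(true_score)
        accuracy = correct / n_trials
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold

# Example: best_vote_threshold(50, reader_score) should land near the value
# of 21 reported in Table 1.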

Number of iterations of the model

The above process describes the modeling of a single hypothetical manuscript. In the model, a data table was created in which this process was repeated 10,000 times; the number of iterations was constrained by the memory requirements of the spreadsheet. The spreadsheet was then recalculated 10 times for a total of 100,000 iterations, and the overall results were compiled. The percentage of correct acceptance decisions was calculated for each group, and statistical significance was determined using the binomial distribution.
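Putting the pieces of the sketch together, a single comparison run and its significance test might look like the following; the use of SciPy’s binomtest is our assumption of how a binomial-distribution test could be applied, not a description of the original spreadsheet:

from scipy.stats import binomtest  # any binomial test would serve here

def run_comparison(n_readers: int, n_iterations: int = 100_000) -> None:
    """Compare three expert reviewers (voting, with one vote required to
    accept, the optimum reported in the Results) against n_readers
    reader-reviewers (voting, empirically chosen threshold)."""
    reader_threshold = best_vote_threshold(n_readers, reader_score)
    expert_correct = reader_correct = 0
    for _ in range(n_iterations):
        true_score = draw_true_score()
        truth = is_truly_acceptable(true_score)
        expert_scores = [expert_score(true_score) for _ in range(3)]
        reader_scores = [reader_score(true_score) for _ in range(n_readers)]
        expert_correct += decide_by_voting(expert_scores, 1) == truth
        reader_correct += decide_by_voting(reader_scores, reader_threshold) == truth
    expert_accuracy = expert_correct / n_iterations
    # Test whether the reader-reviewers' count of correct decisions is
    # consistent with the expert group's accuracy rate.
    p_value = binomtest(reader_correct, n_iterations, expert_accuracy).pvalue
    print(f"{n_readers} readers: experts {expert_accuracy:.2%}, "
          f"readers {reader_correct / n_iterations:.2%}, p = {p_value:.4g}")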

Results

Expert reviewer group

Results of the model are shown in Table 1. Using the averaging method to combine reviewer scores, the expert reviewer group made a correct decision regarding manuscript acceptance 94.99 ± 0.07% of the time. For the voting method, we empirically determined the optimal value for the number of accept votes required by the reviewers by running the model with all possible values (i.e. 1, 2 or 3 votes required for manuscript acceptance) and determining which value gave the highest overall decision accuracy. When fewer votes were required, there was a greater chance of accepting a bad manuscript, but a lower chance of rejecting a good manuscript. The converse was true when more votes were required. Somewhat surprisingly, the accuracy of the expert reviewer group was optimized when the number of votes required was set at 1; that is, a manuscript was accepted when one or more of the three expert reviewers granted it a score of 8 or better. With this approach, the expert group was correct 97.67 ± 0.02% of the time. As expected, the accuracy of the three-expert-reviewer group did not vary significantly from this value in different runs of the computer model (i.e. the 10, 20, 50 and 100 reader-reviewer models).
Table 1

Percent of correct determinations by expert reviewers and reader-reviewers

No. of reader-reviewers   % Correct by three expert reviewers   % Correct by reader-reviewers
                          Averaging   Voting                    Averaging   p         Voting   p

10                        95.02       97.69                     91.51       <0.0001   93.24    <0.0001
20                        94.90       97.69                     92.14       <0.0001   95.50    <0.0001
50                        95.00       97.65                     91.80       <0.0001   97.92    <0.0001
100                       95.06       97.64                     91.22       <0.0001   99.20    <0.0001
Mean                      94.99       97.67                     91.67                 96.46
Std dev                    0.07        0.02                      0.40                  2.64

Each line describes results of 100,000 iterations of the model in which three expert reviewers were compared to 10, 20, 50 or 100 reader-reviewers. p values describe significance of the reader-reviewer score as compared to the expert reviewer score using the voting method

Reader-reviewer group

Using the averaging method, the accuracy of the reader-reviewer group remained consistent, with an overall final decision correctness of 91.67 ± 0.40%. With this method, no improvement in accuracy was noted as the reviewer group increased in size (i.e. from 10 to 20, 50 and 100 reader-reviewers). Accuracy with this method was significantly less than that obtained by the expert reviewer group (p < 0.0001).

When the voting method was used, the optimal number of required votes was determined empirically for each of the groups. For the 10-, 20-, 50- and 100-reader-reviewer groups, the optimal number of votes to accept was found to be 5, 9, 21, and 42, respectively. These values correspond to roughly 42–50% of the total number of reviewers.

With the voting method, a substantially more accurate final decision was obtained than with the averaging method. Additionally, with this approach the accuracy increased with the size of the reviewer group. While the final decision accuracy of the 10- and 20-reader-reviewer groups (93.24 and 95.50%) remained lower than that of the expert reviewer group, the accuracy of the 50- and 100-reader-reviewer groups (97.92 and 99.20%) significantly exceeded that of the expert reviewers (p < 0.0001).

Discussion

The “wisdom of crowds” was first noted by the English scientist Francis Galton [14] in a 1907 article entitled Vox Populi in which he reported the results of a contest at an English livestock show where contestants were asked to guess the weight of a publicly displayed ox. After sorting the 787 entries by weight, Galton found that the median estimate of 1,207 pounds differed from the true weight of 1,198 pounds by less than 1%. He noted in a later publication that the mean of all the entries, 1,197 pounds, differed from the true weight by only 1 pound [15]. This premise became the subject of a popular 2004 book by James Surowiecki entitled The Wisdom of Crowds [16]. While Surowiecki acknowledged that not all crowds may be wise, he suggested that if a crowd presents a diversity of independent opinions that draw on individual knowledge, this information may be aggregated into a common understanding of high accuracy. In this manner, a journal could use the collective intelligence of its readership to review its submissions in a reliable and accurate manner.

With the Internet as the facilitating mechanism, such aggregations of popular knowledge are widely used and accepted by the general public; any individual who has purchased a book or other product online has almost certainly relied upon public non-expert review (i.e. customer reviews) to inform their decision. In an analogous fashion, scientific journals could make a manuscript publicly accessible and available for review by its readership base. Such a mechanism of “reader review” is easily adaptable to the scientific journal and could enhance or perhaps ultimately replace the traditional process of peer review in which a very small number of expert reviewers and editors determine a manuscript’s merit.

Such an approach was briefly employed by Nature in 2006, when a trial of open peer review was offered to authors submitting manuscripts during a 4-month period [17]. While groundbreaking, the experiment was generally perceived as unsuccessful. Of 1,369 papers submitted for review, only 71 authors (5%) agreed to participate, and the subsequent reader feedback was of limited volume and utility. Ultimately, the editors chose not to further implement open peer review.

In a slightly different model of open submission without review, ArXiv.org has achieved considerable success. An online archive of manuscripts in physics, mathematics, computer science, quantitative biology, quantitative finance and statistics, ArXiv.org is hosted by Cornell University Library and currently contains over 650,000 manuscripts [11]. The forum is not peer-reviewed but allows manuscripts to be immediately available worldwide and is often used for public review prior to publication in traditional peer-reviewed journals.

It warrants emphasis that the model presented here is nothing more than a mathematical construct, and that any mathematical model is only as good as its assumptions. One implicit assumption of this study is that a manuscript truly has an “inherent value.” While many aspects of an article, such as its statistical methodology, may be objectively assessed, others, such as its clinical impact or interest to the readership, may be judged differently by different readers. However, this concern would further support reader review, since post-publication review would decrease the chances that a valuable finding would remain unpublished because a small group of experts did not appreciate its worth.

Another critical aspect of our model is the choice of error values for the expert reviewers and the reader-reviewers. How do we know if the error distributions in any way mirror reality? Is there any validity to the assumption that reader-reviewers are only ¼ as accurate as experts? Are they better? Are they worse? Empiric data to support these choices are not available now and probably never will be, so these assumptions must remain educated guesses. Nonetheless, the assumptions made in this model were made to give the benefit of the doubt to the established, peer-review approach. Even so, the reader-review model proved to be superior when enough reviewers participated.

John Hall [18], the editor of the ANZ Journal of Surgery, has authored a remarkable series of articles entitled “How to Dissect Surgical Journals”. In the second article of the series, “The Publishing Enterprise,” he describes the peer-review process:

Any belief that peer review is a fair and consistent process is utopian. Smith et al. sent 221 reviewers for the British Medical Journal a paper that contained 8 serious flaws [19]. The median number of flaws identified by reviewers was 2—none spotted more than 5. Jefferson et al. performed a systematic review and failed to identify convincing evidence that peer review improved the quality of manuscripts markedly, either in content or readability [1]. Nevertheless, the peer review process does tend to select the better articles for publication; and, flawed as it is, there is no better alternative.

The mathematical model presented here suggests that post-publication reader review may hold significant value. Even assuming that reader-reviewers are only one-quarter as accurate as expert peer reviewers, the potentially much larger number of reviews leads to statistical aggregation that may exceed 90% accuracy with only 10 reviewers and approaches 98% accuracy with 50 or more reviewers. Additionally, the model suggests that a simple “voting” model, in which a user casts a pro or con vote, may exceed the accuracy of a more complex system in which scores are averaged. Despite poor acceptance of the system when trialed by Nature in 2006, such an approach may merit a second look.

Footnotes
1

To generate a random number from a normal distribution with a standard deviation of n, we used the following Excel formula: NORMSINV(RAND()) * n.
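For readers working outside Excel, an equivalent draw (our assumed equivalence, not part of the original model) could be written in Python as:

import random

def normal_error(n: float) -> float:
    """Draw from a normal distribution with mean 0 and standard deviation n,
    analogous to NORMSINV(RAND()) * n."""
    return random.gauss(0.0, n)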

 

Acknowledgments

Special thanks are extended to Dr. David Urbach, MD, MSc, FACS, FRCSC, of the Departments of Surgery and Health Policy at the University of Toronto, for his assistance with details of the experimental model.

Disclosure

Dr. Herron holds stock options in Hourglass Technology.

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

1. Mount Sinai School of Medicine, New York, USA