Is expert peer review obsolete? A model suggests that post-publication reader review may exceed the accuracy of traditional peer review
- Cite this article as:
- Herron, D.M. Surg Endosc (2012) 26: 2275. doi:10.1007/s00464-012-2171-1
The peer review process is the gold standard by which academic manuscripts are vetted for publication. However, some investigators have raised concerns regarding its unopposed supremacy, including lack of expediency, susceptibility to editorial bias and statistical limitation due to the small number of reviewers used. Post-publication review—in which the article is assessed by the general readership of the journal instead of a small group of appointed reviewers—could potentially supplement or replace the peer-review process. In this study, we created a computer model to compare the traditional peer-review process to that of post-publication reader review.
We created a mathematical model of the manuscript review process. A hypothetical manuscript was randomly assigned a “true value” representing its intrinsic quality. We modeled a group of three expert peer reviewers and compared it to modeled groups of 10, 20, 50, or 100 reader-reviewers. Reader-reviewers were assumed to be less skillful at reviewing and were thus modeled to be only ¼ as accurate as expert reviewers. Percentage of correct assessments was calculated for each group.
400,000 hypothetical manuscripts were modeled. The accuracy of the reader-reviewer group was inferior to the expert reviewer group in the 10-reviewer trial (93.24% correct vs. 97.67%, p < 0.0001) and the 20-reviewer trial (95.50% correct, p < 0.0001). However, the reader-reviewer group surpassed the expert reviewer group in accuracy when 50 or 100 reader-reviewers were used (97.92 and 99.20% respectively, p < 0.0001).
In a mathematical model of the peer review process, the accuracy of public reader-reviewers can surpass that of a small group of expert reviewers if the group of public reviewers is of sufficient size. Further study will be required to determine whether the mathematical assumptions of this model are valid in actual use.
Keywords: Peer review, Expert review, Post-publication review, Computer model
Academic journals universally accept the peer review process as the gold standard by which manuscripts are evaluated for publication. The perception of the process’s reliability is so pervasive that the term “peer-reviewed” is considered to be synonymous with quality. However, in recent years some investigators have raised concerns regarding its unopposed supremacy. Foremost is the concern expressed in a 2002 Cochrane review that “editorial peer review, although widely used, is largely untested and its effects are uncertain.”
Many of the concerns regarding peer review stem from the small number of reviewers participating in the evaluation of any given manuscript. A 1981 study of 150 proposals submitted to the National Science Foundation found so much disagreement among reviewers that grant selection “depend[ed] in a large proportion of cases upon which reviewers happened to be selected for it.” A more recent study of abstract and manuscript reviewers in clinical neuroscience found that agreement among two or more reviewers regarding acceptability was no greater than that expected by chance. A similar lack of inter-referee agreement has been identified in basic science literature [5, 6]. Perhaps such lack of agreement is to be expected, since a typical journal may use only 2–3 peer reviewers per manuscript.
To address these concerns, some investigators have suggested that post-publication open review, in which the quality of an article is judged by the general readership of the journal instead of a small group of reviewers, be used to supplement or replace the peer-review process [8, 9, 10]. Such an approach to the “scoring” of materials is well accepted by the lay public, where consumers rate products, books, and media on the Internet at sites like Amazon and YouTube. Prepublication online posting is currently used in fields such as mathematics, physics and computer science through ArXiv.org, an online pre-print archive hosted by the Cornell University Library. Use of a similar system in which registered subscribers to an academic medical journal serve as volunteer reviewers could potentially improve the quality of reviews by significantly increasing the number of reviewers and thus improving the statistical validity of a given manuscript’s score. On the other hand, opening the review process to a general readership who may have little or no formal experience in manuscript evaluation could potentially lead to low-quality or biased reviews and thus substantially degrade the evaluation process.
We hypothesized that post-publication open review will equal or exceed traditional peer-review in accuracy if the number of reviewers is high enough. To evaluate this hypothesis we created a mathematical model of the review process and compared the accuracy of a small group of simulated expert peer reviewers to that of a much larger group of non-expert “reader-reviewers.”
A hypothetical manuscript was randomly assigned a “true score” of 1–10 points, representing 10 deciles of quality, with 1 being worst and 10 being best. Scores were uniformly distributed, creating roughly equal numbers of manuscripts in each score category. The model simulated a moderately selective journal that accepts 30% of submitted manuscripts. Hence, a true score of 8, 9 or 10 was required for a manuscript to be considered acceptable.
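The score assignment described above can be sketched in a few lines of Python. This is only an illustration, not the spreadsheet used in the study: it assumes the true score is an integer from 1 to 10, and the names `draw_true_score` and `is_truly_acceptable` are hypothetical.

```python
import random

ACCEPT_CUTOFF = 8  # true scores of 8, 9 or 10 count as acceptable (top 30%)

def draw_true_score(rng: random.Random) -> int:
    """Draw a true score uniformly from the integers 1..10 (assumed integer-valued)."""
    return rng.randint(1, 10)

def is_truly_acceptable(true_score: int) -> bool:
    """A manuscript deserves acceptance when its true score meets the cutoff."""
    return true_score >= ACCEPT_CUTOFF

rng = random.Random(0)  # fixed seed so the sketch is repeatable
scores = [draw_true_score(rng) for _ in range(100_000)]
accept_rate = sum(is_truly_acceptable(s) for s in scores) / len(scores)
print(f"share of truly acceptable manuscripts: {accept_rate:.3f}")
```

With a uniform draw over ten score categories, roughly three in ten simulated manuscripts should be truly acceptable, matching the 30% acceptance rate the model aims for.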
We modeled the standard deviation of imprecision error for the expert reviewer group at 0.5 points, implying that approximately 2/3 of expert reviewer scores would fall within one half of a point, or 5% (one standard deviation), of the true score. We modeled the standard deviation for other error one-half as great, at 0.25 points, so that approximately 2/3 of expert reviewers would have other error of 2.5% or less. We felt that these assumptions represented an extremely high level of accuracy for the expert reviewers, significantly higher than what might be expected in reality. This judgment is supported by Schroter et al.’s 2008 study, in which 607 reviewers at the British Medical Journal were sent 3 test papers that contained 9 major errors. On average, reviewers identified fewer than 1/3 of the errors.
In contrast, we assumed that reader-reviewers would be four times less accurate than expert reviewers with regard to both imprecision error and other error. To model this, we set the standard deviation for both imprecision error and other error 4 times higher, at 2.0 points and 1.0 points respectively. All final scores were constrained within a range of 1–10. Modeling of the reviewer scores is summarized in Fig. 2.
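A minimal sketch of a single reviewer's score under these parameters follows. The standard deviations come from the text. The way the two error terms combine is an assumption made for illustration: here they are treated as independent, zero-mean normal errors added to the true score. The function name `reviewer_score` is hypothetical.

```python
import random

# (imprecision SD, other-error SD) in points, as stated in the text
EXPERT_SD = (0.5, 0.25)
READER_SD = (2.0, 1.0)  # four times the expert values

def clamp(x: float, lo: float = 1.0, hi: float = 10.0) -> float:
    """Constrain a score to the 1-10 range."""
    return max(lo, min(hi, x))

def reviewer_score(true_score: float, sds: tuple, rng: random.Random) -> float:
    """One simulated review: true score plus two normal error terms, then clamped.

    Adding the two errors independently is an assumption of this sketch.
    """
    imprecision_sd, other_sd = sds
    noisy = true_score + rng.gauss(0, imprecision_sd) + rng.gauss(0, other_sd)
    return clamp(noisy)

rng = random.Random(1)
print("expert:", round(reviewer_score(7, EXPERT_SD, rng), 2))
print("reader:", round(reviewer_score(7, READER_SD, rng), 2))
```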
Determination of acceptance
Expert reviewer group
For each simulated manuscript, the reviewer modeling process was repeated over 10, 20, 50 or 100 iterations to simulate a corresponding number of reader-reviewers. As with the expert reviewer group, both the averaging and the voting methods were used to combine the reviewer scores into a single decision to accept or reject. For the averaging method, the mean of the 10, 20, 50 or 100 reviewer scores was calculated and the manuscript accepted if the mean score was equal to or above the cutoff score of 8. For the voting method for each group, the optimal minimal number of votes to accept was empirically determined, as with the expert reviewer group; a manuscript was accepted if it garnered at least this number of acceptable scores.
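The two ways of combining reviewer scores can be illustrated as below. The function names are hypothetical, and the ten scores are made-up values chosen to show that the two methods can disagree on the same set of reviews.

```python
from statistics import mean

CUTOFF = 8  # minimum score considered acceptable

def accept_by_average(scores: list) -> bool:
    """Averaging method: accept if the mean reviewer score reaches the cutoff."""
    return mean(scores) >= CUTOFF

def accept_by_vote(scores: list, min_votes: int) -> bool:
    """Voting method: each score at or above the cutoff is one vote to accept."""
    votes = sum(score >= CUTOFF for score in scores)
    return votes >= min_votes

# Ten illustrative reader scores (not data from the study)
scores = [9.1, 8.4, 7.9, 6.0, 8.8, 5.5, 8.2, 7.0, 9.5, 4.0]

print(accept_by_average(scores))            # False: the mean is about 7.44
print(accept_by_vote(scores, min_votes=5))  # True: five scores are 8 or higher
```

In this example a few very low scores pull the mean below the cutoff even though half of the reviewers rated the manuscript acceptable, so the voting method is less sensitive to outlying scores.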
Number of iterations of the model
The above process describes the modeling of a single hypothetical manuscript. In the model, a data table was created in which the above process was repeated 10,000 times. This number of iterations was constrained by memory requirements of the spreadsheet. The spreadsheet was then recalculated 10 times for a total of 100,000 iterations, and overall results compiled. The percentage of correct acceptance decisions was calculated for each group and statistical significance was determined using the binomial distribution.
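The text does not specify which binomial-based test was used. As one possible reading, the sketch below compares two accuracy percentages with a two-proportion z-test, which relies on the normal approximation to the binomial distribution. The sample size of 100,000 manuscripts per group is an assumption, the function name is hypothetical, and the two accuracy figures are the 10-reviewer results reported in the abstract.

```python
import math

def two_proportion_z(correct_a: int, n_a: int, correct_b: int, n_b: int):
    """Two-sided z-test for a difference between two binomial proportions."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

N = 100_000  # assumed number of simulated manuscripts per group
z, p_value = two_proportion_z(
    correct_a=round(0.9324 * N), n_a=N,  # 10 reader-reviewers: 93.24% correct
    correct_b=round(0.9767 * N), n_b=N,  # 3 expert reviewers: 97.67% correct
)
print(f"z = {z:.1f}, p < 0.0001: {p_value < 0.0001}")
```

With samples this large, even a difference of a few percentage points gives a p-value far below 0.0001, in line with the significance levels reported.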
Expert reviewer group
Table: Percent of correct determinations by expert reviewers and reader-reviewers. Columns: No. of reader-reviewers; % correct by three expert reviewers; % correct by reader-reviewers.
Using the averaging method, the accuracy of the reader-reviewer group remained consistent, with an overall final decision correctness of 91.67 ± 0.40%. With this method, no improvement in accuracy was noted as the reviewer group increased in size (i.e. from 10 to 20, 50 and 100 reader-reviewers). Accuracy with this method was significantly less than that obtained by the expert reviewer group (p < 0.0001).
When the voting method was used, the optimal number of required votes was determined empirically for each of the groups. For the 10, 20, 50 and 100-reader-reviewer groups, the optimal number of votes to accept was found to be 5, 9, 21, and 42, respectively. These values correspond to 50%, 45%, 42% and 42% of the total number of reviewers, settling at approximately 42% for the larger groups.
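One way to determine such a threshold empirically is to simulate many manuscripts, count the accept votes each receives, and then try every possible threshold. The sketch below does this under several assumptions: integer true scores, two independent normal error terms added to the true score, and continuous reviewer scores clamped to the 1-10 range. All names are hypothetical, and the result will vary somewhat with the random seed.

```python
import random

CUTOFF = 8
READER_SDS = (2.0, 1.0)  # imprecision SD and other-error SD for reader-reviewers

def reader_score(true_score: int, rng: random.Random) -> float:
    """One reader review: true score plus two normal errors, clamped to 1-10."""
    noisy = true_score + rng.gauss(0, READER_SDS[0]) + rng.gauss(0, READER_SDS[1])
    return max(1.0, min(10.0, noisy))

def best_vote_threshold(n_readers: int, n_manuscripts: int, rng: random.Random):
    """Return (threshold, accuracy) for the vote count that decides best."""
    trials = []  # one (truly acceptable?, accept votes) pair per manuscript
    for _ in range(n_manuscripts):
        true_score = rng.randint(1, 10)
        votes = sum(reader_score(true_score, rng) >= CUTOFF
                    for _ in range(n_readers))
        trials.append((true_score >= CUTOFF, votes))

    best_k, best_accuracy = 1, 0.0
    for k in range(1, n_readers + 1):
        correct = sum((votes >= k) == acceptable for acceptable, votes in trials)
        accuracy = correct / n_manuscripts
        if accuracy > best_accuracy:
            best_k, best_accuracy = k, accuracy
    return best_k, best_accuracy

rng = random.Random(42)
k, accuracy = best_vote_threshold(n_readers=10, n_manuscripts=5_000, rng=rng)
print(f"best threshold for 10 readers: {k} votes ({accuracy:.1%} correct)")
```

Reusing the same simulated reviews for every candidate threshold keeps the comparison fair, since each threshold is judged on identical data.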
With the voting method, a substantially more accurate final decision was obtained compared to the averaging method. Additionally, with this approach the accuracy was found to increase with the size of the reviewer group. While the final decision accuracy of the 10- and 20-reader-reviewer groups (93.24 and 95.50%) remained lower than that of the expert reviewer group, the accuracy of the 50- and 100-reader-reviewer groups (97.92 and 99.20%) significantly exceeded that of the expert reviewers (p < 0.0001).
The “wisdom of crowds” was first noted by the English scientist Francis Galton in a 1907 article entitled Vox Populi, in which he reported the results of a contest at an English livestock show where contestants were asked to guess the weight of a publicly displayed ox. After sorting the 787 entries by weight, Galton found that the median estimate of 1,207 pounds differed from the true weight of 1,198 pounds by less than 1%. He noted in a later publication that the mean of all the entries, 1,197 pounds, differed from the true weight by only 1 pound. This premise became the subject of a popular 2004 book by James Surowiecki entitled The Wisdom of Crowds. While Surowiecki acknowledged that not all crowds may be wise, he suggested that if a crowd presents a diversity of independent opinions that draw on individual knowledge, this information may be aggregated into a common understanding of high accuracy. In this manner, a journal could use the collective intelligence of its readership to review its submissions in a reliable and accurate manner.
With the Internet as the facilitating mechanism, such aggregations of popular knowledge are widely used and accepted by the general public; any individual who has purchased a book or other product online has almost certainly relied upon public non-expert review (i.e. customer reviews) to inform their decision. In an analogous fashion, scientific journals could make a manuscript publicly accessible and available for review by its readership base. Such a mechanism of “reader review” is easily adaptable to the scientific journal and could enhance or perhaps ultimately replace the traditional process of peer review in which a very small number of expert reviewers and editors determine a manuscript’s merit.
Such an approach was briefly employed by Nature in 2006, when a trial of open peer review was offered to authors submitting manuscripts during a 4-month period. While groundbreaking, the experiment was generally perceived as unsuccessful. Of 1,369 papers submitted for review, only 71 authors (5%) agreed to participate, and the subsequent reader feedback was of limited volume and utility. Ultimately, the editors chose not to further implement open peer review.
In a slightly different model of open submission without review, ArXiv.org has achieved considerable success. An online archive of manuscripts in physics, mathematics, computer science, quantitative biology, quantitative finance and statistics, ArXiv.org is hosted by Cornell University Library and currently contains over 650,000 manuscripts. The forum is not peer-reviewed but allows manuscripts to be immediately available worldwide and is often used for public review prior to publication in traditional peer-reviewed journals.
It warrants emphasis that the model presented here is nothing more than a mathematical construct, and that any mathematical model is only as good as its assumptions. One implicit assumption of this study is that a manuscript truly has an “inherent value.” While many aspects of an article may be objectively assessed (for example, its statistical methodologies), others such as its clinical impact or interest to the readership may be judged differently by different readers. However, this concern would further support reader-review, since post-publication review would decrease the chances that a valuable finding would remain unpublished because a small group of experts did not appreciate its worth.
Another critical aspect of our model is the choice of error values for the expert reviewers and the reader-reviewers. How do we know if the error distributions in any way mirror reality? Is there any validity to the assumption that reader-reviewers are only ¼ as accurate as experts? Are they better? Are they worse? Empiric data to support these choices are not available now and probably never will be, so these assumptions must remain educated guesses. Nonetheless, the assumptions made in this model were made to give the benefit of the doubt to the established, peer-review approach. Even so, the reader-review model proved to be superior when enough reviewers participated.
Any belief that peer review is a fair and consistent process is utopian. Smith et al. sent 221 reviewers for the British Medical Journal a paper that contained 8 serious flaws. The median number of flaws identified by reviewers was 2; none spotted more than 5. Jefferson et al. performed a systematic review and failed to identify convincing evidence that peer review improved the quality of manuscripts markedly, either in content or readability. Nevertheless, the peer review process does tend to select the better articles for publication; and, flawed as it is, there is no better alternative.
The mathematical model presented here suggests that post-publication reader review may hold significant value. Even assuming that reader-reviewers are only one-quarter as accurate as expert peer-reviewers, the potentially higher number of reviews leads to statistical aggregation whose accuracy may exceed 90% with only 10 reviewers and approach 98% with 50 reviewers or more. Additionally, the model suggests that a simple “voting” scheme in which a user casts a pro or con vote may exceed the accuracy of a more complex system in which scores are averaged. Despite poor acceptance of the system when trialed by Nature in 2006, such an approach may merit a second look.
To achieve a random number in a normal distribution with a standard deviation of n, we used the following Excel formula: NORMSINV(RAND()) * n.
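For readers working outside Excel, a rough Python equivalent of this formula is sketched below. It uses the same inverse-CDF approach: a uniform random number is passed through the inverse of the standard normal distribution and scaled by the desired standard deviation. The function name `normal_error` is hypothetical.

```python
import random
from statistics import NormalDist, fmean, stdev

def normal_error(sd: float, rng: random.Random) -> float:
    """Python analogue of NORMSINV(RAND()) * n, with sd playing the role of n."""
    u = rng.random()
    while u == 0.0:  # inv_cdf is only defined for 0 < u < 1
        u = rng.random()
    return NormalDist().inv_cdf(u) * sd

rng = random.Random(7)
sample = [normal_error(2.0, rng) for _ in range(20_000)]
print(f"mean = {fmean(sample):.3f}, sd = {stdev(sample):.3f}")
```

The same draw can be made more directly with `rng.gauss(0, sd)`; the inverse-CDF form is shown only because it mirrors the spreadsheet formula term by term.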
Special thanks are extended to Dr. David Urbach, MD, MSc, FACS, FRCSC, of the Departments of Surgery and Health Policy at the University of Toronto for his assistance with details of the experimental model.
Dr. Herron holds stock options in Hourglass Technology.